
CN117078959A - Multi-modal salient target detection method based on cross-modal uncertainty region correction - Google Patents

Multi-modal salient target detection method based on cross-modal uncertainty region correction

Info

Publication number
CN117078959A
CN117078959A CN202311053812.8A CN202311053812A CN117078959A CN 117078959 A CN117078959 A CN 117078959A CN 202311053812 A CN202311053812 A CN 202311053812A CN 117078959 A CN117078959 A CN 117078959A
Authority
CN
China
Prior art keywords
modal
cross
features
feature
correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311053812.8A
Other languages
Chinese (zh)
Inventor
金国强
秦琦
张强
宋国鹏
张振伟
沈乾坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Thermal Power Research Institute Co Ltd
Original Assignee
Xian Thermal Power Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Thermal Power Research Institute Co Ltd filed Critical Xian Thermal Power Research Institute Co Ltd
Priority to CN202311053812.8A priority Critical patent/CN117078959A/en
Publication of CN117078959A publication Critical patent/CN117078959A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/86Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal salient object detection method based on cross-modal uncertainty region correction. A cross-modal feature enhancement module uses an attention mechanism to enhance the effective information in shallow features while suppressing interference information. An uncertain-region-aware cross-modal feature correction module performs, during deep single-modal feature extraction, cross-modal correction of the uncertain regions present in the deep features of the two modalities through bidirectional interaction, yielding more discriminative single-modal features. A multi-scale cross-modal feature fusion module fuses the visible-light image features and depth image features across modalities, fully mining the multi-scale context information and complementary information of the two modalities. The final saliency map is predicted from the discriminative cross-modal features to obtain a saliency prediction map, and the network model parameters are obtained by training the saliency prediction map with a supervised learning model. A complete and fine saliency prediction map is thereby obtained.

Description

A multi-modal salient object detection method based on cross-modal uncertainty region correction

Technical Field

The invention relates to the field of image processing technology, and in particular to a multi-modal salient object detection method based on cross-modal uncertainty region correction.

Background

Salient object detection aims to mimic the human visual system by segmenting the most visually attractive objects in a scene. In recent years, salient object detection has been widely applied in many computer vision fields, such as object recognition, video segmentation, person re-identification, semantic segmentation and image quality assessment, and therefore has important research significance and broad application value.

Most early salient object detection methods for visible-light and depth images relied on hand-crafted features and various kinds of prior information. However, hand-crafted features have limited representational power and lack the guidance of deep semantic information, which makes it difficult to detect salient objects accurately; such methods eventually hit a performance bottleneck. Owing to the powerful representational capacity of deep convolutional neural networks, they have since been successfully applied to salient object detection and have developed rapidly.

Although convolutional neural networks have achieved good detection results, these methods still face challenges. During imaging, the different modalities are constrained by imaging conditions (such as low light, haze and other adverse environments) or by imaging technology (such as low-resolution cameras and interference from external equipment), so the imaging sensor of a given modality may be affected by noise and produce low-quality images. When such low-quality images are used for salient object detection, the noise they introduce inevitably reduces the discriminability of the fused features and degrades the performance of the model.

To alleviate this problem, several saliency detection methods addressing image-quality issues have been proposed. Most existing methods tackle the performance degradation caused by image quality through feature-level selection. Others approach it from image enhancement: a new depth image is estimated, features are extracted from both the estimated and the original depth images and fused, and the depth image is thereby enhanced to mitigate its possible quality problems.

However, the above methods only consider low-quality depth images; when the visible-light image itself has low visual quality, these models cannot detect and segment salient objects well.

Summary of the Invention

In view of the above problems, the present invention proposes a method based on cross-modal uncertainty region correction.

According to one aspect of the present invention, a multi-modal salient object detection method based on cross-modal uncertainty region correction is provided. The detection method includes:

the multi-modal image salient object detection method based on cross-modal uncertainty region correction considers low-quality visible-light images and low-quality depth images simultaneously;

the cross-modal feature enhancement module uses an attention mechanism to enhance the effective information in shallow features and suppress interference information;

the uncertain-region-aware cross-modal feature correction module, during single-modal feature extraction, performs cross-modal correction of the uncertain regions present in the deep features of the two modalities through bidirectional information interaction, so as to obtain more discriminative single-modal features;

the multi-scale cross-modal feature fusion module fuses the extracted visible-light image features and depth image features across modalities to fully exploit the multi-scale context information and complementary information of the two modalities;

the discriminative cross-modal fusion features are decoded level by level to predict the saliency map and obtain the final saliency prediction result;

a supervised learning model is applied to the saliency prediction map to obtain the network model parameters.

In the cross-modal feature enhancement module, shallow features mainly provide rich appearance information that helps the network refine the boundaries of salient objects. However, as noted above, the input visible-light and depth images sometimes have low visual quality, so the shallow features extracted from them contain a large amount of interference information, which further reduces the discriminability of the cross-modal fusion features. To alleviate this problem, an attention mechanism is used to enhance the effective information in the shallow features while suppressing the interference information.

In the uncertain-region-aware cross-modal feature correction module, deep features mainly provide semantic information that helps the network model locate salient objects. However, affected by the interference information in low-quality input images, some uncertain regions inevitably exist in the deep features. In these regions, the features of one modality are usually more discriminative while the features of the other modality contain interference. Therefore, through information interaction, the more discriminative features can be used to correct the features containing interference, mitigating the impact of interference on saliency prediction in the uncertain regions. The designed uncertain-region-aware cross-modal feature correction module performs cross-modal correction of the uncertain regions in the deep features of the two modalities, yielding more discriminative single-modal features and improving the accuracy of salient object detection in those regions.

In the multi-scale cross-modal feature fusion module: one of the difficulties of salient object detection is the diversity of the scale, shape and position of salient objects. Fully mining the multi-scale cross-modal complementary information between visible-light images and depth images helps to handle this diversity. The proposed multi-scale cross-modal feature fusion module is therefore used to fully exploit the multi-scale cross-modal complementary information in visible-light features and depth features.

Obtaining the network model parameters by applying a supervised learning model to the saliency prediction map specifically includes:

on the training data set, the algorithm network is trained end to end with a supervised learning model on the predicted saliency maps to obtain the network model parameters:

on the training data set, a supervised learning mechanism is used to compute the loss function L_joint between the saliency map predicted by the network model and the ground truth:

L_joint(S, G) = l_bce(S, G) + l_iou(S, G)

where l_bce and l_iou are the cross-entropy loss function and the intersection-over-union (IoU) boundary loss function, respectively; the total loss function is set as the sum of L_joint over all predicted saliency maps.

The multi-modal salient object detection method based on cross-modal uncertainty region correction provided by the present invention trains the algorithm end to end; the model parameters are obtained after training the overall saliency detection network. When training the network parameters, to avoid over-fitting on the training data set, data augmentation (horizontal flipping and random cropping) is applied to the visible-light images and depth images of the data set.

Brief Description of the Drawings

Figure 1 is a flow chart of the multi-modal salient object detection method based on cross-modal uncertainty region correction disclosed in the present invention;

Figure 2 is the algorithm network diagram of the multi-modal salient object detection method based on cross-modal uncertainty region correction proposed by the present invention;

Figure 3 is a framework diagram of the cross-modal feature enhancement module proposed by the present invention;

Figure 4 is a framework diagram of the uncertain-region cross-modal feature correction module proposed by the present invention;

Figure 5 is a framework diagram of the bidirectional cross-modal feature interaction sub-module and the feature selection sub-module proposed by the present invention;

Figure 6 is a framework diagram of the multi-scale cross-modal fusion module proposed by the present invention;

Figure 7 is a simulation diagram of the evaluation results of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be more thoroughly understood and its scope fully conveyed to those skilled in the art.

The terms "comprising" and "having" and any variations thereof in the description, claims and drawings of the present invention are intended to cover non-exclusive inclusion, for example, the inclusion of a series of steps or units.

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.

As shown in Figure 1, a multi-modal image salient object detection method based on cross-modal uncertainty region correction includes the following steps:

Specifically, the multi-modal image salient object detection network based on cross-modal uncertainty region correction contains the following key modules: a dual-branch feature extraction network, a cross-modal feature enhancement module, an uncertain-region-aware cross-modal feature correction module and a multi-scale cross-modal fusion module.

Step 1: dual-branch feature extraction network. The dual-branch feature extraction network includes a visible-light branch, a depth branch, two cross-modal feature enhancement modules and three uncertain-region-aware cross-modal feature correction modules. Both main branches use the VGG-16 network for feature extraction. In either branch, the input to an intermediate level is the output of the corresponding preceding cross-modal feature enhancement module or uncertain-region-aware cross-modal feature correction module. The output features of the five visible-light/depth VGG levels are denoted r_i and d_i (i = 1, 2, 3, 4, 5), where i is the feature level; r and d denote the visible-light modality and the depth modality, respectively. The cross-modal feature enhancement module and the uncertain-region-aware cross-modal feature correction module then produce the enhanced and corrected single-modal features used in the following steps.
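As an illustration of the dual-branch layout described above, the sketch below builds two VGG-16 encoders (one per modality) and exposes their five stage outputs r_i and d_i. The stage split, module names and the replication of the depth map to three channels are assumptions made for illustration; the patent text does not fix these implementation details.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class DualBranchVGG(nn.Module):
    """Two VGG-16 encoders that return five stage features per modality (a sketch)."""
    def __init__(self):
        super().__init__()
        self.rgb_stages = self._make_stages()
        self.depth_stages = self._make_stages()

    @staticmethod
    def _make_stages():
        feats = vgg16(weights=None).features                     # VGG-16 convolutional trunk
        cuts = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]     # assumed five-stage split
        return nn.ModuleList(nn.Sequential(*list(feats.children())[a:b]) for a, b in cuts)

    def forward(self, rgb, depth):
        r_feats, d_feats = [], []
        r, d = rgb, depth
        for rs, ds in zip(self.rgb_stages, self.depth_stages):
            r, d = rs(r), ds(d)                                  # r_i, d_i (i = 1..5)
            r_feats.append(r)
            d_feats.append(d)
            # In the full model, r and d would be replaced here by the outputs of the
            # cross-modal enhancement (levels 1-2) or correction (levels 3-5) modules.
        return r_feats, d_feats

# usage sketch
rgb = torch.randn(1, 3, 256, 256)
depth = torch.randn(1, 3, 256, 256)   # depth map replicated to 3 channels (assumption)
r_feats, d_feats = DualBranchVGG()(rgb, depth)
```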

Step 2: the cross-modal feature enhancement module is first used to enhance the effective information in the shallow features while suppressing interference information, obtaining the enhanced visible-light features and depth features. The specific process is as follows:

First, the spatial attention mechanism is used to select and enhance the two modal features in the spatial dimension; the channel attention mechanism is then applied to the enhanced single-modal features to select important features in the channel dimension, further suppressing the interference information in the single-modal features. Specifically, given the single-modal RGB features r_i and depth features d_i (i = 1, 2) of a given level, their shared spatial attention map SA_i is computed first.

In this computation, element-wise multiplication is used, and W_sa(*) denotes the spatial weight generation function.

Here, Avg(*) denotes average pooling along the channel dimension and Max(*) denotes max pooling along the channel dimension. Next, the shared spatial attention map serves as the weight of the RGB features and the depth features to select discriminative features for enhancement while suppressing interference information.

Then, channel attention is applied separately to the spatially enhanced features of the two modalities, generating channel attention maps that select important single-modal features in the channel dimension; the channel attention map of each modality is computed from its spatially enhanced feature.
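The equations of this module are not reproduced in the extracted text, so the sketch below is a minimal PyTorch version of the described procedure: a shared spatial attention map is built from the element-wise product of the two modal features via channel-wise average/max pooling, both modalities are re-weighted with it, and a per-modality channel attention is then applied. The 7x7 kernel, the sigmoid gates and the squeeze-and-excitation form of the channel attention are assumptions, not the patent's exact formulas.

```python
import torch
import torch.nn as nn

class CrossModalFeatureEnhancement(nn.Module):
    """Sketch of shallow-feature enhancement: shared spatial attention, then channel attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # spatial weight generation W_sa: channel-wise avg/max pooling -> conv -> sigmoid (assumed form)
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        # channel attention per modality (squeeze-and-excitation style, assumed form)
        self.ca_r = self._make_ca(channels, reduction)
        self.ca_d = self._make_ca(channels, reduction)

    @staticmethod
    def _make_ca(c, r):
        return nn.Sequential(nn.AdaptiveAvgPool2d(1),
                             nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                             nn.Conv2d(c // r, c, 1), nn.Sigmoid())

    def forward(self, r_i, d_i):
        fused = r_i * d_i                                    # element-wise multiplication
        avg = fused.mean(dim=1, keepdim=True)                # Avg(*): pooling along channels
        mx, _ = fused.max(dim=1, keepdim=True)               # Max(*): pooling along channels
        sa = self.spatial(torch.cat([avg, mx], dim=1))       # shared spatial attention map SA_i
        r_s, d_s = r_i * sa, d_i * sa                        # spatially enhanced features
        r_hat = r_s * self.ca_r(r_s)                         # channel-enhanced RGB feature
        d_hat = d_s * self.ca_d(d_s)                         # channel-enhanced depth feature
        return r_hat, d_hat
```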

Step 3: uncertain-region-aware cross-modal feature correction. Deep features mainly provide semantic information that helps the network model locate salient objects. However, affected by the interference information in low-quality input images, some uncertain regions inevitably exist in the deep features. In these regions, the features of one modality are usually more discriminative while the features of the other modality contain interference. Therefore, through information interaction, the more discriminative features can be used to correct the features containing interference, mitigating the impact of interference on saliency prediction in the uncertain regions. The proposed uncertain-region-aware cross-modal feature correction module consists of a bidirectional cross-modal feature interaction sub-module and a feature selection sub-module. In detail, the bidirectional cross-modal feature interaction sub-module performs cross-modal correction of the uncertain-region features present in the deep features of the two modalities, and the feature selection sub-module selects, from the corrected features, those that correct the uncertain regions better as the features of the uncertain regions.

Specifically, the proposed uncertain-region-aware cross-modal correction module first establishes the interaction between RGB features and depth features by computing the information shared by the deep RGB and depth features within different local regions together with the uncertain-region information. Then, weights for the uncertain regions of the RGB and depth features are constructed to obtain the uncertain-region features in the corresponding modality. Since what is obtained are the uncertain regions present in the deep features of both modalities and their features in the corresponding modal regions, it cannot be determined in advance which modality's uncertain-region features are more discriminative. Therefore, the proposed bidirectional cross-modal feature interaction sub-module is first used to correct the uncertain-region features of the two modalities through bidirectional interaction. Then, the designed feature selection sub-module selects the features that correct the uncertain regions better, and the original RGB features and depth features are enhanced separately. The detailed steps are as follows:

First, a feature transformation function maps the deep-level RGB features r_i and depth features d_i (i = 3, 4, 5) into the same feature space, that is:

R_i = Conv(r_i, σ_1),

D_i = Conv(d_i, σ_1),

where Conv(*, σ_1) denotes a 1×1 convolution serving as the feature transformation function with parameters σ_1, and R_i and D_i denote the mapped features.

Second, using the mapped features R_i and D_i, the regional features shared by the deep-level RGB features and depth features are computed, and from them the features of the uncertain regions of the respective modalities are derived.

Here, F_co denotes the information shared by the RGB features and the depth features; one resulting feature represents the information of the uncertain regions in the RGB features, and the other reflects the information of the uncertain regions in the depth features.
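The formulas for F_co and the two uncertainty-region features are not reproduced in the extracted text. The sketch below therefore assumes one common choice purely for illustration: the shared information is taken as the element-wise product of the mapped features, and each modality's uncertainty-region feature is its rectified residual with respect to that shared part.

```python
import torch
import torch.nn as nn

class UncertainRegionFeatures(nn.Module):
    """Sketch: map deep features to a common space and estimate shared / uncertainty-region features."""
    def __init__(self, in_channels, mid_channels):
        super().__init__()
        self.to_r = nn.Conv2d(in_channels, mid_channels, 1)   # Conv(*, sigma_1) for r_i
        self.to_d = nn.Conv2d(in_channels, mid_channels, 1)   # Conv(*, sigma_1) for d_i

    def forward(self, r_i, d_i):
        R, D = self.to_r(r_i), self.to_d(d_i)                 # mapped features R_i, D_i
        F_co = R * D                                          # shared information (assumed form)
        un_r = torch.relu(R - F_co)                           # uncertain regions in the RGB features (assumed)
        un_d = torch.relu(D - F_co)                           # uncertain regions in the depth features (assumed)
        return R, D, F_co, un_r, un_d
```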

Third, spatial attention is used to obtain the spatial weights of the uncertain regions in the deep features of each modality, and these weights are then used to select the features of the corresponding regions in the other modality.

Here, the spatial attention weight generation function described in the cross-modal feature enhancement module is applied; one term represents the discriminative features, taken from the corresponding RGB features, of the uncertain regions of the depth features, and the other term represents the discriminative features, taken from the corresponding depth features, of the uncertain regions of the RGB features. At the same time, the features of the uncertain regions within the RGB features themselves and within the depth features themselves are obtained through similar operations.

Fourth, the bidirectional cross-modal feature interaction sub-module performs cross-modal correction on the obtained uncertain-region features of the two modalities through bidirectional interaction. In this sub-module, cross-modal feature correction is mainly realized by a cross-attention operation. Specifically, taking one direction as an example: first, similarly to self-attention, 1×1 convolution and reshape operations transform the input features; the reshaped feature and its transpose are then multiplied and normalized to generate a position correlation matrix; finally, this position correlation matrix is used to correct the features of the other modality.

Here, Nor(*) normalizes the values of the correlation matrix to [0, 1], Reshape(*) converts a feature from size C_1×H×W to C_1×HW, Conv(*, σ_2) and Conv(*, σ_3) denote two 1×1 convolution layers with parameters σ_2 and σ_3, and ReLU(*) denotes the ReLU activation function. Since bidirectional information interaction is used, the same correction process is applied in the opposite direction to obtain the correspondingly corrected features.
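The cross-attention expressions themselves are not reproduced above, so the sketch below illustrates one direction of the bidirectional correction as described: 1×1 convolutions and reshaping turn the features into C_1×HW matrices, a normalized HW×HW position correlation matrix is computed, and it re-weights the uncertain-region features of the other modality. The softmax normalization and the residual ReLU output path are assumptions; a second instance with the inputs swapped gives the opposite direction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalCorrection(nn.Module):
    """Sketch of one direction of the bidirectional cross-attention correction."""
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.q = nn.Conv2d(channels, mid_channels, 1)   # Conv(*, sigma_2)
        self.k = nn.Conv2d(channels, mid_channels, 1)   # Conv(*, sigma_3)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, guide, target):
        # guide: discriminative uncertain-region features of one modality
        # target: uncertain-region features of the other modality to be corrected
        b, c, h, w = target.shape
        q = self.q(guide).flatten(2)                    # Reshape(*): C1 x H x W -> C1 x HW
        k = self.k(target).flatten(2)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # HW x HW position correlation, Nor(*)
        v = target.flatten(2)                           # B x C x HW
        corrected = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return F.relu(self.out(corrected) + target)     # residual + ReLU (assumed)

# bidirectional use: corr_ab(x_a, x_b) and corr_ba(x_b, x_a) give the two corrected variants
```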

Fifth, because the bidirectional correction is itself uncertain, i.e., it is not known which correction direction yields more discriminative features, a feature selection sub-module is designed to select among the corrected uncertain-region features after the cross-modal interactive correction. Taking the RGB feature correction as an example: first, in the proposed feature selection sub-module, the uncertain-region features corrected in the two directions are reduced in dimension with a 1×1 convolution. Second, a global average pooling operation and a global max pooling operation compress them into one-dimensional feature vectors. Finally, two cascaded fully connected layers and a Sigmoid function generate the feature weights W_i ∈ R^C, which select the more discriminative features. The feature selection sub-module thus uses the weights W_i to select the features that correct the uncertain regions more accurately.

Here, σ(*) denotes the Sigmoid function, GAP(*) the global average pooling operation, GMP(*) the global max pooling operation, and FC(*, γ_ω) a fully connected layer with its parameters γ_ω. In the same way, the corrected features of the uncertain regions in the depth features are obtained from the depth-side corrected candidates.
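A sketch of the feature selection sub-module as described: the two corrected candidates pass through 1×1 convolutions, are compressed by global average and global max pooling, and two cascaded fully connected layers with a Sigmoid produce channel weights W_i that blend the candidates. How the two pooled vectors are combined and how W_i weights the two candidates are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    """Sketch: choose between the two bidirectionally corrected uncertain-region features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.reduce_a = nn.Conv2d(channels, channels, 1)   # 1x1 convolution (channel count kept for simplicity)
        self.reduce_b = nn.Conv2d(channels, channels, 1)
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, cand_a, cand_b):
        a, b = self.reduce_a(cand_a), self.reduce_b(cand_b)
        pooled = (a + b).mean(dim=(2, 3)) + (a + b).amax(dim=(2, 3))   # GAP + GMP (assumed combination)
        w = self.fc(pooled).unsqueeze(-1).unsqueeze(-1)                # channel weights W_i in R^C
        return w * a + (1.0 - w) * b                                   # weighted selection (assumed)
```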

Therefore, the above operations yield the corrected RGB features and the corrected depth features, which are used in the subsequent fusion.

Step 4: multi-scale cross-modal fusion module. One of the difficulties of salient object detection lies in the diversity of the scale, shape and position of salient objects. Fully mining the multi-scale cross-modal complementary information between RGB images and depth images helps to handle this diversity. To this end, a multi-scale cross-modal feature fusion module is proposed to fully exploit the multi-scale cross-modal complementary information in RGB features and depth features. Considering that a multi-scale feature pooling module based on dilated (atrous) convolution can effectively extract multi-scale features, a multi-scale feature extraction module with residual connections based on dilated convolution (Re_ASPP) is built on top of ASPP and applied in the cross-modal fusion module to extract multi-scale single-modal features before cross-modal fusion. Re_ASPP uses four branches, each employing dilated convolution layers with a different dilation rate to extract single-modal features of different scales. Meanwhile, considering that the receptive fields of single-modal features at different levels differ, the dilation rates of the Re_ASPP dilated convolutions differ across levels, as shown in Table 4.1. Residual connections are used in Re_ASPP to ensure the stability of the gradient. The specific implementation of Re_ASPP is as follows:

Here, δ(*) and Cat(*) denote the ReLU activation function and the concatenation operation, respectively; the remaining terms denote the dilated convolution layers with different dilation rates and their parameters, and a convolution layer with a 1×1 kernel and its corresponding parameters.
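A minimal sketch of Re_ASPP as described: four parallel dilated-convolution branches with different dilation rates, concatenation, a 1×1 convolution, and a residual connection back to the input. The specific dilation rates (which the text says vary per level) and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ReASPP(nn.Module):
    """Sketch of the residual multi-scale dilated-convolution module (Re_ASPP)."""
    def __init__(self, channels, rates=(1, 2, 4, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)   # 1x1 fusion convolution
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        multi = torch.cat([self.relu(b(x)) for b in self.branches], dim=1)  # Cat(*) of the four scales
        return self.relu(self.fuse(multi) + x)                              # residual connection
```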

Therefore, in the multi-scale cross-modal feature fusion module, the single-modal multi-scale features are first obtained through the Re_ASPP module; the information shared by the two modalities is then mined through an element-wise multiplication operation to supplement and enhance the original single-modal features; finally, the cross-modal fusion features are obtained through a concatenation operation.

Here, Conv(*, σ_5) and Conv(*, σ_6) denote two convolution layers with parameters σ_5 and σ_6.
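The sketch below illustrates this fusion step, reusing the ReASPP sketch above: the Re_ASPP outputs of the two modalities are multiplied element-wise to mine shared information, each branch is supplemented with it, and the results are concatenated and fused. The exact way the product re-enters each branch and the form of Conv(*, σ_5) and Conv(*, σ_6) are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleCrossModalFusion(nn.Module):
    """Sketch: multiply to mine shared information, supplement both branches, concatenate and fuse."""
    def __init__(self, channels):
        super().__init__()
        self.re_aspp_r = ReASPP(channels)                 # multi-scale RGB features
        self.re_aspp_d = ReASPP(channels)                 # multi-scale depth features
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1), nn.ReLU(inplace=True),  # Conv(*, sigma_5)
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))      # Conv(*, sigma_6)

    def forward(self, r_hat, d_hat):
        mr, md = self.re_aspp_r(r_hat), self.re_aspp_d(d_hat)
        shared = mr * md                                  # common information of the two modalities
        mr, md = mr + shared, md + shared                 # supplement/enhance each branch (assumed)
        return self.fuse(torch.cat([mr, md], dim=1))      # cross-modal fusion feature F_i
```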

Finally, the cross-modal fusion features are enhanced in the channel dimension through the channel attention mechanism.

Here, W_ca(*) denotes the channel attention weight generation function described above, and the result is the output feature of the multi-scale cross-modal fusion module.

Step 5: saliency prediction. To obtain more detailed information, multi-modal features of adjacent levels are also fused for saliency prediction, so as to fully mine and exploit the complementary relationships between multi-level features and improve the performance of the salient object detection algorithm. A total of five saliency prediction results are produced in the above process.

Here, S_i denotes the saliency map generated at each level, Conv(*, θ_s) denotes a 1×1 convolution layer, and Cat(*) and Up(*) denote the channel-wise concatenation of feature maps and bilinear interpolation upsampling, respectively.
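A sketch of the level-wise prediction described above: the fused features are decoded top-down, adjacent levels are upsampled and concatenated along the channel dimension, and each level is mapped to a one-channel saliency map S_i by a 1×1 convolution and a Sigmoid. The channel bookkeeping and the coarse-to-fine order are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyDecoder(nn.Module):
    """Sketch: top-down decoding of the five fused features into five saliency maps S_i."""
    def __init__(self, channels):
        super().__init__()
        self.merge = nn.ModuleList(nn.Conv2d(channels * 2, channels, 3, padding=1) for _ in range(4))
        self.heads = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(5))   # Conv(*, theta_s)

    def forward(self, fused):                 # fused: [F_1, ..., F_5], F_5 is the deepest level
        maps, x = [], fused[-1]
        maps.append(torch.sigmoid(self.heads[4](x)))
        for i in range(3, -1, -1):            # decode level by level, coarse to fine
            x = F.interpolate(x, size=fused[i].shape[2:], mode="bilinear", align_corners=False)  # Up(*)
            x = torch.relu(self.merge[i](torch.cat([x, fused[i]], dim=1)))                       # Cat(*)
            maps.append(torch.sigmoid(self.heads[i](x)))
        return maps[::-1]                     # S_1 ... S_5
```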

On the training data set, a supervised learning mechanism is used to compute the loss function L_joint between the saliency maps predicted by the network model and the ground truth:

L_joint(S, G) = l_bce(S, G) + l_iou(S, G),

where l_bce(*) and l_iou(*) are the cross-entropy loss function and the IoU-based boundary loss function, respectively; both are computed pixel-wise between the prediction and the ground truth,

where G(m, n) ∈ {0, 1} is the label of each pixel of the ground truth, P(m, n) ∈ [0, 1] is the predicted probability of each pixel of the saliency map, and W and H denote the width and height of the input image, respectively.
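The explicit definitions of the two losses are not reproduced in the extracted text. The sketch below uses standard forms that are consistent with the variables defined above (binary cross-entropy averaged over the W×H pixels and an IoU loss built from the soft intersection and union), summed over the five predicted maps; the equal weighting of the five maps is an assumption.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred, gt, eps=1e-6):
    """L_joint(S, G) = l_bce(S, G) + l_iou(S, G); pred in [0, 1], gt in {0, 1} (float tensor)."""
    l_bce = F.binary_cross_entropy(pred, gt)                       # cross-entropy over the W x H pixels
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    l_iou = (1.0 - (inter + eps) / (union + eps)).mean()           # IoU-based boundary loss
    return l_bce + l_iou

def total_loss(preds, gt):
    """Sum of L_joint over all predicted saliency maps S_i (assumed equal weighting)."""
    return sum(joint_loss(torch.clamp(p, 0.0, 1.0), gt) for p in preds)
```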

The technical effects of the present invention are further described below in combination with simulation experiments.

1. Simulation conditions: all simulation experiments were conducted on the Ubuntu 16.04.5 operating system with an Nvidia GeForce GTX 1080Ti GPU, and were implemented with the PyTorch deep learning framework.

2. Simulation content and result analysis:

Simulation 1

The present invention and existing saliency detection methods based on visible-light and depth images were compared in saliency detection experiments on six public RGB-D saliency detection data sets: DUT-RGBD, NJU2K, NLPR, LFSD, RGBD135 and STERE; part of the experimental results are compared visually.

Compared with the prior art, the present invention detects low-visual-quality images in the input visible-light/depth image pairs better. Thanks to the cross-modal feature enhancement module, the effective information in the shallow features is enhanced and the interference information is suppressed. Thanks to the uncertain-region-aware cross-modal feature correction module, the uncertain regions present in the deep features of the two modalities are well corrected through information interaction in the bidirectional cross-modal feature interaction sub-module, yielding more discriminative single-modal features and improving the accuracy of saliency prediction in uncertain regions. In addition, thanks to the full mining and capture of the multi-scale, multi-modal complementary information in the visible-light and depth images, the two cues are fully combined and their respective advantages exploited; small objects and multiple objects in complex scenes are better segmented, and relatively complete saliency detection results are also obtained for multi-object images. The evaluation simulation results are shown in Figure 7:

Here, (a) RGB image; (b) depth image; (c) MMCI prediction; (d) TANet prediction; (e) DMRA prediction; (f) CPFP prediction; (g) ICNet prediction; (h) CPFP prediction; (i) S2MA prediction; (j) D3Net prediction; (k) A2dele prediction; (l) ASIFNet prediction; (m) SSF prediction; (n) DRLF prediction; (o) DQSD prediction; (p) CCAFNet prediction; (q) JL-DCF prediction; (r) DFMNet prediction; (s) Ours prediction; (t) ground truth. As can be seen from Figure 7, the saliency maps predicted by the present invention for RGB-D images are more complete overall and finer in detail, which fully demonstrates the effectiveness and superiority of the proposed method.

Simulation 2

The results obtained by the present invention and existing multi-modal saliency detection methods based on RGB-D images in saliency detection experiments on the six public RGB-D saliency detection data sets DUT-RGBD, NJU2K, NLPR, RGBD135, LFSD and STERE are objectively evaluated with widely accepted evaluation metrics; the evaluation results are shown in Table 1, in which each group of four rows (Fβ, Em, Sα, MAE) corresponds to one of the six data sets:

where:

Fβ denotes the maximum of the weighted harmonic mean of precision and recall;

Em combines local pixel values with the image-level mean to jointly evaluate the similarity between the prediction and the ground truth;

Sα denotes the object-aware and region-aware structural similarity between the prediction and the ground truth;

| Metric | MMCI | DMRA | D3Net | ICNet | A2dele | S2MA | JL-DCF | SSF | DQSD | CCAFNet | DFMNet | TMFNet | OURS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fβ | 0.852 | 0.886 | 0.9 | 0.891 | 0.873 | 0.889 | 0.912 | 0.896 | 0.9 | 0.91 | 0.913 | 0.882 | 0.925 |
| Em | 0.915 | 0.927 | 0.936 | 0.926 | 0.916 | 0.93 | 0.949 | 0.935 | 0.936 | 0.943 | 0.949 | 0.91 | 0.953 |
| Sα | 0.858 | 0.886 | 0.9 | 0.894 | 0.869 | 0.894 | 0.91 | 0.899 | 0.899 | 0.909 | 0.912 | 0.91 | 0.918 |
| MAE | 0.079 | 0.051 | 0.046 | 0.052 | 0.051 | 0.053 | 0.038 | 0.043 | 0.05 | 0.037 | 0.039 | 0.041 | 0.034 |
| Fβ | 0.815 | 0.88 | 0.897 | 0.908 | 0.88 | 0.902 | 0.915 | 0.896 | 0.898 | 0.908 | 0.912 | 0.867 | 0.910 |
| Em | 0.913 | 0.947 | 0.953 | 0.952 | 0.945 | 0.953 | 0.963 | 0.953 | 0.952 | 0.956 | 0.961 | 0.944 | 0.956 |
| Sα | 0.856 | 0.899 | 0.912 | 0.923 | 0.896 | 0.915 | 0.926 | 0.914 | 0.916 | 0.921 | 0.923 | 0.921 | 0.924 |
| MAE | 0.059 | 0.031 | 0.03 | 0.028 | 0.028 | 0.03 | 0.024 | 0.026 | 0.029 | 0.026 | 0.026 | 0.027 | 0.024 |
| Fβ | 0.767 | 0.898 | 0.793 | 0.85 | 0.892 | 0.901 | 0.878 | 0.924 | 0.827 | 0.913 | 0.747 | -- | 0.926 |
| Em | 0.859 | 0.933 | 0.829 | 0.899 | 0.93 | 0.937 | 0.92 | 0.951 | 0.878 | 0.943 | 0.844 | -- | 0.950 |
| Sα | 0.791 | 0.889 | 0.773 | 0.852 | 0.885 | 0.903 | 0.881 | 0.915 | 0.845 | 0.903 | 0.791 | -- | 0.917 |
| MAE | 0.113 | 0.048 | 0.098 | 0.072 | 0.042 | 0.043 | 0.055 | 0.033 | 0.072 | 0.037 | 0.092 | -- | 0.034 |
| Fβ | 0.863 | 0.857 | 0.891 | 0.898 | 0.879 | 0.882 | 0.898 | 0.89 | 0.886 | 0.887 | 0.904 | -- | 0.906 |
| Em | 0.927 | 0.916 | 0.938 | 0.942 | 0.928 | 0.932 | 0.942 | 0.936 | 0.935 | 0.934 | 0.948 | -- | 0.946 |
| Sα | 0.873 | 0.845 | 0.899 | 0.903 | 0.879 | 0.89 | 0.9 | 0.893 | 0.892 | 0.892 | 0.908 | -- | 0.904 |
| MAE | 0.068 | 0.063 | 0.046 | 0.045 | 0.045 | 0.051 | 0.042 | 0.044 | 0.051 | 0.044 | 0.04 | -- | 0.037 |
| Fβ | 0.771 | 0.856 | 0.81 | 0.871 | 0.835 | 0.835 | 0.839 | 0.866 | 0.847 | 0.832 | 0.866 | 0.846 | 0.872 |
| Em | 0.839 | 0.9 | 0.862 | 0.903 | 0.879 | 0.873 | 0.879 | 0.9 | 0.878 | 0.876 | 0.902 | 0.865 | 0.909 |
| Sα | 0.787 | 0.847 | 0.825 | 0.878 | 0.836 | 0.837 | 0.833 | 0.859 | 0.851 | 0.826 | 0.87 | 0.849 | 0.873 |
| MAE | 0.132 | 0.075 | 0.095 | 0.071 | 0.074 | 0.094 | 0.084 | 0.066 | 0.085 | 0.087 | 0.068 | 0.084 | 0.062 |
| Fβ | 0.822 | 0.888 | 0.885 | 0.913 | 0.867 | 0.935 | 0.917 | 0.883 | 0.927 | 0.937 | 0.932 | 0.892 | 0.944 |
| Em | 0.928 | 0.945 | 0.946 | 0.96 | 0.923 | 0.973 | 0.96 | 0.941 | 0.973 | 0.977 | 0.973 | 0.968 | 0.982 |
| Sα | 0.848 | 0.901 | 0.898 | 0.92 | 0.885 | 0.941 | 0.924 | 0.905 | 0.935 | 0.938 | 0.938 | 0.936 | 0.944 |
| MAE | 0.065 | 0.029 | 0.031 | 0.027 | 0.028 | 0.021 | 0.021 | 0.025 | 0.021 | 0.018 | 0.019 | 0.021 | 0.016 |

MAE denotes the mean absolute difference per pixel between the normalized prediction and the ground truth.

Higher Fβ, Em and Sα are better, and lower MAE is better. As can be seen from Table 1, the present invention achieves more accurate saliency segmentation of RGB-D images, which fully demonstrates the effectiveness and superiority of the proposed method.

The embodiments of the present invention have been described in detail above. However, the present invention is not limited to the above embodiments; various changes can be made within the scope of knowledge possessed by those of ordinary skill in the art without departing from the gist of the present invention.

The above specific embodiments further describe the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (9)

1. A multi-modal salient object detection method based on cross-modal uncertainty region correction, characterized by comprising the following steps:
based on a multi-modal salient object detection model corrected by cross-modal uncertain regions, the influence of low-quality input images on the performance of the salient object detection model is fully considered;
a cross-modal feature enhancement module enhances effective information in shallow features by using an attention mechanism and simultaneously suppresses interference information;
an uncertain-region-aware cross-modal feature correction module carries out, during deep single-modal feature extraction, cross-modal correction of the uncertain regions existing in the deep features of the two modalities in a bidirectional information interaction mode, so as to obtain more discriminative single-modal features and improve the accuracy of the model in detecting salient objects in uncertain regions;
a multi-scale cross-modal feature fusion module fully mines and captures the complementary information in the visible-light image and the depth image and the multi-scale context information of the cross-modal fusion features; a multi-scale feature extraction module with residual connections based on dilated convolution is adopted to extract multi-scale features before cross-modal fusion, and the encoded features are then decoded by a decoder;
the fusion features are decoded step by step to predict the final saliency map and obtain a saliency prediction map;
a cross-entropy loss and an IoU-based loss function are combined to further train the network and obtain a more complete salient object;
and network model parameters are obtained by applying a supervised learning model to the saliency prediction map.
2. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 1, wherein the cross-modal feature enhancement module enhances effective information in shallow features by using an attention mechanism and suppresses interference information, and specifically comprises:
first selecting and enhancing the two modal features in the spatial dimension using a spatial attention mechanism: their shared spatial attention map SA_i is calculated first, discriminative features are then enhanced according to the shared spatial attention, and the features enhanced in the spatial dimension are obtained;
then generating channel attention maps by a channel attention mechanism, realizing enhancement in the channel dimension through channel attention while suppressing the interference information contained in the channel dimension, and obtaining the finally enhanced single-modal RGB features and single-modal depth features.
3. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 1, wherein the uncertain-region-aware cross-modal feature correction module performs cross-modal correction of the uncertain regions of the deep features of the two modalities in an information interaction mode to obtain more discriminative single-modal features and improve the accuracy of salient object detection in uncertain regions.
4. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 3, comprising the following steps:
first, according to the RGB features R_i and the depth features D_i mapped into the same space, calculating the region features shared by the deep-level RGB features and the depth features, and further calculating the features of the uncertain regions of the two modalities;
then obtaining the spatial weights of the uncertain regions in the deep features of the two modalities through a spatial attention operation, using these weights to select the features of the corresponding regions in the other modality, and simultaneously obtaining the features of the uncertain regions within the RGB features and within the depth features;
then performing cross-modal correction on the obtained features by the bidirectional cross-modal feature interaction sub-module in a bidirectional interaction mode, the cross-modal feature correction being carried out through a cross-attention operation, and finally obtaining the corrected features; since bidirectional information interaction is used, the same correction process is applied in both directions to obtain the correspondingly corrected features;
and finally selecting the features output by the bidirectional cross-modal feature interaction sub-module according to the feature selection sub-module.
5. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 4, wherein the feature selection sub-module is used for selecting the features output by the bidirectional cross-modal feature interaction sub-module, and the specific steps are as follows:
first, the uncertain-region features corrected in the two directions are reduced in dimension by a 1×1 convolution; then they are respectively compressed into one-dimensional feature vectors by a global average pooling operation and a global max pooling operation; finally, feature weights W_i are generated by two cascaded fully connected layers and a Sigmoid function, the features that correct the uncertain regions more accurately are selected according to these weights, and the corrected RGB features and depth features are finally obtained.
6. the multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 1, wherein complementary information and multi-scale context information in a visible light image and a depth image are fully mined according to the multi-scale cross-modal feature fusion module, and specifically comprising:
firstly, respectively obtaining single-mode multi-scale characteristics through Re_ASPP modules, and then communicatingThe common information of the two modes is mined through multiplication operation to supplement and enhance the original single-mode characteristics, and finally the cross-mode fusion characteristic F is obtained through splicing operation i
Then enhancing the cross-modal fusion characteristics in the channel dimension through a channel attention mechanism to obtain the output characteristics of the multi-scale cross-modal fusion block
7. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 1, wherein the fused cross-modal features are fused step by step in the decoding stage to obtain the final fused features for predicting the final saliency map, and obtaining the saliency prediction map specifically comprises:
obtaining the final saliency maps S^(t) (t = 1, 2, 3, 4, 5) from the fusion features through a 1×1 convolution layer and a Sigmoid function.
8. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 1, wherein obtaining network model parameters for the saliency prediction map by using a supervised learning model specifically comprises:
on a training data set, applying a supervised learning model to the predicted saliency maps and completing the algorithm network training end to end to obtain the network model parameters:
on the training data set, adopting a supervised learning mechanism to calculate the loss function L_joint between the saliency map predicted by the network model and the ground truth:
L_joint(S, G) = l_bce(S, G) + l_iou(S, G),
wherein l_bce(*) and l_iou(*) are the cross-entropy loss function and the IoU-based loss function, respectively;
the two losses are defined pixel-wise over the prediction and the ground truth,
wherein G(m, n) ∈ {0, 1} is the label of each pixel of the ground truth; P(m, n) ∈ [0, 1] is the predicted probability of each pixel of the saliency map; W is the width of the input image and H is the height of the input image.
9. The multi-modal salient object detection method based on cross-modal uncertainty region correction according to claim 1, wherein the multi-modal image salient object detection network based on cross-modal uncertainty region correction comprises the following modules: a dual-branch feature extraction network, a cross-modal feature enhancement module, an uncertain-region-aware cross-modal feature correction module and a multi-scale cross-modal fusion module.
CN202311053812.8A 2023-08-21 2023-08-21 Multi-modal salient target detection method based on cross-modal uncertainty region correction Pending CN117078959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311053812.8A CN117078959A (en) 2023-08-21 2023-08-21 Multi-modal salient target detection method based on cross-modal uncertainty region correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311053812.8A CN117078959A (en) 2023-08-21 2023-08-21 Multi-modal salient target detection method based on cross-modal uncertainty region correction

Publications (1)

Publication Number Publication Date
CN117078959A true CN117078959A (en) 2023-11-17

Family

ID=88711106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311053812.8A Pending CN117078959A (en) 2023-08-21 2023-08-21 Multi-modal salient target detection method based on cross-modal uncertainty region correction

Country Status (1)

Country Link
CN (1) CN117078959A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118411309A (en) * 2024-05-14 2024-07-30 南京航空航天大学 A fine-grained salient object detection method based on uncertainty guided learning
CN119478558A (en) * 2025-01-14 2025-02-18 中国地质大学(武汉) SAM model-based scanning electron microscope mineral identification method and equipment


Similar Documents

Publication Publication Date Title
Shen et al. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection
CN107292912B (en) An Optical Flow Estimation Method Based on Multi-scale Correspondence Structured Learning
WO2023015743A1 (en) Lesion detection model training method, and method for recognizing lesion in image
Ye et al. DPNet: Detail-preserving network for high quality monocular depth estimation
US12205346B2 (en) System for three-dimensional geometric guided student-teacher feature matching (3DG-STFM)
US20210065393A1 (en) Method for stereo matching using end-to-end convolutional neural network
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN117078959A (en) Multi-modal salient target detection method based on cross-modal uncertainty region correction
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN113298094B (en) An RGB-T Salient Object Detection Method Based on Modality Correlation and Dual Perceptual Decoder
CN111914809B (en) Target object positioning method, image processing method, device and computer equipment
CN111833273A (en) Semantic Boundary Enhancement Based on Long-distance Dependency
CN117115786B (en) Depth estimation model training method for joint segmentation tracking and application method
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN116630850A (en) Siamese object tracking method based on multi-attention task fusion and bounding box encoding
CN114419056A (en) Gradually-refined medical image segmentation system
CN113298814A (en) Indoor scene image processing method based on progressive guidance fusion complementary network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN114693951A (en) An RGB-D Saliency Object Detection Method Based on Global Context Information Exploration
Niu et al. Boundary-aware RGBD salient object detection with cross-modal feature sampling
CN118247711A (en) Method and system for detecting small target of transducer architecture
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN116403152A (en) A Crowd Density Estimation Method Based on Spatial Context Learning Network
Kim et al. Adversarial confidence estimation networks for robust stereo matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination