CN114693939B - Method for extracting depth features of transparent object detection under complex environment - Google Patents
- Publication number: CN114693939B (application CN202210259132.0A)
- Authority: CN (China)
- Prior art keywords: feature, feature map, layer, network, sub
- Prior art date: 2022-03-16
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/253: Fusion techniques of extracted features
- G06N3/045: Combinations of networks (neural network architectures)
Description
Technical Field
The present invention belongs to the field of image depth feature extraction, and in particular relates to a method for extracting depth features for transparent object detection in complex environments.
Background Art
Transparent object detection algorithms apply to many real-life scenarios: a food-delivery robot in a hotel must detect the glass doors along its route to avoid collisions, and a domestic robot must recognize fragile transparent glassware such as wine bottles and drinking glasses. Transparent target detection algorithms therefore have high practical value.

Transparent objects are ubiquitous in the real world, for example windows, cups, and wine bottles, and their detection is useful in many practical settings. Factory logistics robots need to recognize glass doors along their routes to avoid collisions, and smart-home robots need to recognize transparent, fragile objects such as drinking glasses and wine bottles. Beyond these, transparent object detection has broad application in glassware factories, chemistry laboratories, and other scenes containing large numbers of transparent objects.

Object detection is a classic computer vision task that aims to identify the categories of targets in an image and localize their positions. It is widely used in robot navigation, autonomous driving, industrial inspection, and many other fields, and has very important practical value. In recent years, deep-learning-based object detection algorithms have improved enormously in both detection speed and detection accuracy. However, transparent objects such as glassware are more susceptible than opaque targets to background and illumination, and in real application scenarios the background is often complex and the ambient lighting variable and unfixed. Most existing detection algorithms are designed for opaque, generic targets such as vehicles and pedestrians; on transparent targets they are prone to false and missed detections, which limits their use in scenes with many transparent objects. Likewise, the feature extraction methods of current deep-learning detectors are essentially designed for generic targets. For transparent targets such as glassware, which are easily confounded with the background, these methods lack targeted design, so the extracted features cannot express the transparent-target information well, degrading the performance of the detection algorithm.
Glossary:
concat feature map group fusion operation: an important operation in network structure design, used to join features, to fuse features extracted by multiple convolutional feature extraction branches, or to fuse the information of output layers.
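As a minimal illustration (assuming PyTorch tensors; this snippet is not part of the patent), concat joins two maps of the same spatial size along the channel dimension:

```python
import torch

# two feature maps with the same spatial size: (batch, channels, height, width)
a = torch.randn(1, 64, 32, 32)
b = torch.randn(1, 128, 32, 32)

# concat fuses along the channel dimension: 64 + 128 = 192 channels
fused = torch.cat([a, b], dim=1)
print(fused.shape)  # torch.Size([1, 192, 32, 32])
```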
Summary of the Invention
To address the problem that the feature extraction networks of current mainstream deep-learning object detectors cannot extract transparent-target features well, the present invention provides a method for extracting depth features for transparent object detection in complex environments.

To achieve the above object, the technical solution of the present invention is as follows:

A method for extracting depth features for transparent object detection in a complex environment, comprising the following steps:

Step 1: extract initial features with a composite backbone network.

S1.1: the composite backbone network comprises a first sub-network and a second sub-network;
S1.2: the first sub-network comprises N serially connected levels, and the second sub-network likewise comprises N serially connected levels. The output features of level i of the first sub-network are input to level i+1 of the first sub-network, which extracts features from them and outputs the level-(i+1) output features. The output feature maps of levels i, i+1, ..., N of the first sub-network each pass through a 1x1 convolution to unify the channel count and through nearest-neighbour upsampling to unify the spatial size, and then enter the (i-1)-th first feature fusion module, which outputs an auxiliary feature map. The auxiliary feature map and the output of level i-1 of the second sub-network are added point-wise to form the input of level i of the second sub-network; level i of the second sub-network extracts features from this input to obtain the preliminary features output by level i of the second sub-network. The collection of preliminary features output by the last m levels of the second sub-network constitutes the initial features; the initial features contain m preliminary features, where 0 < m < N and i > 1.
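A minimal PyTorch sketch of this assistant-to-lead fusion, with illustrative module and variable names chosen here (none come from the patent): assistant levels i..N are reduced by 1x1 convolutions, upsampled to a common size, and summed into the auxiliary map that is added point-wise to the lead sub-network's previous-level output.

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstFeatureFusion(nn.Module):
    """Builds the auxiliary feature map from assistant-backbone levels i..N."""
    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        # one 1x1 convolution per incoming level to unify channel counts
        self.reduce = nn.ModuleList([
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list
        ])

    def forward(self, assist_feats, target_size):
        aux = 0
        for conv, feat in zip(self.reduce, assist_feats):
            x = conv(feat)
            # nearest-neighbour upsampling to the size of lead level i-1's output
            x = F.interpolate(x, size=target_size, mode="nearest")
            aux = aux + x  # point-wise accumulation over all assistant levels
        return aux

# lead-level input = previous lead-level output + auxiliary map (point-wise):
# lead_in_i = lead_out_prev + fusion(assist_feats, lead_out_prev.shape[-2:])
```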
Step 2: the receptive-field-enhanced feature pyramid module processes the initial features to obtain the output features of the second feature fusion module.

S2.1: process the initial features through the first pyramid stage and output the first-pyramid-stage feature maps;
S2.2: the first-pyramid-stage feature map is processed with receptive-field enhancement to obtain the first-pyramid-stage receptive-field-enhanced feature:

F_out2 = σ(ε_1(F_in) ⊕ ε_2(F_in) ⊕ ... ⊕ ε_n(F_in))

where F_out2 denotes the first-pyramid-stage receptive-field-enhanced feature, ε_n denotes the n-th dilated convolution with a distinct dilation rate, ⊕ denotes the concat feature map group fusion of two feature maps along the channel dimension, F_in denotes the first-pyramid-stage feature map input to the receptive-field enhancement module, and σ() denotes a 1x1 convolution;
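A hedged sketch of such a receptive-field enhancement module in PyTorch. The branch count and dilation rates here are assumptions; the formula only fixes the pattern of parallel dilated convolutions, a channel-wise concat, and a 1x1 reduction:

```python
import torch
import torch.nn as nn

class ReceptiveFieldEnhancement(nn.Module):
    """Parallel 3x3 dilated convolutions with different dilation rates,
    concatenated along channels and reduced back by a 1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 4)):  # rates are assumed
        super().__init__()
        self.branches = nn.ModuleList([
            # padding=d keeps the spatial size unchanged for a 3x3 kernel
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        # sigma(): 1x1 conv taking len(dilations)*channels back to channels
        self.fuse = nn.Conv2d(len(dilations) * channels, channels, kernel_size=1)

    def forward(self, x):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(out)
```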
S2.3: process the first-pyramid-stage feature maps through the second pyramid stage and output the second-pyramid-stage feature maps;

S2.4: input the first-pyramid-stage receptive-field-enhanced feature together with the second-pyramid-stage feature map into the second feature fusion module to obtain the output feature of the second feature fusion module:
F_out = weight * F_1 + (1 - weight) * F_2
where weight denotes the learned weight, F_1 denotes the second-pyramid-stage feature map, and F_2 denotes the first-pyramid-stage receptive-field-enhanced feature.
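A minimal sketch of this weighted fusion; per the weight definition given later (weight = sigmoid(σ(F)), with F the receptive-field-enhanced feature), the weight map comes from a 1x1 convolution and a sigmoid. The module name is illustrative:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Blends two same-shaped maps: out = w * f1 + (1 - w) * f2."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv with one output channel -> one weight per spatial location
        self.weight_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, f1, f2):
        # f1: second-pyramid-stage feature map
        # f2: receptive-field-enhanced feature (the weight is computed from it)
        w = torch.sigmoid(self.weight_conv(f2))
        return w * f1 + (1 - w) * f2
```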
As a further improvement, the preliminary features in the second sub-network are:

F_b^l = Φ_l(F_b^(l-1) + Σ_{i=l..n} ε(F_a^i))

where F_b^l is the preliminary feature of level l in the second sub-network, Φ_l denotes the feature extraction performed by level l, F_b^(l-1) denotes the preliminary feature output by level l-1 of the second sub-network, n denotes the number of levels in each sub-network of the composite backbone network, l denotes the level of the current preliminary feature, F_a^i denotes the output feature map of level i of the first sub-network that is input to the second sub-network, and ε denotes the feature fusion operation.
As a further improvement, the first pyramid stage comprises the following steps:

A1: pass every preliminary feature of the initial features through a 1x1 convolution to unify the channel count and dimensions, obtaining feature maps of uniform dimension;

A2: working from the deepest level to the shallowest, upsample the deepest (level-m) feature map by nearest-neighbour interpolation to obtain the level-m upsampled feature map, whose size equals that of the level-(m-1) feature map; add the level-m upsampled feature map and the level-(m-1) feature map point-wise to obtain a new feature map, which is retained as the new level-(m-1) feature map; then repeat the upsampling operation on the new level-(m-1) feature map, proceeding level by level until all the uniform-dimension feature maps have been processed;

A3: pass all the new feature maps produced during upsampling in step A2, together with the level-m feature map, through 3x3 convolutions to reduce the aliasing effect, and output the first-pyramid-stage feature maps.
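A compact sketch of steps A1 to A3, assuming PyTorch and pre-built convolution lists (all names are illustrative):

```python
import torch.nn.functional as F

def first_pyramid_stage(feats, lateral_convs, smooth_convs):
    """feats are ordered shallow -> deep; lateral_convs are the 1x1
    convolutions of A1, smooth_convs the 3x3 convolutions of A3."""
    # A1: unify channel counts and dimensions
    feats = [conv(f) for conv, f in zip(lateral_convs, feats)]
    # A2: from deepest to shallowest, upsample and add point-wise
    for i in range(len(feats) - 1, 0, -1):
        up = F.interpolate(feats[i], size=feats[i - 1].shape[-2:], mode="nearest")
        feats[i - 1] = feats[i - 1] + up
    # A3: 3x3 convolutions to reduce aliasing
    return [conv(f) for conv, f in zip(smooth_convs, feats)]
```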
As a further improvement, the second pyramid stage comprises the following steps:

B1: working from the shallowest level of the first-pyramid-stage feature maps to the deepest, downsample the shallowest (first-level) feature map to obtain the first-level downsampled feature map, whose size equals that of the adjacent, deeper second-level feature map; add the first-level downsampled feature map and the second-level feature map point-wise to obtain a new feature map, which is retained as the new second-level feature map;

B2: continue downsampling the new second-level feature map, proceeding level by level toward the deeper levels until all the first-pyramid-stage feature maps have been processed;

B3: pass all the new feature maps produced during downsampling in step B2, together with the first-level feature map, through 3x3 convolutions and output the second-pyramid-stage feature maps.
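A matching sketch of steps B1 to B3; nearest-neighbour resizing stands in for the downsampling operator, which the text does not fix:

```python
import torch.nn.functional as F

def second_pyramid_stage(feats, smooth_convs):
    """feats are the first-pyramid-stage outputs, ordered shallow -> deep."""
    # B1/B2: from shallowest to deepest, downsample and add point-wise
    for i in range(len(feats) - 1):
        down = F.interpolate(feats[i], size=feats[i + 1].shape[-2:], mode="nearest")
        feats[i + 1] = feats[i + 1] + down
    # B3: 3x3 convolutions on every level
    return [conv(f) for conv, f in zip(smooth_convs, feats)]
```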
As a further improvement, the ε feature fusion operation is:
ε ← upsample(f(F_a))
where upsample() is the nearest-neighbour upsampling algorithm, f() denotes a 1x1 convolution, and F_a denotes the output feature map of the first sub-network.
As a further improvement, the first sub-network and the second sub-network are both Res2Net101 networks pre-trained on the ImageNet dataset.

As a further improvement, the second feature fusion module is an AM attention module.

As a further improvement, the weight parameter weight is:

weight = sigmoid(σ(F))

where σ() is a 1x1 convolution, sigmoid() denotes the activation function, and F denotes the feature map output by the receptive-field enhancement module.

As a further improvement, N is 5.

As a further improvement, the feature extraction includes ResNet feature extraction and Res2Net feature extraction.
Advantages of the present invention:

1. By fusing contextual feature information, applying receptive-field enhancement to the first-pyramid-stage feature maps, and fusing the enhanced features with the second-pyramid-stage feature maps, the method extracts depth features of transparent targets such as glassware that are relatively robust in complex environments, which helps improve the performance of detection algorithms for such targets.

2. The receptive-field enhancement module enlarges the receptive field of the preliminary features so that the extracted features express the global rather than merely local characteristics of transparent targets, and the attention-based feature fusion module fuses them with the second-stage feature pyramid outputs to suppress interference from background and other noise, which further helps improve the performance of detection algorithms for transparent targets such as glassware.
Brief Description of the Drawings

Figure 1 is a schematic diagram of the composite backbone network;

Figure 2 is a schematic diagram of the feature fusion process of the first sub-network in the composite backbone network;

Figure 3 is a schematic diagram of the feature pyramid based on shallow-feature enhancement and receptive-field enhancement;

Figure 4 is a schematic diagram of the receptive-field enhancement module;

Figure 5 is a schematic diagram of the attention-based feature fusion module;

Figure 6 is a schematic diagram comparing input and output before and after feature extraction.
In the figures: 1: first sub-network, level 1; 2: first sub-network, level 2; 3: first sub-network, level 3; 4: first sub-network, level 4; 5: first sub-network, level 5; 6: second sub-network, level 1; 7: second sub-network, level 2; 8: second sub-network, level 3; 9: second sub-network, level 4; 10: second sub-network, level 5; 11: first feature fusion module a; 12: second feature fusion module b; 13: third feature fusion module c; 14: fourth feature fusion module d; 15: input image; A: output feature map of level 3 of the first sub-network; B: output feature map of level 4 of the first sub-network; C: output feature map of level 5 of the first sub-network; A1: level-3 feature map of the first sub-network after 1x1 convolution and nearest-neighbour upsampling; B1: level-4 feature map of the first sub-network after 1x1 convolution and nearest-neighbour upsampling; C1: level-5 feature map of the first sub-network after 1x1 convolution and nearest-neighbour upsampling; H: auxiliary feature map; 16: feature map output by the receptive-field enhancement module; 17: feature map output by the second-stage feature pyramid; 18: weight matrix learned by the network for the receptive-field enhancement module's feature map; 19: weighted output feature map of the receptive-field enhancement module; 20: weighted output feature map of the second-stage feature pyramid; 21: output feature of the second feature fusion module.
Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and embodiments.

Embodiment 1

Figures 1 to 4 illustrate the method for extracting depth features for transparent object detection in a complex environment. The specific implementation is as follows.

Part 1: the composite backbone network

Two Res2Net101 networks pre-trained on the ImageNet dataset are selected as the sub-networks of the composite backbone. One serves as the first sub-network (the assistant backbone) and the other as the second sub-network (the lead backbone). Both have their final fully connected layer removed, and the lead backbone additionally has the stem of the pre-trained weights removed, which comprises one convolutional layer and one max-pooling layer. During training, the first two levels of both the assistant backbone and the lead backbone are frozen, and the frozen parameters do not participate in training.
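A hedged sketch of this preparation, assuming a torchvision-style ResNet attribute layout (conv1/bn1, layer1..layer4, fc), which common Res2Net implementations follow; the helper name and the mapping of "the first two levels" onto attribute prefixes are assumptions:

```python
import torch.nn as nn

def prepare_sub_network(backbone: nn.Module,
                        frozen_prefixes=("conv1", "bn1", "layer1")):
    """Drop the classification head and freeze the first levels so that
    their parameters do not participate in training."""
    backbone.fc = nn.Identity()  # remove the final fully connected layer
    for name, param in backbone.named_parameters():
        if name.startswith(frozen_prefixes):  # str.startswith accepts a tuple
            param.requires_grad = False
    return backbone
```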
The output feature map of every level of the assistant backbone is transmitted to the lead backbone. For each level, every incoming assistant-backbone feature map from a deeper level first passes through a 1x1 convolution to unify the channel count and then through nearest-neighbour upsampling to unify the spatial size; all of them are then added point-wise to form the auxiliary feature map that the assistant backbone feeds to the lead backbone. Finally, the auxiliary feature map is again added point-wise to the lead backbone's current-level feature map, completing the feature fusion for one level. Features at the other levels are processed in exactly the same way; only the number of sub-feature-maps fused into the auxiliary feature map differs.
The preliminary feature of level m in the second sub-network of the composite backbone is:

F_b^m = Φ_m(F_b^(m-1) + Σ_{i=m..n} ε(F_a^i))

where Φ_m denotes the feature extraction performed by level m, F_b^(m-1) denotes the preliminary feature output by level m-1 of the second sub-network, n denotes the number of levels in each sub-network of the composite backbone network, m denotes the level of the current preliminary feature, F_a^i denotes the output feature map of level i of the first sub-network that is input to the second sub-network, and ε denotes the feature fusion operation. For the feature fusion operation:
ε ← upsample(f(F_a))
where upsample() is the nearest-neighbour upsampling algorithm and f() denotes a 1x1 convolution.
Part 2: the feature pyramid based on shallow-feature enhancement and receptive-field enhancement

This module comprises three parts: a bottom-up first-stage feature pyramid, a top-down second-stage feature pyramid, and the receptive-field enhancement module together with the attention-based feature fusion module, as shown in Figure 3, where RFM is the receptive-field enhancement module (detailed in Figure 4) and AM is the attention-based feature fusion module (detailed in Figure 5). Upsampling enlarges a feature map; downsampling shrinks it.

After the feature maps of each stage of the composite backbone are obtained, each first passes through a 1x1 convolution that unifies the channel count and reduces the dimensionality of the high-dimensional semantic feature maps. Once the dimensions are unified, the last-level feature map is upsampled to the size of the previous-level feature map, and the upsampled map is added point-wise to the previous-level map to give a new feature map; this operation is repeated on each new feature map, level by level upward, until all feature maps have been processed. Finally, every fused feature map passes through a 3x3 convolution to reduce the aliasing effect. This step fuses the feature maps of the different levels of the composite backbone, and the resulting feature maps carry richer information than the originals; they are the result of the first-stage feature pyramid.

Starting from the shallowest feature, the output of the first-stage feature pyramid is downsampled and added point-wise to the next-level feature map to give a new feature map; the process is repeated until all feature maps have been handled, and finally every fused feature map passes through a 3x3 convolution to give the output features of the second-stage feature pyramid.
For the features of each level extracted by the composite backbone, the channel counts of all feature maps are identical after the 1x1 convolution in the first-stage feature pyramid. Each feature map then passes through the receptive-field enhancement module, for whose output:

F_out2 = σ(ε_1(F_in) ⊕ ε_2(F_in) ⊕ ... ⊕ ε_n(F_in))

where F_out2 denotes the first-pyramid-stage receptive-field-enhanced feature, ε_n denotes the n-th dilated convolution with a distinct dilation rate, ⊕ denotes the concat feature map group fusion of two feature maps along the channel dimension, F_in denotes the first-pyramid-stage feature map input to the receptive-field enhancement module, and σ() denotes a 1x1 convolution; the feature map input to the σ convolution has n*channels(F_in) channels, and the output feature map has the same channel count as F_in. After the receptive-field-enhanced feature is obtained, it is input to the feature fusion module together with the second-stage feature pyramid output. The output feature of the second feature fusion module is:
F_out = weight * F_1 + (1 - weight) * F_2
where weight is the learned weight parameter, and F_1 and F_2 denote the second-stage feature pyramid output feature and the receptive-field-enhanced original feature, respectively. For the weight parameter:

weight = sigmoid(σ(F))

where σ() is a 1x1 convolution whose output channel count is 1, reducing a feature map of size c x h x w to size h x w; the sigmoid() activation function then normalizes all values into the interval [0, 1], finally yielding an h x w matrix whose entries all lie in [0, 1]. The value at each point of this matrix is the weight of the corresponding feature point in the final feature fusion.
Although the embodiments of the present invention are disclosed above, they are not limited to the applications listed in the specification and the embodiments; they can be applied in any field suited to the present invention, and those familiar with the art can readily make further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details shown and described here.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210259132.0A | 2022-03-16 | 2022-03-16 | Method for extracting depth features of transparent object detection under complex environment |

Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114693939A | 2022-07-01 |
| CN114693939B | 2024-04-30 |

Family ID: 82139663
Patent Citations (3)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN111161277A | 2019-12-12 | 2020-05-15 | A natural image matting method based on deep learning |
| CN111724345A | 2020-05-18 | 2020-09-29 | Pneumonia picture detection device and method that can adaptively adjust the size of the receptive field |
| CN113392960A | 2021-06-10 | 2021-09-14 | Target detection network and method based on mixed hole convolution pyramid |

Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3549099B1 | 2016-12-01 | 2024-08-21 | Bruker Cellular Analysis, Inc. | Automated detection and repositioning of micro-objects in microfluidic devices |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |