CN117975165B

CN117975165B - Transparent object grabbing method based on depth complement

Info

Publication number: CN117975165B
Application number: CN202410305145.6A
Authority: CN
Inventors: 陈俊洪; 黄熠恒; 刘文印; 西木
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2024-03-18
Filing date: 2024-03-18
Publication date: 2024-09-17
Anticipated expiration: 2044-03-18
Also published as: CN117975165A

Abstract

The invention provides a transparent object grabbing method based on depth complementation, which comprises the following steps: identifying transparent objects in the RGB picture, and obtaining a depth map with a mask; performing feature extraction on the RGB picture and the depth map with the mask by using a teacher network to obtain global features and local features of the transparent object; wherein, the teacher network is constructed based on a transducer network; based on the composite distillation loss function, learning global features and local features by using a student network, inputting the RGB picture and the depth map with the mask into the learned student network for association processing, and obtaining a predicted depth map; wherein, student network builds based on CNN network. According to the invention, a student network with high precision, high efficiency and high robustness is taught by using a teacher network, and a teacher network with slightly lower precision or a student network with slightly lower precision and higher efficiency is selected according to actual needs to be deployed in a real robot so as to realize high-precision grabbing.

Description

A transparent object grasping method based on depth completion

技术领域Technical Field

本发明属于透明物体抓取技术领域，尤其涉及基于一种基于深度补全的透明物体抓取方法。The present invention belongs to the technical field of transparent object grasping, and in particular relates to a transparent object grasping method based on depth completion.

背景技术Background Art

透明物体是我们日常中常见的物体，例如眼镜，玻璃杯，塑料瓶等等；除此之外，在化工行业，制造业，生物行业等也有不少透明物体的使用。然而由于透明物体的特殊性质，例如折射性和反射性，使得标准的深度相机难以捕捉其准确的深度信息，比如微软的Kinect深度相机，其可以将激光照射到物体表面上，再通过CMOS传感器捕捉点阵，从而计算出物体的形状，该方法对于非透明物体具有较好的识别效果。然而对于透明物体来说，一部分激光将会穿透透明物体并折射到背景，使得相机在捕捉时获取到背景的深度信息；一部分激光将会在物体表面进行反射，此时相机获取不到其对应深度信息；剩下部分激光打到物体表面，这部分激光将被相机捕捉并生成对应的深度。然而由于这些漂移点和丢失点导致机器人抓取框预测的准确性大大降低，从而无法进行准确的抓取。Transparent objects are common objects in our daily life, such as glasses, glasses, plastic bottles, etc. In addition, there are many transparent objects in the chemical industry, manufacturing industry, biological industry, etc. However, due to the special properties of transparent objects, such as refraction and reflection, it is difficult for standard depth cameras to capture their accurate depth information. For example, Microsoft's Kinect depth camera can irradiate the laser onto the surface of the object, and then capture the dot matrix through the CMOS sensor to calculate the shape of the object. This method has a good recognition effect for non-transparent objects. However, for transparent objects, part of the laser will penetrate the transparent object and refract to the background, so that the camera can obtain the depth information of the background when capturing; part of the laser will be reflected on the surface of the object, and the camera cannot obtain its corresponding depth information at this time; the remaining part of the laser hits the surface of the object, and this part of the laser will be captured by the camera and generate the corresponding depth. However, due to these drift points and lost points, the accuracy of the robot's grasping frame prediction is greatly reduced, making it impossible to accurately grasp.

为了捕捉透明物体的深度信息，现有的方法主要分为以下四类：In order to capture the depth information of transparent objects, existing methods are mainly divided into the following four categories:

基于多视角标注和关键点估计的方法。通过将深度相机和双目相机夹持在机器臂上，并按照给定的轨迹进行不断拍摄，从而获得多视角数据集，再利用机器学习等算法提取物体关键点进行深度补全。A method based on multi-view annotation and key point estimation. By clamping the depth camera and binocular camera on the robot arm and continuously shooting along a given trajectory, a multi-view dataset is obtained, and then machine learning and other algorithms are used to extract the key points of the object for depth completion.

基于目标分割和二维边缘分析的方法。通过透明物体二维平面信息提取透明物体的大致轮廓，并不能进一步感知透明物体的深度信息。The method based on object segmentation and two-dimensional edge analysis extracts the rough outline of the transparent object through the two-dimensional plane information of the transparent object, but cannot further perceive the depth information of the transparent object.

基于已知透明物体模型的方法。通过已知物体的三维模型来估计透明物体的几何形状，这种方法需要提前知道透明物体的三维模型，使用条件受到约束。Methods based on known transparent object models: The geometric shape of a transparent object is estimated by using a known 3D model of the object. This method requires knowing the 3D model of the transparent object in advance, and its use conditions are restricted.

基于深度学习的方法。通过训练网络从二维图像和深度图中学习映射信息，并迁移到透明物体上进行深度补全。A deep learning-based method that learns mapping information from 2D images and depth maps by training the network, and then transfers it to transparent objects for depth completion.

目前的实现方案：Current implementation:

Fang等人提出TransCG方法，将RGB图片和深度图进行拼接，然后使用多个卷积层(卷积-稠密卷积-卷积-下采样)进行下采样提取特征，然后再利用多个上采样层(卷积-稠密卷积-卷积-上采样)进行图像的复原，但是该方法统一处理两个不同类型的特征，无法提取出映射关系；此外多个稠密卷积的应用增大了计算开销。Fang et al. proposed the TransCG method, which concatenates the RGB image and the depth map, then uses multiple convolutional layers (convolution-dense convolution-convolution-downsampling) to downsample and extract features, and then uses multiple upsampling layers (convolution-dense convolution-convolution-upsampling) to restore the image. However, this method uniformly processes two different types of features and cannot extract the mapping relationship; in addition, the application of multiple dense convolutions increases the computational overhead.

Shao等人提出URCDC方法，通过将RGB图片分别输入到一个Transformer网络和CNN网络进行户外场景的深度图重建，再利用蒸馏学习让学生预测的深度图接近教师预测的深度图，从而输出学生的结果。虽然该方法使用蒸馏学习提高预测的效率，但是该方法的输入为RGB图，网络在提取特征时无法提取出两个不同特征的映射关系，削弱了学习效果；此外，该方法依赖户外场景的颜色纹理信息，对于缺少纹理的透明物体场景效果不佳。Shao et al. proposed the URCDC method, which reconstructs the depth map of outdoor scenes by inputting RGB images into a Transformer network and a CNN network respectively, and then uses distillation learning to make the depth map predicted by the student close to the depth map predicted by the teacher, thereby outputting the student's results. Although this method uses distillation learning to improve the efficiency of prediction, the input of this method is an RGB image, and the network cannot extract the mapping relationship between two different features when extracting features, which weakens the learning effect; in addition, this method relies on the color texture information of outdoor scenes, and is not effective for transparent object scenes that lack texture.

综上，现有技术的缺点：In summary, the shortcomings of the prior art are:

需要依赖人工进行数据的采集，人力成本和时间成本高；It needs to rely on manual data collection, which has high labor and time costs;

使用多个传感器采集的方法导致了硬件成本提高；The method of using multiple sensors to collect data leads to higher hardware costs;

现有深度相机硬件无法采集到准确的透明物体深度信息，采集结果中含有丢失点和漂移点，采集结果不理想；Existing depth camera hardware cannot collect accurate depth information of transparent objects. The collected results contain lost points and drift points, and the collection results are not ideal.

使用单个基于Transformer架构和稠密卷积架构的神经网络模型计算代价太高，无法部署在真实场景中。Using a single neural network model based on Transformer architecture and dense convolutional architecture is too computationally expensive to be deployed in real scenarios.

因此，提出一个能准确预测透明物体的机器人抓取框、能适应各种复杂环境和背景下的抓取框检测方案是有重大意义的。Therefore, it is of great significance to propose a robot grasping frame that can accurately predict the transparent object and can adapt to various complex environments and backgrounds.

发明内容Summary of the invention

本发明的目的在于提供一种基于深度补全的透明物体抓取方法，利用教师网络教授出一个精度高，效率高，鲁棒性高的学生网络，并根据现实需要选择高精度效率略低的教师网络或者精度略差效率更高的学生网络部署于真实机器人中实现高精度抓取。The purpose of the present invention is to provide a transparent object grasping method based on depth completion, which uses a teacher network to teach a student network with high precision, high efficiency and high robustness, and selects a teacher network with high precision and slightly lower efficiency or a student network with slightly lower precision and higher efficiency according to actual needs and deploys them in a real robot to achieve high-precision grasping.

为实现上述目的，本发明提供了一种基于深度补全的透明物体抓取方法，包括：To achieve the above object, the present invention provides a transparent object grasping method based on depth completion, comprising:

对RGB图片中的透明物体进行识别，获取带掩码的深度图；Identify transparent objects in RGB images and obtain masked depth maps;

利用教师网络对所述RGB图片和带掩码的所述深度图进行特征提取，获取透明物体的全局特征和局部特征；其中，所述教师网络基于Transformer网络构建；Using a teacher network to extract features from the RGB image and the masked depth map to obtain global features and local features of the transparent object; wherein the teacher network is constructed based on a Transformer network;

基于复合蒸馏损失函数，利用学生网络学习所述全局特征和局部特征，将所述RGB图片和带掩码的所述深度图输入至学习后的所述学生网络进行关联处理，获取预测的深度图；其中，所述学生网络基于CNN网络构建；Based on the composite distillation loss function, the global features and local features are learned using a student network, and the RGB image and the masked depth map are input into the learned student network for association processing to obtain a predicted depth map; wherein the student network is constructed based on a CNN network;

将所述预测的深度图输入GR-CNN网络，获取所述透明物体的抓取框，基于所述抓取框完成所述透明物体的抓取；Inputting the predicted depth map into the GR-CNN network, obtaining a grabbing frame of the transparent object, and completing the grabbing of the transparent object based on the grabbing frame;

对RGB图片中的透明物体进行识别，获取带掩码的深度图包括：Identify transparent objects in RGB images and obtain a masked depth map including:

使用Deeplabv3+算法识别所述RGB图片中的透明物体，提取出透明物体掩膜；Using the Deeplabv3+ algorithm to identify transparent objects in the RGB image and extract transparent object masks;

基于所述透明物体掩膜，通过位置信息映射去除深度图中对应的透明物体区域，获取带掩码的所述深度图；Based on the transparent object mask, remove the corresponding transparent object area in the depth map by mapping the position information to obtain the masked depth map;

将所述RGB图片和带掩码的所述深度图输入至学习后的所述学生网络进行关联处理包括：Inputting the RGB image and the masked depth map into the learned student network for association processing includes:

将所述RGB图片和带掩码的所述深度图分别输入至卷积层中进行特征的初步抽取，获取初步抽取后的RGB图片特征和深度图特征；Input the RGB image and the masked depth map into the convolution layer respectively for preliminary feature extraction, and obtain the RGB image features and depth map features after preliminary extraction;

对初步抽取后的深度图特征使用下采样层进行降维处理；每个降维处理后的深度图特征再进入到一个下采样层进行降维处理；The depth map features after preliminary extraction are processed by downsampling layer for dimensionality reduction; each depth map feature after dimensionality reduction is then sent to a downsampling layer for dimensionality reduction;

初步抽取后的RGB图片特征和深度图特征输入第一个一致特征关联模块，进行特征关联处理；从第二个特征关联模块开始，上一个特征关联模块处理后的输出与上一个通过下采样层进行降维处理后的深度图特征作为下一个所述一致特征关联模块的输入，依次类推，连续输入三个所述一致特征关联模块进行进一步特征关联；其中，每次一致特征关联模块输出的关联特征大小为上一个一致特征关联模块输出大小的一半，但维度增加为原来的两倍；而深度图特征通过下采样层后大小同样减少为上一层的一半，而维度增加为原来的两倍；经过四个一致特征关联模块进行特征关联处理，最终获取编码后的第二集成特征；The RGB image features and depth map features after preliminary extraction are input into the first consistent feature association module for feature association processing; starting from the second feature association module, the output processed by the previous feature association module and the depth map features after the previous dimensionality reduction processing through the downsampling layer are used as the input of the next consistent feature association module, and so on, and are continuously input into three consistent feature association modules for further feature association; wherein, the size of the associated features output by each consistent feature association module is half of the output size of the previous consistent feature association module, but the dimension is increased to twice the original; and the size of the depth map features is also reduced to half of the previous layer after passing through the downsampling layer, and the dimension is increased to twice the original; after the feature association processing is performed by four consistent feature association modules, the encoded second integrated features are finally obtained;

利用第二解码器，对所述第二集成特征进行解码，获取预测的深度图；Using a second decoder, decoding the second integrated features to obtain a predicted depth map;

将初步抽取后的RGB图片特征和深度图特征输入第一个一致特征关联模块，进行特征关联处理包括：The initially extracted RGB image features and depth map features are input into the first consistent feature association module, and feature association processing includes:

将初步抽取后的RGB图片特征和深度图特征分别输入到卷积层中进行通道数压缩；The initially extracted RGB image features and depth map features are respectively input into the convolutional layer for channel compression;

将压缩后的深度图特征与对应的未压缩RGB图片特征进行通道数上拼接后，再进行一次卷积操作，获取深度图特征的相似性权重；After concatenating the compressed depth map features with the corresponding uncompressed RGB image features in terms of the number of channels, a convolution operation is performed to obtain the similarity weight of the depth map features;

将所述深度图特征的相似性权重与未压缩的深度图特征进行按位相乘，获取可靠深度图位置特征；Multiplying the similarity weight of the depth map feature by the uncompressed depth map feature bit by bit to obtain a reliable depth map position feature;

将压缩后的RGB图片特征与对应的未压缩深度图特征进行通道数上拼接后，再进行一次卷积操作，获取RGB图片特征的相似性权重；After concatenating the compressed RGB image features with the corresponding uncompressed depth image features in terms of the number of channels, a convolution operation is performed to obtain the similarity weights of the RGB image features;

将RGB图片特征的相似性权重与未压缩的RGB图片特征进行按位相乘，获取可靠RGB图片位置特征；Multiply the similarity weight of the RGB image feature by the uncompressed RGB image feature bit by bit to obtain a reliable RGB image position feature;

通过按位相加的方式对可靠深度图位置特征和可靠RGB图片位置特征进行求和，获取最终的可靠位置特征图；The reliable depth map position features and the reliable RGB image position features are summed by bitwise addition to obtain a final reliable position feature map;

对所述最终的可靠位置特征图，使用线性整流函数ReLU，批归一化BatchNormalization和卷积进行特征关联，获取关联特征。For the final reliable position feature map, linear rectification function ReLU, batch normalization BatchNormalization and convolution are used to perform feature association to obtain associated features.

可选地，利用教师网络对所述RGB图片和带掩码的所述深度图进行特征提取包括：Optionally, using a teacher network to extract features from the RGB image and the masked depth map includes:

分别对所述RGB图片和带掩码的所述深度图进行块嵌入处理；Performing block embedding processing on the RGB picture and the masked depth map respectively;

对块嵌入处理后的深度图依次通过3个下采样层进行降维处理，每个下采样层进行降维处理后的深度图均作为一个位置关联模块的输入；The depth map after block embedding is processed by three downsampling layers in turn for dimensionality reduction. The depth map after dimensionality reduction in each downsampling layer is used as the input of a position association module.

将块嵌入处理后的RGB图片和块嵌入处理后的深度图以及块嵌入处理后的深度图经过下采样层降维后的深度图，通过位置关联模块进行编码处理，获取编码后的第一集成特征；其中，所述位置关联模块包括4个，第一个位置关联模块的输入为块嵌入处理后的RGB图片和块嵌入处理后的深度图，后续的位置关联模块的输入为前一个位置关联模块的输出以及块嵌入处理后的深度图经过对应的下采样层降维后的深度图；The RGB image after block embedding processing, the depth map after block embedding processing, and the depth map after the depth map after block embedding processing are encoded by a position association module to obtain the encoded first integrated feature; wherein the position association module includes 4, the input of the first position association module is the RGB image after block embedding processing and the depth map after block embedding processing, and the input of the subsequent position association module is the output of the previous position association module and the depth map after the depth map after block embedding processing is reduced by the corresponding downsampling layer;

利用第一解码器，对编码后的第一集成特征进行特征解码，获取所述透明物体的全局特征和局部特征。The first decoder is used to perform feature decoding on the encoded first integrated feature to obtain the global feature and the local feature of the transparent object.

可选地，将所述块嵌入处理后的RGB图片和块嵌入处理后的深度图，通过第一个位置关联模块进行编码处理包括：Optionally, encoding the block-embedded RGB picture and the block-embedded depth map through a first position association module includes:

使用层归一化对所述RGB图片和深度图进行初始化，获得RGB图片特征和深度图特征，将RGB图片特征作为检索请求Q，逐步向RGB图片特征和深度图特征进行检索得到键值元素；Initialize the RGB image and the depth map using layer normalization to obtain RGB image features and depth map features, use the RGB image features as a search request Q, and gradually search the RGB image features and the depth map features to obtain key value elements;

将所述键值元素输入至多头注意力机制模块，获取检索结果的置信度；Input the key-value elements into the multi-head attention mechanism module to obtain the confidence of the retrieval results;

基于层归一化和多层感知机，获取所述置信度中所述RGB图片特征和深度图特征之间的关联特征；Based on layer normalization and a multi-layer perceptron, obtaining correlation features between the RGB image features and the depth map features in the confidence;

对所述关联特征和深度图特征进行偏移窗口操作，获取新的检索结果置信度；Performing an offset window operation on the associated features and the depth map features to obtain a new retrieval result confidence;

基于所述新的检索结果置信度，获取编码后的所述集成特征；Based on the new retrieval result confidence, obtaining the encoded integrated features;

所述RGB图片和深度图之间的所述关联特征为：The correlation feature between the RGB picture and the depth map is:

F_corr＝MLP(LN(F_qkv))+F_qkv F _corr =MLP(LN(F _qkv ))+F _qkv

其中，F_corr是关联特征，F_qkv是多头注意力计算的结果，LN是层归一化，MLP是多层感知机。Among them, F _corr is the correlation feature, F _qkv is the result of multi-head attention calculation, LN is layer normalization, and MLP is a multi-layer perceptron.

可选地，对所述关联特征和深度图特征进行偏移窗口操作包括：Optionally, performing an offset window operation on the associated features and the depth map features includes:

将所述关联特征和深度图特征中的每个像素点往右下角移动两个像素点，然后重新对关联特征和深度图特征重新进行分块；Move each pixel in the associated features and the depth map features to the lower right corner by two pixels, and then re-block the associated features and the depth map features;

将分块后的特征作为新的检索请求Q，逐步向分块后的关联特征和深度图特征所对应的索引元素K和内容元素V进行检索得到新的键值元素；The block features are used as new search requests Q, and the index elements K and content elements V corresponding to the block-related features and depth map features are gradually searched to obtain new key-value elements;

将新的键值元素输入给多头注意力机制模块中计算出新的检索结果置信度。The new key-value elements are input into the multi-head attention mechanism module to calculate the new retrieval result confidence.

可选地，所述复合蒸馏损失函数为：Optionally, the composite distillation loss function is:

其中，为复合蒸馏损失函数为，为距离损失，为结构损失，为真实深度图与学生预测深度图的差值，为教师和学生之间预测的深度图之间的差值，λ₁、λ₂、λ₃分别为对应损失的常数权重。in, The loss function for the composite distillation is, is the distance loss, is the structural loss, is the difference between the real depth map and the student predicted depth map, is the difference between the predicted depth maps of the teacher and the student, and λ ₁ , λ ₂ , and λ ₃ are the constant weights of the corresponding losses.

本发明具有以下有益效果：The present invention has the following beneficial effects:

本发明利用教师网络对所述RGB图片和带掩码的所述深度图进行特征提取，获取透明物体的全局特征和局部特征；其中，所述教师网络基于Transformer网络构建；本发明保留了使用transformer架构提取全局和局部特征的优点，训练出一个目前领先其他所有透明物体深度补全的教师网络，并通过蒸馏学习的方式，教授出一个精度仅次于教师网络，但是速度提高两倍的学生网络。此外通过在真实环境下进行部署，有效地验证了学生网络的鲁棒性，实现高精度抓取。The present invention uses a teacher network to extract features from the RGB image and the masked depth map to obtain global and local features of transparent objects; wherein the teacher network is constructed based on the Transformer network; the present invention retains the advantages of using the transformer architecture to extract global and local features, trains a teacher network that currently leads all other transparent object depth completions, and teaches a student network with accuracy second only to the teacher network, but twice the speed, through distillation learning. In addition, by deploying in a real environment, the robustness of the student network is effectively verified, achieving high-precision capture.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

构成本申请的一部分的附图用来提供对本申请的进一步理解，本申请的示意性实施例及其说明用于解释本申请，并不构成对本申请的不当限定。在附图中：The drawings constituting a part of the present application are used to provide a further understanding of the present application. The illustrative embodiments and descriptions of the present application are used to explain the present application and do not constitute an improper limitation on the present application. In the drawings:

图1为本发明实施例的一种基于深度补全的透明物体抓取方法流程示意图；FIG1 is a schematic flow chart of a method for grasping a transparent object based on depth completion according to an embodiment of the present invention;

图2为本发明实施例的教师网络中的位置关联模块示意图；FIG2 is a schematic diagram of a location association module in a teacher network according to an embodiment of the present invention;

图3为本发明实施例的教师网络中的NeWCRFs解码模块示意图；FIG3 is a schematic diagram of a NeWCRFs decoding module in a teacher network according to an embodiment of the present invention;

图4为本发明实施例的学生网络中的一致特征关联模块示意图；FIG4 is a schematic diagram of a consistent feature association module in a student network according to an embodiment of the present invention;

图5为本发明实施例的复合蒸馏损失函数示意图；FIG5 is a schematic diagram of a composite distillation loss function according to an embodiment of the present invention;

图6为本发明实施例的8类常见透明物体示意图；FIG6 is a schematic diagram of eight types of common transparent objects according to an embodiment of the present invention;

图7为本发明实施例的机器人抓取透明物体实验示意图。FIG. 7 is a schematic diagram of an experiment of a robot grasping a transparent object according to an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

需要说明的是，在不冲突的情况下，本发明中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本申请。It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments can be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and in combination with the embodiments.

需要说明的是，在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行，并且，虽然在流程图中示出了逻辑顺序，但是在某些情况下，可以以不同于此处的顺序执行所示出或描述的步骤。It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system such as a set of computer executable instructions, and that, although a logical order is shown in the flowcharts, in some cases, the steps shown or described can be executed in an order different from that shown here.

如图1-图5所示本实施例提出了一种基于深度补全的透明物体抓取方法，通过蒸馏学习算法对已有的数据集，具体来说，在教师网络端，本实施例提出了基于Transformer架构的网络将RGB图片特征关联到深度图中，实现特征间的位置映射，得益于Transformer架构，本实施例可以提取到透明物体的全局和局部特征，并最终输出高精度的深度图用于教授学生网络；在学生网络端，本实施例为了保证效率，提出了基于CNN架构的神经网络，该网络使用了更高效的方式将RGB和深度图进行关联，并且从教师网络中学习到全局特征和局部特征，在稍微牺牲精度的情况下实现了两倍的教师网络计算速度，速度达到了48帧每秒，可实时进行计算。此外通过在机器人端的应用，有效地验证了本实施例方法的鲁棒性。As shown in Figures 1-5, this embodiment proposes a transparent object grasping method based on depth completion. Through the distillation learning algorithm for the existing data set, specifically, on the teacher network side, this embodiment proposes a network based on the Transformer architecture to associate RGB image features with the depth map to achieve position mapping between features. Thanks to the Transformer architecture, this embodiment can extract the global and local features of transparent objects, and finally output a high-precision depth map for teaching the student network; on the student network side, in order to ensure efficiency, this embodiment proposes a neural network based on the CNN architecture. The network uses a more efficient way to associate RGB and depth maps, and learns global and local features from the teacher network. It achieves twice the teacher network calculation speed at a slight sacrifice of accuracy, reaching a speed of 48 frames per second, and can be calculated in real time. In addition, through the application on the robot side, the robustness of the method of this embodiment is effectively verified.

本实施例所提出的一种基于深度补全的透明物体抓取方法，具体包括：A transparent object grasping method based on depth completion proposed in this embodiment specifically includes:

进一步地，获取带掩码的深度图包括：Furthermore, obtaining a masked depth map includes:

使用Deeplabv3算法识别RGB图片中的透明物体，提取出透明物体掩膜；Use the Deeplabv3 algorithm to identify transparent objects in RGB images and extract transparent object masks;

基于透明物体掩膜，通过位置信息映射去除深度图中对应的透明物体区域，获取带掩码的深度图。Based on the transparent object mask, the corresponding transparent object area in the depth map is removed through position information mapping to obtain a masked depth map.

具体地，在本实施例中，首先使用Deeplabv3算法识别RGB图片中的透明物体，并提取出透明物体掩膜Mask出来，然后通过位置信息映射挖掉深度图中相对应的透明物体区域，以防这些不确定点影响模型学习RGB和深度图之间的映射关系，此时得到带掩码的深度图Mask Depth。Specifically, in this embodiment, the Deeplabv3 algorithm is first used to identify transparent objects in the RGB image, and the transparent object mask Mask is extracted. Then, the corresponding transparent object area in the depth map is dug out through position information mapping to prevent these uncertain points from affecting the model learning the mapping relationship between RGB and depth map. At this time, a masked depth map Mask Depth is obtained.

利用教师网络对RGB图片和带掩码的深度图进行特征提取，获取透明物体的全局特征和局部特征；其中，教师网络基于Transformer网络构建；The teacher network is used to extract features from RGB images and masked depth images to obtain global and local features of transparent objects. The teacher network is built based on the Transformer network.

进一步地，利用教师网络对RGB图片和带掩码的深度图进行特征提取包括：Furthermore, the teacher network is used to extract features from the RGB image and the masked depth map, including:

将所述块嵌入处理后的RGB图片和块嵌入处理后的深度图，通过第一个位置关联模块进行编码处理包括：The block is embedded in the processed RGB picture and the block is embedded in the processed depth map, and encoding is performed by a first position association module, including:

具体地，在本实施例中，首先对RGB图和带掩码的深度图Mask Depth进行块嵌入Patch Embedding处理，将图像进行分块方便提取精细的局部特征。对于带掩码的深度图，本实施例采用普通的下采样层进行降维；而对于RGB图，本实施例提出了位置关联模块position correlation block(PCB)，具体来说，本实施例首先使用层归一化对RGB图和深度图进行初始化，然后将RGB特征作为检索请求Q，分别向RGB图和深度图检索相对应的键值对K和V，本实施例将这三个元素一起输入给多头注意力机制模块中计算出检索结果的置信度，该置信度包含了RGB图和深度图之间的关联特征，紧接着本实施例使用了一个层归一化和多层感知机计算出关联矩阵，公式表示如下：Specifically, in this embodiment, the RGB image and the masked depth image Mask Depth are first subjected to Patch Embedding processing, and the image is divided into blocks to facilitate the extraction of fine local features. For the masked depth image, this embodiment uses a common downsampling layer for dimensionality reduction; and for the RGB image, this embodiment proposes a position correlation block (PCB). Specifically, this embodiment first uses layer normalization to initialize the RGB image and the depth image, and then uses the RGB feature as the retrieval request Q to retrieve the corresponding key-value pairs K and V from the RGB image and the depth image respectively. This embodiment inputs these three elements together into the multi-head attention mechanism module to calculate the confidence of the retrieval result, which includes the correlation features between the RGB image and the depth image. Then, this embodiment uses a layer normalization and a multi-layer perceptron to calculate the correlation matrix, which is expressed as follows:

Q＝x_i,K＝x_i,V＝x_d Q＝ _xi ,K＝ _xi ,V＝ _xd

F_qkv＝SoftMax(Q·K^T+B)·V+IF _qkv =SoftMax(Q·K ^T +B)·V+I

F_corr＝MLP(LN(F_qkv))+E_qkv F _corr =MLP(LN(F _qkv ))+E _qkv

其中I是原始图像特征，LN是层归一化，MLP是多层感知机，x_i是RGB图像块特征，x_d是深度图块特征，F_qkv是多头注意力计算的结果，F_corr是最后的关联矩阵特征。Where I is the original image feature, LN is layer normalization, MLP is a multi-layer perceptron, _xi is the RGB image block feature, _xd is the depth block feature, _Fqkv is the result of multi-head attention calculation, and _Fcorr is the final correlation matrix feature.

但是由于该特征只考虑了局部块的关联结果，没有包括全局特征，因此在补全时产生的总体效果较差。对此本实施例引入了Swin-Transformer的偏移窗口操作ShiftedWindows，具体来说，通过将所有特征结果中的每个像素点往右下角移动两个像素点，然后重新对特征结果重新进行分块，此时由于偏移操作，现有块将含有原来相邻区域的像素特征，紧接着本实施例使用上述同样的做法，将偏移后特征作为检索请求Q，分别向特征图和偏移处理后的深度图检索相对应的键值对K和V，并将这三个元素一起输入给多头注意力机制模块中计算出新的检索结果置信度，此时本实施例的特征包含了局部和全局的关联信息。值得一提的是，每次位置关联模块处理之后，特征大小维度会变为原来的一半，而通道数增加为原来的一倍。为了提取多尺度特征，本实施例叠加了4个位置关联模块，并获得编码后的集成特征Integrated Feature。However, since this feature only considers the association results of local blocks and does not include global features, the overall effect produced during completion is poor. To this end, this embodiment introduces the shifted window operation ShiftedWindows of Swin-Transformer. Specifically, by moving each pixel in all feature results to the lower right corner by two pixels, and then re-blocking the feature results, due to the offset operation, the existing block will contain the pixel features of the original adjacent area. Then this embodiment uses the same approach as above, and uses the offset feature as the retrieval request Q to retrieve the corresponding key-value pairs K and V from the feature map and the depth map after offset processing, and input these three elements together into the multi-head attention mechanism module to calculate the new retrieval result confidence. At this time, the features of this embodiment contain local and global association information. It is worth mentioning that after each position association module processing, the feature size dimension will become half of the original, and the number of channels will increase to twice the original. In order to extract multi-scale features, this embodiment superimposes 4 position association modules and obtains the encoded integrated feature Integrated Feature.

对于解码器，本实施例使用了基于神经窗口全连接的条件随机场网络NeWCRFs(Neural Window Fully-connected CRFs)的解码器进行特征解码，值得一提的是在注意力机制计算时，该模块使用了与上文相同的RGB关联深度图的做法进行特征关联，即使用RGB图片特征作为检索请求Q，分别向RGB图片特征和深度图特征检索相对应的键值对K和V，只不过此时使用的是集成特征作为Q进行检索请求，并在集成特征Q和编码特征V中查找相当于的键值对K和V。最后在叠加了4层解码层后，本实施例获得了教师网络端的预测深度图。For the decoder, this embodiment uses a decoder based on the neural window fully connected conditional random field network NeWCRFs (Neural Window Fully-connected CRFs) for feature decoding. It is worth mentioning that when calculating the attention mechanism, the module uses the same RGB-associated depth map method as above for feature association, that is, using the RGB image feature as the retrieval request Q, and retrieving the corresponding key-value pairs K and V from the RGB image feature and the depth map feature respectively, but this time the integrated feature is used as Q for the retrieval request, and the equivalent key-value pairs K and V are searched in the integrated feature Q and the encoded feature V. Finally, after superimposing 4 layers of decoding layers, this embodiment obtains the predicted depth map of the teacher network end.

基于复合蒸馏损失函数，利用学生网络学习全局特征和局部特征，将RGB图片和带掩码的深度图输入至学习后的学生网络进行关联处理，获取预测的深度图；其中，学生网络基于CNN网络构建。Based on the composite distillation loss function, the student network is used to learn global features and local features, and the RGB image and the masked depth map are input into the learned student network for association processing to obtain the predicted depth map; among them, the student network is built based on the CNN network.

进一步地，将RGB图片和带掩码的深度图输入至学习后的学生网络进行关联处理包括：Furthermore, the RGB image and the masked depth map are input into the learned student network for association processing including:

具体地，在本实施例中，首先将RGB图和带掩码的深度图Mask Depth分别输入给卷积网络中提取初级特征，对于深度图，本实施例使用普通的下采样层进行降维；对于RGB图，本实施例提出了一致特征关联模块consistent feature correlation module(CFCM)，具体来说，本实施例首先将RGB图片和带掩码的深度图分别输入至卷积层中进行特征的初步抽取，获取初步抽取后的RGB图片特征和深度图特征；再对初步抽取后的深度图特征使用下采样层进行降维处理；每个降维处理后的深度图特征再进入到一个下采样层进行降维处理；然后将初步抽取后的RGB图片特征和深度图特征输入第一个一致特征关联模块，进行特征关联处理；从第二个特征关联模块开始，上一个特征关联模块处理后的输出与上一个通过下采样层进行降维处理后的深度图特征作为下一个一致特征关联模块的输入，依次类推，连续输入三个一致特征关联模块进行进一步特征关联；其中，每次一致特征关联模块输出的关联特征大小为上一个一致特征关联模块输出大小的一半，但维度增加为原来的两倍；而深度图特征通过下采样层后大小同样减少为上一层的一半，而维度增加为原来的两倍；经过四个一致特征关联模块进行特征关联处理，最终获取编码后的第二集成特征；最后利用第二解码器，对第二集成特征进行解码，获取预测的深度图。Specifically, in this embodiment, the RGB image and the masked depth image Mask Depth are first input into the convolutional network to extract primary features. For the depth image, this embodiment uses an ordinary downsampling layer for dimensionality reduction. For the RGB image, this embodiment proposes a consistent feature correlation module (CFCM). Specifically, this embodiment first inputs the RGB image and the masked depth image into the convolutional layer for preliminary feature extraction to obtain the RGB image features and depth image features after preliminary extraction. The downsampling layer is then used to perform dimensionality reduction processing on the depth image features after preliminary extraction. Each depth image feature after dimensionality reduction processing enters a downsampling layer for dimensionality reduction processing. Then the RGB image features and depth image features after preliminary extraction are input into the first consistent feature correlation module for feature correlation processing. Starting from the second feature correlation module, the output after processing by the previous feature correlation module is combined with the output of the previous feature correlation module through the downsampling layer. The depth map features after dimensionality reduction processing are used as the input of the next consistent feature association module, and so on, and are continuously input into three consistent feature association modules for further feature association; wherein, the size of the associated features output by each consistent feature association module is half of the output size of the previous consistent feature association module, but the dimension is increased to twice the original; and the size of the depth map features after passing through the downsampling layer is also reduced to half of the previous layer, while the dimension is increased to twice the original; after feature association processing is performed through four consistent feature association modules, the encoded second integrated features are finally obtained; finally, the second decoder is used to decode the second integrated features to obtain the predicted depth map.

具体地，将初步抽取后的RGB图片特征和深度图特征输入第一个一致特征关联模块，进行特征关联处理包括：Specifically, the initially extracted RGB image features and depth map features are input into the first consistent feature association module, and feature association processing includes:

将初步抽取后的RGB图片特征和深度图特征分别输入到卷积层中进行通道数压缩；将压缩后的深度图特征与对应的未压缩RGB图片特征进行通道数上拼接后，再进行一次卷积操作，获取深度图特征的相似性权重；将深度图特征的相似性权重与未压缩的深度图特征进行按位相乘，获取可靠深度图位置特征；将压缩后的RGB图片特征与对应的未压缩深度图特征进行通道数上拼接后，再进行一次卷积操作，获取RGB图片特征的相似性权重；该权重反映了两个不同特征之间的相似程度，相似度越高意味着关联程度越高，也意味着其映射可靠性越高；将RGB图片特征的相似性权重与未压缩的RGB图片特征进行按位相乘，获取可靠RGB图片位置特征；通过按位相加的方式对可靠深度图位置特征和可靠RGB图片位置特征进行求和，获取最终的可靠位置特征图；对最终的可靠位置特征图，使用线性整流函数ReLU，批归一化Batch Normalization和卷积进行特征关联，获取关联特征。通过堆叠四个一致特征关联模块，本实施例可以得到编码后的集成特征Integrated Feature。The RGB image features and depth map features after preliminary extraction are respectively input into the convolution layer for channel number compression; the compressed depth map features are spliced with the corresponding uncompressed RGB image features in terms of the number of channels, and then a convolution operation is performed to obtain the similarity weight of the depth map features; the similarity weight of the depth map features is bitwise multiplied with the uncompressed depth map features to obtain the reliable depth map position features; the compressed RGB image features are spliced with the corresponding uncompressed depth map features in terms of the number of channels, and then a convolution operation is performed to obtain the similarity weight of the RGB image features; the weight reflects the degree of similarity between two different features, and the higher the similarity means the higher the degree of association, which also means the higher the mapping reliability; the similarity weight of the RGB image features is bitwise multiplied with the uncompressed RGB image features to obtain the reliable RGB image position features; the reliable depth map position features and the reliable RGB image position features are summed by bitwise addition to obtain the final reliable position feature map; for the final reliable position feature map, the linear rectification function ReLU, batch normalization and convolution are used to perform feature association to obtain the associated features. By stacking four consistent feature association modules, this embodiment can obtain the encoded integrated feature.

对于解码器，本实施例使用常规的上采样模块进行深度复原，具体来说，该上采样模块包括一个卷积层和一个上采样层，通过将集成特征上采样后与编码特征进行拼接，然后输出给下一层进行处理。通过堆叠四组上采样模块，本实施例可以得到学生网络预测的深度图。For the decoder, this embodiment uses a conventional upsampling module for depth restoration. Specifically, the upsampling module includes a convolution layer and an upsampling layer. The integrated features are upsampled and concatenated with the encoded features, and then output to the next layer for processing. By stacking four groups of upsampling modules, this embodiment can obtain the depth map predicted by the student network.

为了防止学生网络只学习到教师网络的区域特征，本实施例设计了复合蒸馏损失函数来保证学习的效果，其中包括距离损失，结构损失和边缘损失，三个损失函数定义如下：In order to prevent the student network from learning only the regional features of the teacher network, this embodiment designs a composite distillation loss function to ensure the learning effect, which includes distance loss, structure loss and edge loss. The three loss functions are defined as follows:

距离损失：本实施例首先使用了尺度不变性Log损失函数作为预测点之间的误差计算，其公式如下：Distance loss: This embodiment first uses the scale-invariant Log loss function as the error calculation between prediction points, and its formula is as follows:

其中△E_gt表示学生网络和教师网络之间的对数损失，△E_p表示学生网络和教师网络之间的对数损失，公式如下：Where △E _gt represents the logarithmic loss between the student network and the teacher network, and △E _p represents the logarithmic loss between the student network and the teacher network. The formula is as follows:

ΔE_p＝logD_s-logD_t,ΔE_gt＝logD_s-logD_gt ΔE _p =logD _s -logD _t ,ΔE _gt =logD _s -logD _gt

α和β为缩放常数，λ为方差最小化因子。这三个常数分别取3,7和0.85。α and β are scaling constants, and λ is the variance minimization factor. These three constants are 3, 7, and 0.85 respectively.

结构损失：本实施例使用了结构相似性structural similarity index measure(SSIM)中的结构参数C来衡量结构损失，其公式如下：Structural loss: This embodiment uses the structural parameter C in the structural similarity index measure (SSIM) to measure the structural loss, and the formula is as follows:

其中x和y指代两组深度图，σ_x表示深度图的方差，σ_x,y表示协方差，θ为常数防止分母项为0。在结构损失里，本实施例考虑两组结构差异，一是真实深度图和教师预测的深度图之间的差异，另一组是教师预测的深度图和学生预测的深度图之间的差异，最后本实施例将这两组差异用均方误差Mean squared error(MSE)进行计算，公式如下：Where x and y refer to two sets of depth maps, σ _x represents the variance of the depth map, σ _x,y represents the covariance, and θ is a constant to prevent the denominator from being 0. In the structural loss, this embodiment considers two sets of structural differences, one is the difference between the real depth map and the depth map predicted by the teacher, and the other is the difference between the depth map predicted by the teacher and the depth map predicted by the student. Finally, this embodiment calculates the two sets of differences using the mean squared error (MSE), and the formula is as follows:

其中MSE为均方误差。Where MSE is the mean square error.

边缘损失：由于在物体边缘区域补全会产生较大的深度值变化，主要是深度值落在物体表面和背景之间的差异，因此本实施例设计了边缘损失，该损失主要分别计算横向和纵向物体边缘位置的差值。其中横向差值定义如下：Edge loss: Since the completion of the edge area of the object will produce a large change in depth value, mainly due to the difference between the depth value of the object surface and the background, this embodiment designs an edge loss, which mainly calculates the difference in the horizontal and vertical object edge positions. The definition is as follows:

其中表示深度图r中的坐标点(x,y)，M_x表示横向透明区域掩膜，⊙表示按位相乘。同样地，纵向差值定义如下：in represents the coordinate point (x, y) in the depth map r, M _x represents the horizontal transparent area mask, and ⊙ represents bitwise multiplication. Similarly, the vertical difference The definition is as follows:

其中M_y为纵向透明区域掩膜。本实施例在此分别计算两组差值，一是真实深度图与学生预测深度图的差值：Where _My is the vertical transparent area mask. In this embodiment, two sets of difference values are calculated respectively, one is the difference between the real depth map and the student predicted depth map:

另一组是老师和学生之间预测的深度图之间的差值：The other set is the difference between the predicted depth maps between the teacher and the student:

最后本实施例的蒸馏损失函数定义如下：Finally, the distillation loss function of this embodiment is defined as follows:

其中λ₁，λ₂，λ₃分别设置为0.1,0.3,和0.7。Where λ ₁ , λ ₂ , and λ ₃ are set to 0.1, 0.3, and 0.7, respectively.

通过蒸馏方式，本实施例可以实现教师网络的知识完整传递给学生网络，并教授出一个精度高，速度高，鲁棒性好的学生网络进行透明物体深度补全。Through the distillation method, this embodiment can achieve the complete transfer of knowledge from the teacher network to the student network, and teach a student network with high accuracy, high speed and good robustness to perform depth completion of transparent objects.

通过对网络的训练，教师网络和学生网络可分别输出完整的深度图，其中本实施例使用逐点绝对值误差MAE(mean absolute error)，逐点均方根误差RMSE(mean absoluteerror)，绝对值相对误差REL(absolute relative difference)和预测深度图与真实深度图的精度比δ对预测结果进行评估，精度比阈值采用常见的1.05，1.10和1.25三个比率进行衡量，即预测深度图和真实深度图的误差小于1.05，1.10和1.12即认为预测准确，预测分值越高表示预测效果越好。By training the network, the teacher network and the student network can output complete depth maps respectively. In this embodiment, the prediction results are evaluated using point-by-point absolute error MAE (mean absolute error), point-by-point root mean square error RMSE (mean absolute error), absolute relative error REL (absolute relative difference) and the accuracy ratio δ of the predicted depth map to the true depth map. The accuracy ratio threshold is measured using the common three ratios of 1.05, 1.10 and 1.25, that is, if the error between the predicted depth map and the true depth map is less than 1.05, 1.10 and 1.12, the prediction is considered to be accurate, and the higher the prediction score, the better the prediction effect.

教师网络在合成物体的补全精度比δ分别可以达到δ_1.05＝77.56，δ_1.10＝93.83和δ_1.25＝99.68，在RMSE指标中达到0.021，REL指标中达到0.035和MAE指标中达到0.018；在现实物体的补全精度比δ分别可以达到δ_1.05＝80.78，δ_1.10＝94.91和δ_1.25＝99.56，在RMSE指标中达到0.021，REL指标中达到0.032和MAE指标中达到0.018，该结果为目前所有透明物体深度补全方法中取得的最佳效果；而学生网络在合成物体的补全精度比δ分别为δ_1.05＝72.48，δ_1.10＝91.23和δ_1.25＝99.32，在RMSE指标中达到0.024，REL指标中达到0.040和MAE指标中达到0.020；在真实物体的补全精度比δ分别为δ_1.05＝71.33，δ_1.10＝87.38和δ_1.25＝98.91，在RMSE指标中达到0.032，REL指标中达到0.044和MAE指标中达到0.025，该效果为目前所有透明物体深度补全方法取得的次佳效果。数据结果如下表1所示：The completion accuracy ratio δ of the teacher network in synthetic objects can reach δ _1.05 ＝77.56, δ _1.10 ＝93.83 and δ _1.25 ＝99.68, reaching 0.021 in RMSE index, 0.035 in REL index and 0.018 in MAE index; the completion accuracy ratio δ of real objects can reach δ _1.05 ＝80.78, δ _1.10 ＝94.91 and δ _{1.25 ＝99.56, reaching 0.021 in RMSE index, 0.032 in REL index and 0.018 in MAE index, which is the best result achieved among all transparent object depth completion methods at present; while the completion accuracy ratio δ of the student network in synthetic objects is δ 1.05} _＝ 72.48, δ _1.10 ＝91.23 and δ _1.25 ＝99.68, reaching 0.021 in RMSE index, 0.032 in REL index and 0.018 in MAE index. =99.32, reaching 0.024 in RMSE, 0.040 in REL and 0.020 in MAE; the completion accuracy ratios δ of real objects are δ _1.05 =71.33, δ _1.10 =87.38 and δ _1.25 =98.91, reaching 0.032 in RMSE, 0.044 in REL and 0.025 in MAE, which is the second best result among all current transparent object depth completion methods. The data results are shown in Table 1 below:

表1Table 1

其中CG，DG，DFNet，FDCT为最新的透明物体补全方法。Among them, CG, DG, DFNet, and FDCT are the latest transparent object completion methods.

在计算效率方面，教师网络可达到24帧每秒，基本满足常见摄像头的需求；学生网络可以达到48帧每秒，满足工业中一些高速摄像头的需求。数据结果如下表2所示：In terms of computing efficiency, the teacher network can reach 24 frames per second, which basically meets the requirements of common cameras; the student network can reach 48 frames per second, which meets the requirements of some high-speed cameras in industry. The data results are shown in Table 2 below:

表2Table 2

在获得完整的深度图后，本实施例将深度图输入到GR-CNN网络中进行物体抓取点预测，该网络使用多个二维卷积和残差块进行特征提取，并预测输出物体的抓取框，本实施例将该抓取框输入给机器人，机器人根据逆运动学生成对应的轨迹并进行抓取。在给定的8类常见透明玻璃杯中，本实施例对每个物体执行了10次抓取，其中抓取起来超过10秒即判为成果，实验中教师网络可以达到83.75的抓取成功率，学生网络可以达到78.75的抓取成功率。机器人抓取透明物体实验如图6-图7所示After obtaining the complete depth map, this embodiment inputs the depth map into the GR-CNN network to predict the object grasping points. The network uses multiple two-dimensional convolutions and residual blocks to extract features and predict the grasping frame of the output object. This embodiment inputs the grasping frame to the robot, and the robot generates a corresponding trajectory based on the inverse kinematics and grasps it. Among the given 8 types of common transparent glass cups, this embodiment performs 10 grasps on each object, and any grasp that takes more than 10 seconds is considered a success. In the experiment, the teacher network can achieve a grasping success rate of 83.75%, and the student network can achieve a grasping success rate of 78.75%. The robot grasping transparent object experiment is shown in Figures 6 and 7

本实施例在透明物体抓取方面应用的蒸馏学习算法，这是目前第一个使用该做法的方法；This embodiment applies the distillation learning algorithm to transparent object grasping, which is the first method to use this approach.

教师网络方法，目前教师方法取得的精度超过目前已有的在透明物体深度补全的其他算法。Teacher network method,The current accuracy achieved by the teacher method exceeds other existing algorithms for,depth completion of transparent objects.

(1)本实施例保留了使用transformer架构提取全局和局部特征的优点，训练出一个目前领先其他所有透明物体深度补全的算法，并通过蒸馏学习的方式，教授出一个精度仅次于教师网络，但是速度提高两倍的学生网络。此外通过在真实环境下进行部署，有效地验证了学生网络的鲁棒性。(1) This embodiment retains the advantages of using the transformer architecture to extract global and local features, trains an algorithm that currently leads all other transparent object depth completion algorithms, and teaches a student network that is second only to the teacher network in accuracy but twice as fast through distillation learning. In addition, by deploying it in a real environment, the robustness of the student network is effectively verified.

(2)使用已有的数据集，不用花费时间和人力成本进行数据采集；(2) Use existing data sets without spending time and manpower on data collection;

(3)不用依赖复杂昂贵的深度摄像头进行数据采集；(3) No need to rely on complex and expensive depth cameras for data collection;

(4)通过训练或者微调模型，即可在真实机器人上进行抓取。(4) By training or fine-tuning the model, grasping can be performed on a real robot.

(5)可根据实际情况选择精度较高的教师网络或者效率更高的学生网络进行部署。(5) Depending on the actual situation, a teacher network with higher accuracy or a student network with higher efficiency can be selected for deployment.

以上，仅为本申请较佳的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到的变化或替换，都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应该以权利要求的保护范围为准。The above are only preferred specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can be easily thought of by any technician familiar with the technical field within the technical scope disclosed in the present application should be included in the protection scope of the present application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims

1. A method for capturing transparent objects based on depth completion, comprising:

Identify transparent objects in RGB images and obtain masked depth maps;

Using a teacher network to extract features from the RGB image and the masked depth map to obtain global features and local features of the transparent object; wherein the teacher network is constructed based on a Transformer network;

Based on the composite distillation loss function, the global features and local features are learned using a student network, and the RGB image and the masked depth map are input into the learned student network for association processing to obtain a predicted depth map; wherein the student network is constructed based on a CNN network;

Inputting the predicted depth map into the GR-CNN network, obtaining a grabbing frame of the transparent object, and completing the grabbing of the transparent object based on the grabbing frame;

Identify transparent objects in RGB images and obtain a masked depth map including:

Using the Deeplabv3+ algorithm to identify transparent objects in the RGB image and extract transparent object masks;

Based on the transparent object mask, remove the corresponding transparent object area in the depth map by mapping the position information to obtain the masked depth map;

Inputting the RGB image and the masked depth map into the learned student network for association processing includes:

Input the RGB image and the masked depth map into the convolution layer respectively for preliminary feature extraction, and obtain the RGB image features and depth map features after preliminary extraction;

The depth map features after preliminary extraction are processed by downsampling layer for dimensionality reduction; each depth map feature after dimensionality reduction is then sent to a downsampling layer for dimensionality reduction;

The RGB image features and depth map features after preliminary extraction are input into the first consistent feature association module for feature association processing; starting from the second feature association module, the output processed by the previous feature association module and the depth map features after the previous dimensionality reduction processing through the downsampling layer are used as the input of the next consistent feature association module, and so on, and are continuously input into three consistent feature association modules for further feature association; wherein, the size of the associated features output by each consistent feature association module is half of the output size of the previous consistent feature association module, but the dimension is increased to twice the original; and the size of the depth map features is also reduced to half of the previous layer after passing through the downsampling layer, and the dimension is increased to twice the original; after the feature association processing is performed by four consistent feature association modules, the encoded second integrated features are finally obtained;

Using a second decoder, decoding the second integrated features to obtain a predicted depth map;

The initially extracted RGB image features and depth map features are input into the first consistent feature association module, and feature association processing includes:

The initially extracted RGB image features and depth map features are respectively input into the convolutional layer for channel compression;

After concatenating the compressed depth map features with the corresponding uncompressed RGB image features in terms of the number of channels, a convolution operation is performed to obtain the similarity weight of the depth map features;

Multiplying the similarity weight of the depth map feature by the uncompressed depth map feature bit by bit to obtain a reliable depth map position feature;

After concatenating the compressed RGB image features with the corresponding uncompressed depth image features in terms of the number of channels, a convolution operation is performed to obtain the similarity weights of the RGB image features;

Multiply the similarity weight of the RGB image feature by the uncompressed RGB image feature bit by bit to obtain a reliable RGB image position feature;

The reliable depth map position features and the reliable RGB image position features are summed by bitwise addition to obtain a final reliable position feature map;

For the final reliable position feature map, linear rectification function ReLU, batch normalization BatchNormalization and convolution are used to perform feature association to obtain associated features.

2. The method for capturing transparent objects based on depth completion according to claim 1, wherein the step of extracting features from the RGB image and the masked depth map using a teacher network comprises:

Performing block embedding processing on the RGB picture and the masked depth map respectively;

The depth map after block embedding is processed by three downsampling layers in turn for dimensionality reduction. The depth map after dimensionality reduction in each downsampling layer is used as the input of a position association module.

The RGB image after block embedding processing, the depth map after block embedding processing, and the depth map after the depth map after block embedding processing are encoded by a position association module to obtain the encoded first integrated feature; wherein the position association module includes 4, the input of the first position association module is the RGB image after block embedding processing and the depth map after block embedding processing, and the input of the subsequent position association module is the output of the previous position association module and the depth map after the depth map after block embedding processing is reduced by the corresponding downsampling layer;

The first decoder is used to perform feature decoding on the encoded first integrated feature to obtain the global feature and the local feature of the transparent object.

3. The method for capturing a transparent object based on depth completion according to claim 2, wherein encoding the block-embedded RGB image and the block-embedded depth map through a first position association module comprises:

Initialize the RGB image and the depth map using layer normalization to obtain RGB image features and depth map features, use the RGB image features as a search request Q, and gradually search the RGB image features and the depth map features to obtain key value elements;

Input the key-value elements into the multi-head attention mechanism module to obtain the confidence of the retrieval results;

Based on layer normalization and a multi-layer perceptron, obtaining correlation features between the RGB image features and the depth map features in the confidence;

Performing an offset window operation on the associated features and the depth map features to obtain a new retrieval result confidence;

Based on the new retrieval result confidence, obtaining the encoded integrated features;

The correlation feature between the RGB picture and the depth map is:

F _corr =MLP(LN(F _qkv ))+F _qkv

Among them, F _corr is the correlation feature, F _qkv is the result of multi-head attention calculation, LN is layer normalization, and MLP is a multi-layer perceptron.

4. The method for capturing transparent objects based on depth completion according to claim 3, wherein performing an offset window operation on the associated features and the depth map features comprises:

Move each pixel in the associated features and the depth map features to the lower right corner by two pixels, and then re-block the associated features and the depth map features;

The block features are used as new search requests Q, and the index elements K and content elements V corresponding to the block-related features and depth map features are gradually searched to obtain new key-value elements;

The new key-value elements are input into the multi-head attention mechanism module to calculate the new retrieval result confidence.

5. The method for grasping transparent objects based on depth completion according to claim 1, wherein the composite distillation loss function is:

in, The loss function for the composite distillation is, is the distance loss, is the structural loss, is the difference between the real depth map and the student predicted depth map, is the difference between the predicted depth maps of the teacher and the student, and λ ₁ , λ ₂ , and λ ₃ are the constant weights of the corresponding losses.