[go: up one dir, main page]

CN118379563B - Navigation model training method and device, electronic equipment and storage medium - Google Patents

Navigation model training method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN118379563B
CN118379563B CN202410805822.0A CN202410805822A CN118379563B CN 118379563 B CN118379563 B CN 118379563B CN 202410805822 A CN202410805822 A CN 202410805822A CN 118379563 B CN118379563 B CN 118379563B
Authority
CN
China
Prior art keywords
sample
text
encoder
navigation
sample image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410805822.0A
Other languages
Chinese (zh)
Other versions
CN118379563A (en
Inventor
易东
乔冠辉
吴凌翔
王金桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Artificial Intelligence Research Institute
Institute of Automation of Chinese Academy of Science
Original Assignee
Wuhan Artificial Intelligence Research Institute
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Artificial Intelligence Research Institute, Institute of Automation of Chinese Academy of Science filed Critical Wuhan Artificial Intelligence Research Institute
Priority to CN202410805822.0A priority Critical patent/CN118379563B/en
Publication of CN118379563A publication Critical patent/CN118379563A/en
Application granted granted Critical
Publication of CN118379563B publication Critical patent/CN118379563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30Map- or contour-matching
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30Map- or contour-matching
    • G01C21/32Structuring or formatting of map data
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34Route searching; Route guidance
    • G01C21/36Input/output arrangements for on-board computers
    • G01C21/3605Destination input or retrieval
    • G01C21/3623Destination input or retrieval using a camera or code reader, e.g. for optical or magnetic codes
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34Route searching; Route guidance
    • G01C21/36Input/output arrangements for on-board computers
    • G01C21/3626Details of the output of route guidance instructions
    • G01C21/3635Guidance using 3D or perspective road maps
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/38Electronic maps specially adapted for navigation; Updating thereof
    • G01C21/3804Creation or updating of map data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

本发明涉及视觉导航技术领域,提供一种导航模型训练方法、装置、电子设备及存储介质,该方法包括:将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器和文本编码器,以提取样本图像特征和样本文本特征;将各样本图文对对应的样本图像特征和样本文本特征代入对比学习损失函数,在对比学习损失函数收敛时,完成对视觉编码器和文本编码器的预训练;基于预训练得到的视觉编码器和文本编码器训练所述导航模型。本发明训练方法训练得到的导航模型能够精准地判断当前视角下的图像是否符合文本信息描述的内容,从而准确地预测机器人下一步的航路点。

The present invention relates to the field of visual navigation technology, and provides a navigation model training method, device, electronic device and storage medium, the method comprising: inputting sample images and sample text information in each sample image-text pair into a visual encoder and a text encoder in a navigation model respectively to extract sample image features and sample text features; substituting the sample image features and sample text features corresponding to each sample image-text pair into a contrastive learning loss function, and completing pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges; and training the navigation model based on the pre-trained visual encoder and text encoder. The navigation model trained by the training method of the present invention can accurately judge whether the image under the current viewing angle conforms to the content described by the text information, thereby accurately predicting the next waypoint of the robot.

Description

导航模型训练方法、装置、电子设备及存储介质Navigation model training method, device, electronic equipment and storage medium

技术领域Technical Field

本发明涉及视觉导航技术领域,尤其涉及一种导航模型训练方法、装置、电子设备及存储介质。The present invention relates to the field of visual navigation technology, and in particular to a navigation model training method, device, electronic equipment and storage medium.

背景技术Background Art

视觉语言导航是通过人类语言文本信息来操控机器人(无人机或自动驾驶车辆等)的一项具身人工智能任务。以无人机为例,该任务对模型输入无人机在起始点视角下的俯视图像、起始角度以及对目的地的文本信息,模型预测无人机下一步的航路点;再以对目的地的文本信息、该下一步的航路点及之前所有航路点(起始点和目标点都属于航路点)各自的俯视图像及起始角度输入模型,模型预测无人机下下一步的航路点,直到到达目标点。其中,一场机器人导航的长途行动过程,分为若干次行动动作,每次行动动作的停止点即为航路点。Visual language navigation is an embodied artificial intelligence task that controls robots (drones or self-driving vehicles, etc.) through human language text information. Taking drones as an example, this task inputs the model with the drone's overhead view image from the starting point, the starting angle, and the text information of the destination. The model predicts the drone's next waypoint; then the model inputs the text information of the destination, the next waypoint, and all previous waypoints (the starting point and the target point are both waypoints) with the overhead view image and starting angle, and the model predicts the drone's next waypoint until it reaches the target point. Among them, a long-distance action process of robot navigation is divided into several actions, and the stopping point of each action is the waypoint.

目前,视觉语言导航技术存在的主要有以下两个问题。At present, there are mainly two problems with visual language navigation technology.

1、未到达目的地便提前停止,在未到达真正目的地之前,当模型错误地把当前视角图像认为是目的地对应图像时,就会出现提前停止的情况。1. Stopping early before reaching the destination. When the model mistakenly considers the current view image as the image corresponding to the destination before reaching the actual destination, early stopping will occur.

2、到达目的地未能有效识别而继续向前行动,当到达真正目的地时,模型却认为该目的地图像与文本信息中的目的地没有关联,就会越过目的地,继续向前行动。2. When the destination is reached but the model fails to effectively identify it and continues to move forward, when the real destination is reached, the model believes that the destination image is not related to the destination in the text information, so it will skip the destination and continue to move forward.

导致上述两个问题出现的原因均为模型未能准确判别当前视角下的图像和文本信息中目的地之间的关联。The reasons for the above two problems are that the model fails to accurately determine the relationship between the destinations in the image and text information at the current perspective.

发明内容Summary of the invention

本发明提供一种导航模型训练方法、装置、电子设备及存储介质,用以解决现有技术的模型不能准确判别当前视角下的图像和文本信息中目的地之间的关联的问题。The present invention provides a navigation model training method, device, electronic device and storage medium, which are used to solve the problem that the model in the prior art cannot accurately determine the association between the destinations in the image and text information under the current viewing angle.

本发明提供一种导航模型训练方法,包括以下步骤。The invention provides a navigation model training method, which comprises the following steps.

将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器和文本编码器,以提取样本图像特征和样本文本特征。The sample image and sample text information in each sample image-text pair are respectively input into the visual encoder and the text encoder in the navigation model to extract the sample image features and the sample text features.

将各样本图文对对应的样本图像特征和样本文本特征代入对比学习损失函数,在所述对比学习损失函数收敛时,完成对视觉编码器和文本编码器的预训练。The sample image features and sample text features corresponding to each sample image-text pair are substituted into the contrastive learning loss function. When the contrastive learning loss function converges, the pre-training of the visual encoder and the text encoder is completed.

基于预训练得到的所述视觉编码器和文本编码器训练所述导航模型。The navigation model is trained based on the pre-trained visual encoder and text encoder.

根据本发明提供的一种导航模型训练方法,所述样本图文对中的样本图像基于预设的样本航路中航路点对应图像构建,所述样本图文对中的样本文本信息至少基于所述航路点所在样本航路中导航阶段的起点航路点对应的样本描述文本构建,所述导航阶段通过样本航路中预设的人机对话航路点划分。According to a navigation model training method provided by the present invention, the sample images in the sample image-text pairs are constructed based on the corresponding images of waypoints in a preset sample route, the sample text information in the sample image-text pairs is constructed based at least on the sample description text corresponding to the starting waypoint of the navigation stage in the sample route where the waypoint is located, and the navigation stage is divided by the human-computer dialogue waypoints preset in the sample route.

根据本发明提供的一种导航模型训练方法,所述样本航路中每一导航阶段分别对应一个样本图文对,所述样本图文对中的样本图像至少基于对应导航阶段的起点航路点和目标航路点的图像拼接形成。According to a navigation model training method provided by the present invention, each navigation stage in the sample route corresponds to a sample image-text pair, and the sample image in the sample image-text pair is formed by stitching images based on at least the starting point waypoint and the target waypoint of the corresponding navigation stage.

根据本发明提供的一种导航模型训练方法,所述样本图文对中的样本图像基于对应导航阶段的起点航路点、目标航路点和至少一中间航路点的图像拼接形成,所述中间航路点为对应导航阶段的起点航路点和目标航路点之间的航路点。According to a navigation model training method provided by the present invention, the sample image in the sample image-text pair is formed by stitching images of a starting waypoint, a target waypoint and at least one intermediate waypoint of the corresponding navigation stage, and the intermediate waypoint is a waypoint between the starting waypoint and the target waypoint of the corresponding navigation stage.

根据本发明提供的一种导航模型训练方法,所述样本图文对中的样本文本信息基于对应导航阶段及之前导航阶段各自的起点航路点对应的样本描述文本拼接形成。According to a navigation model training method provided by the present invention, the sample text information in the sample image-text pair is formed by splicing the sample description texts corresponding to the starting point waypoints of the corresponding navigation stage and the previous navigation stage.

根据本发明提供的一种导航模型训练方法,所述对比学习损失函数基于视觉编码器的第一相似度损失函数和文本编码器的第二相似度损失函数确定。According to a navigation model training method provided by the present invention, the contrastive learning loss function is determined based on a first similarity loss function of a visual encoder and a second similarity loss function of a text encoder.

根据本发明提供的一种导航模型训练方法,在将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器和文本编码器,以提取样本图像特征和样本文本特征之前,还包括:对所述视觉编码器进行图像分类预训练。According to a navigation model training method provided by the present invention, before the sample image and sample text information in each sample image-text pair are respectively input into the visual encoder and text encoder in the navigation model to extract the sample image features and sample text features, it also includes: performing image classification pre-training on the visual encoder.

根据本发明提供的一种导航模型训练方法,在将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器和文本编码器,以提取样本图像特征和样本文本特征之前,还包括:基于导航领域的样本数据集对所述视觉编码器进行目标检测预训练。According to a navigation model training method provided by the present invention, before the sample image and sample text information in each sample image-text pair are respectively input into the visual encoder and text encoder in the navigation model to extract the sample image features and the sample text features, it also includes: pre-training the visual encoder for target detection based on a sample data set in the navigation field.

根据本发明提供的一种导航模型训练方法,基于导航领域的样本数据集对所述视觉编码器进行目标检测预训练,包括:以俯视图样本以及俯视图样本中对应的各目标类别真值和目标框坐标真值为标签,预训练所述视觉编码器。According to a navigation model training method provided by the present invention, the visual encoder is pre-trained for target detection based on a sample data set in the navigation field, including: pre-training the visual encoder with overhead view samples and the corresponding true values of each target category and target frame coordinates in the overhead view samples as labels.

本发明还提供一种导航模型训练装置,包括以下单元。The present invention also provides a navigation model training device, comprising the following units.

预训练特征提取单元,用于将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器和文本编码器,以提取样本图像特征和样本文本特征。The pre-trained feature extraction unit is used to input the sample image and sample text information in each sample image-text pair into the visual encoder and text encoder in the navigation model respectively to extract the sample image features and sample text features.

预训练损失计算单元,用于将各样本图文对对应的样本图像特征和样本文本特征代入对比学习损失函数,在所述对比学习损失函数收敛时,完成对视觉编码器和文本编码器的预训练。The pre-training loss calculation unit is used to substitute the sample image features and sample text features corresponding to each sample image-text pair into the contrastive learning loss function, and complete the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges.

模型训练单元,用于基于预训练得到的所述视觉编码器和文本编码器训练所述导航模型。A model training unit is used to train the navigation model based on the pre-trained visual encoder and text encoder.

本发明还提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现如上述任一种所述的导航模型训练方法。The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the navigation model training method as described above is implemented.

本发明还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现如上述任一种所述的导航模型训练方法。The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the navigation model training method as described in any one of the above is implemented.

本发明提供的导航模型训练方法、装置、电子设备及存储介质,在传统的对导航模型训练之前通过对导航模型中的视觉编码器和文本编码器进行了对比学习预训练,对比学习的目标是使正样本图文对的特征向量距离更近,使负样本图文对的特征向量距离更远,因此,通过该对比学习预训练,对于相匹配的图文对,加强了视觉编码器与文本编码器分别输出的图像特征和文本特征的关联性,对于不匹配的图文对,弱化了视觉编码器与文本编码器分别输出的图像特征和文本特征的关联性,使得导航模型能够精准地判断当前视角下的图像是否符合文本信息描述的内容导航模型整体上能够准确地的判断当前视角下的图像与文本信息语义之间的关联,从而准确地预测机器人下一步的航路点,避免了未到达目的地便提前停止以及到达目的地未能有效识别而继续向前行动的问题。而且预训练后能够得到更优的初始化模型,为后续导航模型整体训练奠定了训练基础,能够有效提高模型训练的收敛速度及测试精度。The navigation model training method, device, electronic device and storage medium provided by the present invention are pre-trained by contrast learning of the visual encoder and text encoder in the navigation model before the traditional training of the navigation model. The goal of contrast learning is to make the feature vector distance of the positive sample image-text pair closer and the feature vector distance of the negative sample image-text pair farther. Therefore, through the contrast learning pre-training, for the matching image-text pair, the correlation between the image features and the text features respectively output by the visual encoder and the text encoder is strengthened, and for the unmatched image-text pair, the correlation between the image features and the text features respectively output by the visual encoder and the text encoder is weakened, so that the navigation model can accurately judge whether the image under the current perspective conforms to the content described by the text information. The navigation model as a whole can accurately judge the correlation between the image under the current perspective and the semantics of the text information, so as to accurately predict the next waypoint of the robot, avoiding the problem of stopping in advance before reaching the destination and continuing to move forward without effective recognition after reaching the destination. Moreover, a better initialization model can be obtained after pre-training, which lays a training foundation for the subsequent overall training of the navigation model and can effectively improve the convergence speed and test accuracy of the model training.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the present invention or the prior art, the following briefly introduces the drawings required for use in the embodiments or the description of the prior art. Obviously, the drawings described below are some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1是本发明提供的导航模型训练方法中运用的导航模型示意图。FIG1 is a schematic diagram of a navigation model used in the navigation model training method provided by the present invention.

图2是本发明提供的导航模型训练方法的流程示意图。FIG. 2 is a flow chart of the navigation model training method provided by the present invention.

图3是本发明提供的导航模型训练方法中视觉编码器和文本编码器对比学习预训练示意图。FIG3 is a schematic diagram of pre-training of comparative learning of a visual encoder and a text encoder in the navigation model training method provided by the present invention.

图4是本发明提供的导航模型训练方法中基于样本航路构造样本图文对的示意图。FIG. 4 is a schematic diagram of constructing sample image-text pairs based on sample routes in the navigation model training method provided by the present invention.

图5是本发明提供的导航模型训练方法中对视觉编码器进行图像分类预训练示意图。FIG5 is a schematic diagram of image classification pre-training of a visual encoder in the navigation model training method provided by the present invention.

图6是本发明提供的导航模型训练方法中对视觉编码器进行图像目标检测预训练示意图。FIG6 is a schematic diagram of image target detection pre-training for a visual encoder in the navigation model training method provided by the present invention.

图7是本发明提供的导航模型训练方法分阶段展示示意图。FIG. 7 is a schematic diagram showing the navigation model training method provided by the present invention in stages.

图8是本发明提供的导航模型训练装置的结构示意图。FIG8 is a schematic diagram of the structure of the navigation model training device provided by the present invention.

图9是本发明提供的电子设备的结构示意图。FIG. 9 is a schematic diagram of the structure of an electronic device provided by the present invention.

具体实施方式DETAILED DESCRIPTION

为使本发明的目的、技术方案和优点更加清楚,下面将结合本发明中的附图,对本发明中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make the purpose, technical solution and advantages of the present invention clearer, the technical solution of the present invention will be clearly and completely described below in conjunction with the drawings of the present invention. Obviously, the described embodiments are part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.

由于基于视觉语言的导航模型是根据人类语言文本信息以及机器人感知的当前视觉图像来预测机器人行动的下一个航路点,因此,该导航模型中至少包括:用于对图像识别并提取特征的视觉编码器和对文本信息识别并特征提取的文本编码器。如图1所示,为一种基于视觉语言的导航模型结构示意图,包括:视觉编码器110、文本编码器120、角度编码器130和多模态信息融合模块140,角度编码器130用于对每个航路点时机器人的方向角进行识别并提取角度特征,多模态信息融合模块140用于对图像特征、文本特征和角度特征进行融合并输出机器人下一航路点的坐标值,直到到达目标点。其中,视觉编码器110可以是MViT(Multiscale Vision Transformer),文本编码器120可以是Roberta,角度编码器130可以是Projector。MViT是一种基于Transformer架构的深度学习模型,专门用于处理计算机视觉任务,与传统的卷积神经网络不同,MViT利用Transformer的自注意力机制来捕捉图像中的全局特征,并且在不同尺度上对图像进行处理,使得其在处理大规模数据集时表现更加优秀。Since the navigation model based on visual language predicts the next waypoint of the robot's action based on human language text information and the current visual image perceived by the robot, the navigation model at least includes: a visual encoder for image recognition and feature extraction and a text encoder for text information recognition and feature extraction. As shown in FIG1 , it is a schematic diagram of the structure of a navigation model based on visual language, including: a visual encoder 110, a text encoder 120, an angle encoder 130 and a multimodal information fusion module 140, the angle encoder 130 is used to identify the direction angle of the robot at each waypoint and extract the angle feature, and the multimodal information fusion module 140 is used to fuse the image feature, the text feature and the angle feature and output the coordinate value of the next waypoint of the robot until it reaches the target point. Among them, the visual encoder 110 can be MViT (Multiscale Vision Transformer), the text encoder 120 can be Roberta, and the angle encoder 130 can be Projector. MViT is a deep learning model based on the Transformer architecture, specifically designed for computer vision tasks. Unlike traditional convolutional neural networks, MViT uses the Transformer's self-attention mechanism to capture global features in images and processes images at different scales, making it perform better when processing large-scale data sets.

下面以上述导航模型结构为例说明本发明实施例的导航模型训练方法,但不限于上述导航模型结构,只要包括视觉编码器110和文本编码器120,或者包括视觉编码和文本编码相应功能模块的导航模型结构均适用于本发明实施例的导航模型训练方法。The navigation model training method of the embodiment of the present invention is illustrated below using the above-mentioned navigation model structure as an example, but is not limited to the above-mentioned navigation model structure. As long as the navigation model structure includes the visual encoder 110 and the text encoder 120, or includes corresponding functional modules of visual encoding and text encoding, it is applicable to the navigation model training method of the embodiment of the present invention.

本发明实施例的导航模型训练方法,如图2和3所示,包括以下步骤S210至步骤S230。The navigation model training method according to the embodiment of the present invention, as shown in FIGS. 2 and 3 , includes the following steps S210 to S230 .

步骤S210:将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器110和文本编码器120,以提取样本图像特征和样本文本特征。样本图文对是将样本图像和样本文本信息组成的图文对,对于相匹配的样本图像和样本文本信息组成的样本图文对为正样本数据,不匹配的样本图像和样本文本信息组成的样本图文对为负样本数据。在样本图像中具有样本文本信息中描述的部分或全部内容时,则该样本图像和该样本文本信息相匹配,否则不匹配,例如:对于“目的地在你三点钟方向,是一个与公路平行的小长方形”的样本文本信息,那么只要存在公路和/或长方形状物体的样本图像均与该样本文本信息匹配。Step S210: Input the sample image and sample text information in each sample image-text pair into the visual encoder 110 and the text encoder 120 in the navigation model respectively to extract the sample image features and the sample text features. The sample image-text pair is a picture-text pair composed of the sample image and the sample text information. The sample image-text pair composed of the matched sample image and the sample text information is the positive sample data, and the sample image-text pair composed of the unmatched sample image and the sample text information is the negative sample data. When the sample image has part or all of the content described in the sample text information, the sample image and the sample text information match, otherwise they do not match. For example, for the sample text information of "the destination is at three o'clock in your direction, which is a small rectangle parallel to the road", as long as there are sample images of roads and/or rectangular objects, they will match the sample text information.

步骤S220:将各样本图文对对应的样本图像特征和样本文本特征代入对比学习损失函数,在对比学习损失函数收敛时,完成对视觉编码器110和文本编码器120的预训练。本实施例中,通过对比学习训练,在对比学习损失函数收敛时,对于正样本图文对,视觉编码器110与文本编码器120分别输出的样本图像特征和样本文本特征之间关联性得到了加强。Step S220: Substitute the sample image features and sample text features corresponding to each sample image-text pair into the contrastive learning loss function, and complete the pre-training of the visual encoder 110 and the text encoder 120 when the contrastive learning loss function converges. In this embodiment, through contrastive learning training, when the contrastive learning loss function converges, for the positive sample image-text pair, the correlation between the sample image features and the sample text features output by the visual encoder 110 and the text encoder 120 is strengthened.

步骤S230:基于预训练得到视觉编码器110和文本编码器120训练所述导航模型。本步骤中,可以采用现有的训练方式对导航模型进行训练,由于视觉编码器110和文本编码器120经过了上述步骤S210和S220的预训练,对于相匹配的图像和文本信息,对应的图像特征和文本特征具有较强的关联性,使得训练完成的导航模型能够准确地的判断当前视角下的图像与文本信息语义之间的关联,避免错误确定目标点,而导致未到达目的地便提前停止以及到达目的地未能有效识别而继续向前行动。Step S230: Train the navigation model based on the pre-trained visual encoder 110 and text encoder 120. In this step, the navigation model can be trained using an existing training method. Since the visual encoder 110 and text encoder 120 have been pre-trained in the above steps S210 and S220, the corresponding image features and text features have a strong correlation for the matching image and text information, so that the trained navigation model can accurately determine the correlation between the image and text information semantics under the current perspective, avoiding incorrect determination of the target point, resulting in premature stopping before reaching the destination, and continuing to move forward without effective recognition after reaching the destination.

本实施例的导航模型训练方法中,在传统的对导航模型训练之前通过步骤S210和步骤S220对导航模型中的视觉编码器110和文本编码器120进行了对比学习预训练,对比学习的目标是使正样本图文对的特征向量距离更近,使负样本图文对的特征向量距离更远,因此,通过该对比学习预训练,对于相匹配的图文对,加强了视觉编码器110与文本编码器120分别输出的图像特征和文本特征的关联性,对于不匹配的图文对,弱化了视觉编码器110与文本编码器120分别输出的图像特征和文本特征的关联性,使得导航模型能够精准地判断当前视角下的图像是否符合文本信息描述的内容,从而准确地预测机器人下一步的航路点,避免了未到达目的地便提前停止以及到达目的地未能有效识别而继续向前行动的问题。而且预训练后能够得到更优的初始化模型,为后续导航模型整体训练奠定了训练基础,能够有效提高模型训练的收敛速度及测试精度。In the navigation model training method of this embodiment, before the traditional training of the navigation model, the visual encoder 110 and the text encoder 120 in the navigation model are pre-trained by contrastive learning through step S210 and step S220. The goal of contrastive learning is to make the feature vector distance of the positive sample image-text pair closer and the feature vector distance of the negative sample image-text pair farther. Therefore, through the contrastive learning pre-training, for the matching image-text pair, the correlation between the image features and the text features output by the visual encoder 110 and the text encoder 120 is strengthened, and for the unmatched image-text pair, the correlation between the image features and the text features output by the visual encoder 110 and the text encoder 120 is weakened, so that the navigation model can accurately judge whether the image under the current perspective conforms to the content described by the text information, thereby accurately predicting the next waypoint of the robot, avoiding the problem of stopping in advance before reaching the destination and continuing to move forward without effective recognition after reaching the destination. Moreover, after pre-training, a better initialization model can be obtained, which lays a training foundation for the subsequent overall training of the navigation model and can effectively improve the convergence speed and test accuracy of the model training.

在一些实施例中,所述样本图文对中的样本图像基于预设的样本航路中航路点对应图像构建,所述样本图文对中的样本文本信息至少基于所述航路点所在样本航路中导航阶段的起点航路点对应的样本描述文本构建,所述导航阶段通过样本航路中预设的人机对话航路点划分。本实施例中,可以采用视觉对话导航领域的AVDN数据集,该AVDN数据集中包括多条样本航路。In some embodiments, the sample images in the sample image-text pair are constructed based on the images corresponding to the waypoints in the preset sample routes, and the sample text information in the sample image-text pair is constructed based on at least the sample description text corresponding to the starting point waypoint of the navigation stage in the sample route where the waypoint is located, and the navigation stage is divided by the preset human-computer dialogue waypoints in the sample route. In this embodiment, an AVDN dataset in the field of visual dialogue navigation can be used, and the AVDN dataset includes multiple sample routes.

其中,样本航路为表示机器人行动路线的样本数据,人机对话航路点是在标注样本数据时,模拟机器人在到达某一航路点时认为已到达目标点,询问人工平台是否到达目标点,人工确定未达到目标点时回复文本信息(例如:目标点在当前导航点的东南方向),机器人收到该回复后再次导航,进入下一个导航阶段,该人机对话航路点为下一个导航阶段的起点航路点,也为上一个导航阶段的目标航路点,本次对话信息为该对话航路点对应的目标点的样本描述文本。具体如图4所示,从起点到目标点的一条样本航路,起点和目标点中间所有的点均为航路点,样本航路中每一航路点均包括对应的图像和坐标,样本航路中,起点和人机对话航路点分别对应有目标点的样本描述文本,起点对应的目标点的样本描述文本为对目标点的原始描述信息,例如:目的地在你三点钟方向,是一个与公路平行的小长方形。因此,最简单的构造样本图文对方式是以任一导航点对应的图像为样本图像,该任一导航点所在导航阶段的起点航路点对应的样本描述文本为样本文本信息。Among them, the sample route is the sample data representing the robot's action route. The human-machine dialogue waypoint is when marking the sample data. When the robot reaches a certain waypoint, it thinks that it has reached the target point, and asks the artificial platform whether it has reached the target point. When the artificial platform determines that it has not reached the target point, it replies with a text message (for example: the target point is in the southeast direction of the current navigation point). After receiving the reply, the robot navigates again and enters the next navigation stage. The human-machine dialogue waypoint is the starting waypoint of the next navigation stage and the target waypoint of the previous navigation stage. The dialogue information of this time is the sample description text of the target point corresponding to the dialogue waypoint. As shown in Figure 4, a sample route from the starting point to the target point, all points between the starting point and the target point are waypoints, and each waypoint in the sample route includes the corresponding image and coordinates. In the sample route, the starting point and the human-machine dialogue waypoints correspond to the sample description text of the target point respectively. The sample description text of the target point corresponding to the starting point is the original description information of the target point, for example: the destination is at your three o'clock direction, which is a small rectangle parallel to the road. Therefore, the simplest way to construct sample image-text pairs is to use the image corresponding to any navigation point as the sample image, and the sample description text corresponding to the starting point waypoint of the navigation stage where the any navigation point is located as the sample text information.

在一些实施例中,所述样本航路中每一导航阶段分别对应一个样本图文对,所述样本图文对中的样本图像至少基于对应导航阶段的起点航路点和目标航路点的图像拼接形成,对应的样本文本信息仍然可以是该导航阶段的起点航路点对应的样本描述文本。具体地,如图4所示,每一导航阶段分别对应一个样本图文对,那么图4中对应三个样本图文对,对于第一导航阶段,样本图文对中的样本图像至少为起点和第一人机对话航路点各自的图像拼接而成,样本文本信息为起点对应的样本描述文本。此处的拼接表示按图像的边界连接形成一张图像即可。In some embodiments, each navigation stage in the sample route corresponds to a sample image-text pair, and the sample image in the sample image-text pair is formed by stitching together at least the images of the starting waypoint and the target waypoint of the corresponding navigation stage, and the corresponding sample text information can still be the sample description text corresponding to the starting waypoint of the navigation stage. Specifically, as shown in FIG4 , each navigation stage corresponds to a sample image-text pair, so there are three sample image-text pairs in FIG4 . For the first navigation stage, the sample image in the sample image-text pair is formed by stitching together at least the images of the starting point and the first human-computer dialogue waypoint, and the sample text information is the sample description text corresponding to the starting point. The stitching here means that the image boundaries are connected to form an image.

本实施例中,每一个导航阶段可产生至少一个正样本图文对,不匹配的图文对作为负样本图文对。在对比学习预训练时,导航模型更容易从正负样本的对比学习训练中,学习到样本图像和样本文本信息之间的关联性。由于起点航路点和目标航路点各自对应的图像语义信息与文本描述最为贴切,与文本关联度最高,从而使得在对比学习预训练后,对于相匹配的图像和文本信息,视觉编码器110和文本编码器120分别提取的图像特征和文本特征具有更强的关联性。In this embodiment, each navigation stage can generate at least one positive sample image-text pair, and the mismatched image-text pairs are used as negative sample image-text pairs. During contrastive learning pre-training, the navigation model can more easily learn the correlation between sample images and sample text information from the contrastive learning training of positive and negative samples. Since the image semantic information corresponding to the starting waypoint and the target waypoint is most appropriate to the text description and has the highest correlation with the text, after contrastive learning pre-training, the image features and text features extracted by the visual encoder 110 and the text encoder 120 respectively have a stronger correlation for the matching image and text information.

在一些实施例中,所述样本图文对中的样本图像基于对应导航阶段的起点航路点、目标航路点和至少一中间航路点的图像拼接形成,所述中间航路点为对应导航阶段的起点航路点和目标航路点之间的航路点,例如:可随机选择两个中间航路点,一共四张图拼接成一整张图作为样本图像。虽然中间航路点与样本文本信息的关联性更低,但随机选择中间航路点,也提高了选择图像的丰富度,使导航模型在训练过程中能学习更多的图像信息,一定程度上增强了导航模型的鲁棒性。In some embodiments, the sample image in the sample image-text pair is formed by stitching images of the starting waypoint, the target waypoint, and at least one intermediate waypoint of the corresponding navigation phase, wherein the intermediate waypoint is a waypoint between the starting waypoint and the target waypoint of the corresponding navigation phase, for example, two intermediate waypoints can be randomly selected, and a total of four images are stitched into a whole image as a sample image. Although the correlation between the intermediate waypoints and the sample text information is lower, the random selection of the intermediate waypoints also improves the richness of the selected images, so that the navigation model can learn more image information during the training process, and to a certain extent enhances the robustness of the navigation model.

在一些实施例中,所述样本图文对中的样本文本信息基于对应导航阶段及之前导航阶段各自的起点航路点对应的样本描述文本拼接形成。即一个样本图文对中不止包括当前导航阶段的起点航路点对应的样本描述文本,还包括之前导航阶段的起点航路点对应的样本描述文本,样本文本信息中对目标点的描述文本信息更丰富,从而使得在对比学习预训练后,对于相匹配的图像和文本信息,视觉编码器110和文本编码器120分别提取的图像特征和文本特征具有更强的关联性。In some embodiments, the sample text information in the sample image-text pair is formed by splicing the sample description texts corresponding to the starting point waypoints of the corresponding navigation stage and the previous navigation stage. That is, a sample image-text pair includes not only the sample description text corresponding to the starting point waypoint of the current navigation stage, but also the sample description text corresponding to the starting point waypoint of the previous navigation stage. The sample text information contains richer description text information of the target point, so that after the contrast learning pre-training, the image features and text features extracted by the visual encoder 110 and the text encoder 120 respectively have stronger correlation for the matching image and text information.

在一些实施例中,所述对比学习损失函数基于视觉编码器110的第一相似度损失函数和文本编码器120的第二相似度损失函数确定。具体地,对比学习损失函数如下。In some embodiments, the contrastive learning loss function is determined based on the first similarity loss function of the visual encoder 110 and the second similarity loss function of the text encoder 120. Specifically, the contrastive learning loss function as follows.

其中,sim( )为相似度函数,用矩阵点乘计算,exp( )表示自然指数函数,分别表示所述第一相似度损失函数和第二相似度损失函数,表示N个样本图文对中的第i个样本图文对,表示第i个样本图文对的样本图像,表示第i个样本图文对的样本文本信息,为可学习的温度参数。Among them, sim () is the similarity function, which is calculated by matrix dot multiplication, exp () represents the natural exponential function, and represent the first similarity loss function and the second similarity loss function respectively, represents the i -th sample image-text pair among N sample image-text pairs, represents the sample image of the i -th sample image-text pair, represents the sample text information of the i -th sample image-text pair, is the learnable temperature parameter.

在一些实施例中,导航模型包括:角度编码器130、多模态信息融合模块140、所述视觉编码器110和所述文本编码器120,基于该导航模型结构,步骤230具体包括以下步骤。In some embodiments, the navigation model includes: an angle encoder 130, a multimodal information fusion module 140, the visual encoder 110 and the text encoder 120. Based on the navigation model structure, step 230 specifically includes the following steps.

对于样本航路中的任一航路点,将所述任一航路点及其所在导航阶段中位于所述任一航路点之前的航路点各自对应的样本图像和无人机样本方向角分别输入所述视觉编码器和所述角度编码器,以提取样本图像特征和样本角度特征;将所述任一航路点及之前所有航路点对应的样本描述文本拼接成样本文本信息,并将样本文本信息输入所述文本编码器,以提取样本文本特征。具体地,本实施例中,也可以采用AVDN数据集中的样本航路进行训练,且每预测出一个航路点,就将该航路点及之前所有航路点(起点和所有的人机对话航路点)对应的样本描述文本拼接成样本文本信息,在训练过程中,导航模型能够学习到更多样本描述文本相关的语义信息,将更多的文本语义信息与样本图像关联,使得导航模型更准确地预测下一个航路点。For any waypoint in the sample route, the sample images and the sample direction angles of the drone corresponding to the waypoint and the waypoints before the waypoint in the navigation stage are respectively input into the visual encoder and the angle encoder to extract sample image features and sample angle features; the sample description texts corresponding to the waypoint and all previous waypoints are spliced into sample text information, and the sample text information is input into the text encoder to extract sample text features. Specifically, in this embodiment, the sample routes in the AVDN data set can also be used for training, and each time a waypoint is predicted, the sample description texts corresponding to the waypoint and all previous waypoints (the starting point and all human-machine dialogue waypoints) are spliced into sample text information. During the training process, the navigation model can learn more semantic information related to the sample description texts, and associate more text semantic information with the sample images, so that the navigation model can more accurately predict the next waypoint.

将样本图像特征、样本文本特征和样本角度特征输入多模态信息融合模块,以得到多模态信息融合模块输出的所述任一航路点的下一航路点的预测坐标值。The sample image features, sample text features and sample angle features are input into the multimodal information fusion module to obtain the predicted coordinate value of the next waypoint of any waypoint output by the multimodal information fusion module.

将样本航路中任一航路点的下一航路点的真实坐标值和所述预测坐标值代入预设的动作预测损失函数,在所述动作预测损失函数收敛时,训练完成,动作预测损失函数如下。The actual coordinate value of the next waypoint of any waypoint in the sample route and the predicted coordinate value are substituted into the preset action prediction loss function. When the action prediction loss function converges, the training is completed. The action prediction loss function is as follows.

其中,T为当前执行样本个数,(a,b)为无人机下一步位置的二维坐标真实值,(,)为无人机下一步位置二维坐标模型预测值。Where, T is the number of samples currently executed, ( a , b ) is the true value of the two-dimensional coordinates of the next position of the drone, ( , ) is the predicted value of the two-dimensional coordinate model of the next position of the UAV.

在一些实施例中,步骤S210之前还包括:对所述视觉编码器110进行图像分类预训练。具体地,如图5所示,本实施例使用ImageNet数据集对视觉编码器110进行大规模图像分类任务预训练,视觉编码器110可以是MViT。ImageNet是一个包含超过1400万张图像的数据库,涵盖超过1000个类别,是目前规模最大且应用最广泛的图像识别数据集之一。该数据集中的图像经过人工标注,每个图像都对应一个类别标签,为图像分类训练提供了丰富的数据资源。为了使视觉编码器110能较好地提取图像特征,本实施例中,对视觉编码器110做图像分类预训练。在图像分类预训练中,采用交叉熵损失函数进行分类结果对比,并进行梯度反传,交叉熵损失函数衡量了模型输出的类别概率分布与真实标签之间的差异。其数学表达式如下。In some embodiments, before step S210, it also includes: performing image classification pre-training on the visual encoder 110. Specifically, as shown in Figure 5, this embodiment uses the ImageNet data set to pre-train the visual encoder 110 for large-scale image classification tasks, and the visual encoder 110 can be MViT. ImageNet is a database containing more than 14 million images, covering more than 1,000 categories, and is one of the largest and most widely used image recognition data sets. The images in this data set are manually annotated, and each image corresponds to a category label, which provides rich data resources for image classification training. In order to enable the visual encoder 110 to better extract image features, in this embodiment, the visual encoder 110 is pre-trained for image classification. In the image classification pre-training, the cross entropy loss function is used to compare the classification results, and gradient back propagation is performed. The cross entropy loss function measures the difference between the category probability distribution output by the model and the true label. Its mathematical expression is as follows.

其中,N是样本数量,C是类别数量,是样本i的真实标签c的one-hot编码,是模型对样本i预测为类别c的概率。Where N is the number of samples, C is the number of categories, is the one-hot encoding of the true label c of sample i , is the probability that the model predicts that sample i is of category c .

本实施例中,通过图像分类预训练,视觉编码器110可以学习到图像的低级特征(图像中局部的像素变化)以及高级语义信息(图像是什么),从而为后续的模型预训练及训练阶段做好铺垫,使得视觉编码器110或整个导航模型在后续阶段更容易地适应特定任务的数据集,同时提高了视觉编码器110对图像特征的抽象能力和泛化能力。In this embodiment, through image classification pre-training, the visual encoder 110 can learn the low-level features of the image (local pixel changes in the image) and high-level semantic information (what the image is), thereby paving the way for subsequent model pre-training and training stages, making it easier for the visual encoder 110 or the entire navigation model to adapt to task-specific data sets in subsequent stages, while improving the visual encoder 110's ability to abstract and generalize image features.

在一些实施例中,在步骤S210之前,还包括:基于导航领域的样本数据集对所述视觉编码器110进行图像中目标检测预训练,从而进一步地提高视觉编码器110对领域特定数据集的特征提取能力。In some embodiments, before step S210, it also includes: pre-training the visual encoder 110 for target detection in images based on a sample dataset in the navigation field, so as to further improve the feature extraction capability of the visual encoder 110 for the domain-specific dataset.

对于无人机导航任务来说,图像来源于空中俯拍形成的俯视图。因此,为了训练适用于无人机导航的导航模型,优选以俯视图样本以及俯视图样本中对应的各目标类别真值和目标框坐标真值为标签,预训练所述视觉编码器110,视觉编码器110可以是MViT。For the UAV navigation task, the image comes from the top view formed by aerial aerial photography. Therefore, in order to train a navigation model suitable for UAV navigation, it is preferred to use the top view samples and the corresponding target category true values and target frame coordinate true values in the top view samples as labels to pre-train the visual encoder 110, which may be MViT.

具体训练过程如图6所示,由于卫星图像也是俯视图,因此采用xView数据集的高分辨率卫星图像及其相应的标签信息进行无人机导航领域特定目标检测任务预训练。xView数据集是从在距离地面0.3米处从WorldView-3卫星收集到的卫星图像数据集,该数据集用于目标检测的细粒度类别包括建筑物、车辆和基础设施等等。这些图像覆盖了世界各地的广泛地理位置,提供了复杂的背景。数据集中的图像分辨率非常高,可以对小型和复杂的目的地进行详细分析。目标检测预训练包含两个损失函数:包围框分类损失函数和包围框回归损失函数。成对的图像,类别标签y和包围框标签坐标t被输入到MViT模型中计算损失函数。其中,包围框分类损失函数表达式如下。The specific training process is shown in Figure 6. Since satellite images are also bird's-eye views, high-resolution satellite images of the xView dataset and their corresponding label information are used for pre-training of specific target detection tasks in the field of drone navigation. The xView dataset is a satellite image dataset collected from the WorldView-3 satellite at 0.3 meters above the ground. The fine-grained categories used for target detection in this dataset include buildings, vehicles, and infrastructure, etc. These images cover a wide range of geographical locations around the world, providing complex backgrounds. The image resolution in the dataset is very high, allowing for detailed analysis of small and complex destinations. Target detection pre-training contains two loss functions: the bounding box classification loss function and the bounding box regression loss function. Paired images, category labels y , and bounding box label coordinates t are input into the MViT model to calculate the loss function. Among them, the bounding box classification loss function is expressed as follows.

其中,N是包围框的个数,C是类别的个数,代表第i个包围框的类别c的标签,是第i个预测包围框属于类别c的预测概率。Among them, N is the number of bounding boxes, C is the number of categories, represents the label of category c of the i -th bounding box, is the predicted probability that the i -th predicted bounding box belongs to category c .

包围框回归损失函数表达式如下。The bounding box regression loss function expression is as follows.

其中,是正样本的个数,pos是正样本个数的索引,表示第i个包围框的标签坐标,表示预测的第i个包围框的预测坐标。是smooth L1损失函数,其表达式如下。in, is the number of positive samples, pos is the index of the number of positive samples, represents the label coordinates of the i -th bounding box, Represents the predicted coordinates of the predicted i -th bounding box. is the smooth L1 loss function, which is expressed as follows.

最终损失函数为:The final loss function is: .

如图7所示,为一具体实施例中,无人机导航模型训练方法示意图,包括以下训练步骤。As shown in FIG. 7 , it is a schematic diagram of a method for training a drone navigation model in a specific embodiment, which includes the following training steps.

第一阶段预训练:采用ImageNet数据集对视觉编码器110进行图像分类预训练。The first stage of pre-training: the image classification pre-training of the visual encoder 110 is performed using the ImageNet dataset.

第二阶段预训练:采用xView数据集对第一阶段预训练完成的视觉编码器110进行目标检测预训练。其中,xView数据集中的样本均是高分辨率的卫星图像,因此,在进行本阶段预训练之前,对高分辨率的卫星图像进行平滑处理,将一幅图像均匀地划分为四块,以匹配第一阶段预训练时图像的分辨率。分成四块后,转换其中的标签信息,图像被分成四块后,每一块中目标的包围框的坐标转换为相对于所在图像块的坐标。Second stage pre-training: Use the xView dataset to perform target detection pre-training on the visual encoder 110 that has completed the first stage pre-training. The samples in the xView dataset are all high-resolution satellite images. Therefore, before pre-training in this stage, the high-resolution satellite images are smoothed and an image is evenly divided into four blocks to match the resolution of the image during the first stage pre-training. After being divided into four blocks, the label information is converted. After the image is divided into four blocks, the coordinates of the bounding box of the target in each block are converted to the coordinates relative to the image block.

第三阶段预训练:采用AVDN数据集对文本编码器120和第二阶段预训练完成的视觉编码器110进行对比学习预训练。The third stage pre-training: the AVDN dataset is used to perform comparative learning pre-training on the text encoder 120 and the visual encoder 110 completed in the second stage pre-training.

导航模型训练阶段:基于第三阶段预训练完成的视觉编码器110和文本编码器120对无人机导航模型进行训练。Navigation model training stage: The UAV navigation model is trained based on the visual encoder 110 and text encoder 120 pre-trained in the third stage.

本实施例中,在第一阶段预训练中,使用大规模图像数据对视觉编码器110做图像分类训练,使视觉编码器110能够提取更好的视觉特征;第二阶段预训练中,使用卫星地图数据集对视觉编码器110做目标检测任务,进一步提高视觉编码器110对领域特定数据集的特征提取能力,第三阶段预训练中,使用无人机视觉语言导航AVDN数据集对文本与图像数据做对比学习训练,对于相匹配的图像和文本信息,加强了文本编码器与视觉编码器分别输出的图像特征和文本特征的关联,使得导航模型能够精准地判断当前视角下的图像是否符合文本描述,而且经过三阶段的预训练为后续模型整体训练奠定了训练基础,能够有效提高模型训练的收敛速度及测试精度。In this embodiment, in the first stage of pre-training, large-scale image data is used to perform image classification training on the visual encoder 110, so that the visual encoder 110 can extract better visual features; in the second stage of pre-training, a satellite map dataset is used to perform target detection tasks on the visual encoder 110, further improving the feature extraction capability of the visual encoder 110 for domain-specific datasets; in the third stage of pre-training, an unmanned aerial vehicle visual language navigation AVDN dataset is used to perform comparative learning training on text and image data, and for matching images and text information, the association between the image features and text features output by the text encoder and the visual encoder respectively is strengthened, so that the navigation model can accurately determine whether the image at the current perspective conforms to the text description, and the three stages of pre-training lay a foundation for subsequent overall model training, which can effectively improve the convergence speed and test accuracy of model training.

上述实施例的导航模型训练方法不仅适合于无人机导航模型,也适用于其它机器人的导航模型,例如:在野外作业的自动化作业机械的导航模型,全自动驾驶汽车的导航模型。The navigation model training method of the above embodiment is not only suitable for the navigation model of unmanned aerial vehicles, but also suitable for the navigation models of other robots, for example: the navigation model of automated working machinery operating in the field, and the navigation model of fully automatic driving vehicles.

下面对本发明提供的导航模型训练装置进行描述,下文描述的导航模型训练装置与上文描述的导航模型训练方法可相互对应参照。The navigation model training device provided by the present invention is described below. The navigation model training device described below and the navigation model training method described above can be referred to each other.

本发明实施例的导航模型训练装置,如图8所示,包括以下模块。The navigation model training device according to the embodiment of the present invention, as shown in FIG8 , includes the following modules.

预训练特征提取单元810,用于将各样本图文对中的样本图像和样本文本信息分别输入导航模型中的视觉编码器和文本编码器,以提取样本图像特征和样本文本特征。The pre-trained feature extraction unit 810 is used to input the sample image and sample text information in each sample image-text pair into the visual encoder and text encoder in the navigation model respectively, so as to extract the sample image features and sample text features.

预训练损失计算单元820,用于将各样本图文对对应的样本图像特征和样本文本特征代入对比学习损失函数,在所述对比学习损失函数收敛时,完成对视觉编码器和文本编码器的预训练。The pre-training loss calculation unit 820 is used to substitute the sample image features and sample text features corresponding to each sample image-text pair into the contrastive learning loss function, and complete the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges.

模型训练单元830,用于基于预训练得到的所述视觉编码器和文本编码器训练所述导航模型。The model training unit 830 is used to train the navigation model based on the pre-trained visual encoder and text encoder.

本发明实施例的导航模型训练装置,在传统的对导航模型训练之前通过对导航模型中的视觉编码器110和文本编码器120进行了对比学习预训练,对比学习的目标是使正样本图文对的特征向量距离更近,使负样本图文对的特征向量距离更远,因此,通过该对比学习预训练,对于相匹配的图文对,加强了视觉编码器110与文本编码器120分别输出的图像特征和文本特征的关联性,对于不匹配的图文对,弱化了视觉编码器110与文本编码器120分别输出的图像特征和文本特征的关联性,使得导航模型能够精准地判断当前视角下的图像是否符合文本信息描述的内容导航模型整体上能够准确地的判断当前视角下的图像与文本信息语义之间的关联,从而准确地预测机器人下一步的航路点,避免了未到达目的地便提前停止以及到达目的地未能有效识别而继续向前行动的问题。而且预训练后能够得到更优的初始化模型,为后续导航模型整体训练奠定了训练基础,能够有效提高模型训练的收敛速度及测试精度。The navigation model training device of the embodiment of the present invention performs contrast learning pre-training on the visual encoder 110 and the text encoder 120 in the navigation model before the traditional training of the navigation model. The goal of contrast learning is to make the feature vector distance of the positive sample image-text pair closer and the feature vector distance of the negative sample image-text pair farther. Therefore, through the contrast learning pre-training, for the matching image-text pair, the correlation between the image features and the text features respectively output by the visual encoder 110 and the text encoder 120 is strengthened, and for the unmatched image-text pair, the correlation between the image features and the text features respectively output by the visual encoder 110 and the text encoder 120 is weakened, so that the navigation model can accurately judge whether the image at the current perspective conforms to the content described by the text information. The navigation model as a whole can accurately judge the correlation between the image at the current perspective and the semantics of the text information, thereby accurately predicting the next waypoint of the robot, avoiding the problem of stopping in advance before reaching the destination and continuing to move forward without effective recognition after reaching the destination. Moreover, a better initialization model can be obtained after pre-training, which lays a training foundation for the subsequent overall training of the navigation model and can effectively improve the convergence speed and test accuracy of the model training.

Optionally, the sample image in a sample image-text pair is constructed from the images corresponding to waypoints in a preset sample route; the sample text information in the pair is constructed at least from the sample description text corresponding to the starting waypoint of the navigation stage, within the sample route, in which the waypoint lies; and the navigation stages are delimited by human-machine dialogue waypoints preset in the sample route.

Optionally, each navigation stage in the sample route corresponds to one sample image-text pair, and the sample image in the pair is formed by stitching together at least the images of the starting waypoint and the target waypoint of the corresponding navigation stage.

Optionally, the sample image in the sample image-text pair is formed by stitching together the images of the starting waypoint, the target waypoint, and at least one intermediate waypoint of the corresponding navigation stage, where an intermediate waypoint is a waypoint lying between the starting waypoint and the target waypoint of that stage.

Optionally, the sample text information in the sample image-text pair is formed by splicing together the sample description texts corresponding to the starting waypoints of the corresponding navigation stage and of all preceding navigation stages.
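As a rough illustration of how such a pair could be assembled, the sketch below stitches the waypoint images of one navigation stage horizontally and splices together the accumulated stage description texts. The data layout assumed here (a list of image paths for the stage and one description string per stage so far) is an assumption of the sketch, not the patent's actual pipeline.

```python
from PIL import Image

def build_sample_pair(stage_image_paths, stage_texts):
    """Build one sample image-text pair for a navigation stage.

    stage_image_paths: images of the start, (optional) intermediate, and
                       target waypoints of the stage, in route order.
    stage_texts:       start-waypoint description texts of this stage and
                       all preceding stages, in route order.
    """
    imgs = [Image.open(p).convert("RGB") for p in stage_image_paths]
    # Rescale to a common height, then paste the images side by side.
    h = min(im.height for im in imgs)
    imgs = [im.resize((int(im.width * h / im.height), h)) for im in imgs]
    canvas = Image.new("RGB", (sum(im.width for im in imgs), h))
    x = 0
    for im in imgs:
        canvas.paste(im, (x, 0))
        x += im.width
    text = " ".join(stage_texts)  # splice current + previous stage texts
    return canvas, text
```

A call such as `build_sample_pair(["start.png", "mid.png", "goal.png"], texts_so_far)` (file names hypothetical) would then yield one stitched sample image and its spliced sample text.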

Optionally, the contrastive learning loss function is determined based on a first similarity loss function of the visual encoder and a second similarity loss function of the text encoder.
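In symbols, one standard way to realize such a two-part objective (an assumption consistent with, though not dictated by, the text above) is the symmetric form below. With $B$ image-text pairs per batch, normalized features $v_i$ (visual) and $t_i$ (text), and a temperature $\tau$:

$$\mathcal{L}_{v}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(v_i^{\top}t_i/\tau)}{\sum_{j=1}^{B}\exp(v_i^{\top}t_j/\tau)},\qquad \mathcal{L}_{t}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(t_i^{\top}v_i/\tau)}{\sum_{j=1}^{B}\exp(t_i^{\top}v_j/\tau)},\qquad \mathcal{L}=\tfrac{1}{2}\left(\mathcal{L}_{v}+\mathcal{L}_{t}\right),$$

where $\mathcal{L}_{v}$ plays the role of the first (visual-encoder-side) similarity loss and $\mathcal{L}_{t}$ the second (text-encoder-side) similarity loss.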

Optionally, the navigation model training device further includes a classification pre-training unit configured to perform image classification pre-training on the visual encoder before the sample image and the sample text information of each sample image-text pair are input into the visual encoder and the text encoder of the navigation model to extract the sample image features and sample text features.

Optionally, the navigation model training device further includes a target detection pre-training unit configured to perform target detection pre-training on the visual encoder, based on a sample data set from the navigation domain, before the sample image and the sample text information of each sample image-text pair are input into the visual encoder and the text encoder of the navigation model to extract the sample image features and sample text features.

Optionally, the target detection pre-training unit is specifically configured to pre-train the visual encoder using top-view samples, together with the ground-truth target categories and ground-truth bounding-box coordinates of each target in the top-view samples, as labels.
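For illustration only, one top-view training sample carrying such labels might be organized as below; the field names and the normalized corner-coordinate box format are assumptions of this sketch, not the patent's data format.

```python
import torch

# One hypothetical top-view (bird's-eye) detection sample: the image tensor
# plus its labels, i.e. a ground-truth category index and ground-truth box
# coordinates for every target in the view.
sample = {
    "image": torch.zeros(3, 512, 512),             # top-view image, C x H x W
    "labels": torch.tensor([2, 0]),                # per-target class indices
    "boxes": torch.tensor([[0.10, 0.20, 0.35, 0.40],   # (x1, y1, x2, y2),
                           [0.55, 0.60, 0.80, 0.90]]), # normalized to [0, 1]
}

# A detection head on top of the visual encoder would be trained to predict
# ("labels", "boxes") from "image", e.g. with a classification loss on the
# categories and a regression or IoU loss on the box coordinates.
```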

FIG. 9 illustrates the physical structure of an electronic device. As shown in FIG. 9, the electronic device may include a processor 910, a communications interface 920, a memory 930, and a communication bus 940, where the processor 910, the communications interface 920, and the memory 930 communicate with one another via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to execute the navigation model training method, which includes the following steps.

Inputting the sample image and the sample text information of each sample image-text pair into the visual encoder and the text encoder of the navigation model, respectively, to extract sample image features and sample text features.

Substituting the sample image features and the sample text features corresponding to each sample image-text pair into the contrastive learning loss function, and completing the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges.

Training the navigation model based on the pre-trained visual encoder and text encoder.

In addition, when the logic instructions in the memory 930 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program, which may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to carry out the navigation model training method provided by the methods above, which includes the following steps.

Inputting the sample image and the sample text information of each sample image-text pair into the visual encoder and the text encoder of the navigation model, respectively, to extract sample image features and sample text features.

Substituting the sample image features and the sample text features corresponding to each sample image-text pair into the contrastive learning loss function, and completing the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges.

Training the navigation model based on the pre-trained visual encoder and text encoder.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the navigation model training method provided by the methods above, which includes the following steps.

Inputting the sample image and the sample text information of each sample image-text pair into the visual encoder and the text encoder of the navigation model, respectively, to extract sample image features and sample text features.

Substituting the sample image features and the sample text features corresponding to each sample image-text pair into the contrastive learning loss function, and completing the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges.

Training the navigation model based on the pre-trained visual encoder and text encoder.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above implementations, those skilled in the art can clearly understand that each implementation may be realized by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recited in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A navigation model training method, characterized by comprising:
inputting the sample image and the sample text information of each sample image-text pair into a visual encoder and a text encoder of a navigation model, respectively, to extract sample image features and sample text features;
substituting the sample image features and the sample text features corresponding to each sample image-text pair into a contrastive learning loss function, and completing the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges; and
training the navigation model based on the pre-trained visual encoder and text encoder;
wherein the sample image in the sample image-text pair is constructed from the images corresponding to waypoints in a preset sample route, the sample text information in the sample image-text pair is constructed at least from the sample description text corresponding to the starting waypoint of the navigation stage, within the sample route, in which the waypoint lies, and the navigation stages are delimited by human-machine dialogue waypoints preset in the sample route; each navigation stage in the sample route corresponds to one sample image-text pair, and the sample image in the sample image-text pair is formed by stitching together at least the images of the starting waypoint and the target waypoint of the corresponding navigation stage.

2. The navigation model training method according to claim 1, characterized in that the sample image in the sample image-text pair is formed by stitching together the images of the starting waypoint, the target waypoint, and at least one intermediate waypoint of the corresponding navigation stage, the intermediate waypoint being a waypoint between the starting waypoint and the target waypoint of the corresponding navigation stage.

3. The navigation model training method according to claim 1 or 2, characterized in that the sample text information in the sample image-text pair is formed by splicing together the sample description texts corresponding to the starting waypoints of the corresponding navigation stage and of the preceding navigation stages.

4. The navigation model training method according to claim 1, characterized in that the contrastive learning loss function is determined based on a first similarity loss function of the visual encoder and a second similarity loss function of the text encoder.

5. The navigation model training method according to claim 1, characterized by further comprising, before inputting the sample image and the sample text information of each sample image-text pair into the visual encoder and the text encoder of the navigation model to extract the sample image features and sample text features:
performing image classification pre-training on the visual encoder.

6. The navigation model training method according to claim 1, characterized by further comprising, before inputting the sample image and the sample text information of each sample image-text pair into the visual encoder and the text encoder of the navigation model to extract the sample image features and sample text features:
performing target detection pre-training on the visual encoder based on a sample data set from the navigation domain.

7. The navigation model training method according to claim 6, characterized in that performing target detection pre-training on the visual encoder based on a sample data set from the navigation domain comprises:
pre-training the visual encoder using top-view samples, together with the ground-truth target categories and ground-truth bounding-box coordinates of each target in the top-view samples, as labels.

8. A navigation model training device, characterized by comprising:
a pre-training feature extraction unit configured to input the sample image and the sample text information of each sample image-text pair into a visual encoder and a text encoder of a navigation model, respectively, to extract sample image features and sample text features;
a pre-training loss calculation unit configured to substitute the sample image features and the sample text features corresponding to each sample image-text pair into a contrastive learning loss function, and to complete the pre-training of the visual encoder and the text encoder when the contrastive learning loss function converges; and
a model training unit configured to train the navigation model based on the pre-trained visual encoder and text encoder;
wherein the sample image in the sample image-text pair is constructed from the images corresponding to waypoints in a preset sample route, the sample text information in the sample image-text pair is constructed at least from the sample description text corresponding to the starting waypoint of the navigation stage, within the sample route, in which the waypoint lies, and the navigation stages are delimited by human-machine dialogue waypoints preset in the sample route; each navigation stage in the sample route corresponds to one sample image-text pair, and the sample image in the sample image-text pair is formed by stitching together at least the images of the starting waypoint and the target waypoint of the corresponding navigation stage.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the navigation model training method according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the navigation model training method according to any one of claims 1 to 7.
GR01 Patent grant