CN114973402A - A visual language navigation system and method based on modal-aligned action prompts - Google Patents
A visual language navigation system and method based on modal-aligned action prompts
- Publication number
- CN114973402A (application CN202210467461.4A)
- Authority
- CN
- China
- Prior art keywords
- action
- prompt
- visual
- sub
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
Description
Technical Field
The present invention relates to the field of visual language navigation, and more particularly to a visual language navigation system and method based on modal-aligned action prompts.
Background Art
Visual language navigation is a challenging task that requires an embodied agent to navigate to a target location following natural-language instructions. For successful navigation, the agent must understand the intent of the given instruction, progressively ground the instruction in its surrounding observations, and make correct action decisions in sequence to move through a dynamically changing scene.
Early visual language navigation methods explored different data-augmentation strategies, efficient learning paradigms, and useful model architectures to improve agent performance. Inspired by the significant progress of large-scale cross-modal pre-training models on vision-and-language tasks, a growing body of work has attempted to introduce pre-training paradigms and models into visual language navigation. PREVALENT performs self-supervised pre-training on a large number of image-language-action triplets, and introducing a recurrent function into the pre-trained model gives the agent temporal awareness. Although object-level alignment ability may be significantly improved during pre-training, these agents still learn action-level modality alignment only implicitly, which largely limits the robustness of their action decisions across different scenes.
The prior art discloses a patent on a navigation method with vision-and-language multimodal fusion, belonging to the fields of robot navigation, natural language processing, and computer vision. In that patent, a binocular camera is first installed on a robot, and the robot is used to train a multimodal fusion neural network model; for any chosen real scene, a natural-language navigation instruction is issued to the robot and converted into a corresponding semantic vector; the RGB images captured by the robot at each moment are converted into corresponding features; the semantic vector and the RGB image features are fused to obtain the action feature at the current moment; after the action feature is corrected with a prompt, the neural network model outputs the robot's action at the current moment, and the robot executes the action until the navigation task is completed. However, that patent says little about how to build a more robust visual language navigation model that improves accuracy and generalization while remaining well interpretable.
Summary of the Invention
The present invention provides a visual language navigation system based on modal-aligned action prompts, which forces the agent to explicitly learn cross-modal action knowledge so as to improve action decisions during navigation.
Another object of the present invention is to provide a navigation method for the above system.
To achieve the above technical effects, the technical solution of the present invention is as follows:
A visual language navigation system based on modal-aligned action prompts, comprising:
an action prompt set generation module: an instruction is input to the action prompt set generation module, and before navigation starts the agent retrieves the action prompt set related to the instruction from an action prompt library;
a visual language navigation module with modal-aligned action prompts: the action prompt set passes through a prompt encoding module, and the output prompt features are concatenated with the instruction features output by a text encoding module; the resulting prompt-based instruction features and the visual features output by a visual encoding module are fed into a multi-layer transformer to make action decisions;
optimized learning modules, namely a modality alignment loss module and a sequential consistency loss module, for effective action prompt learning.
Further, the visual language navigation module with modal-aligned action prompts comprises:
a text encoding module, which receives language input, encodes it with a multi-layer transformer neural network, and obtains the corresponding feature vectors;
a prompt encoding module, which consists of two unimodal sub-prompt encoders and one multimodal prompt encoder; the image sub-prompts and text sub-prompts are passed through the corresponding unimodal sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multimodal prompt encoder to obtain the prompt features;
a visual encoding module, which receives visual observations, encodes them with a visual encoder, and obtains the corresponding feature vectors.
Further, the optimized learning module comprises:
a modality alignment loss module: when an action prompt already has matched image and text sub-prompts, an InfoNCE loss is used to align them in the feature space, making the action prompt more discriminative;
a sequential consistency loss module, which encourages the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved prompt set.
A visual language navigation method based on modal-aligned action prompts, comprising the following steps:
S1: at the start of navigation, the agent receives the instruction and, through the action prompt set generation module, retrieves the action prompt set related to the instruction from the action prompt library;
S2: through the visual encoding module and the text encoding module, neural networks encode the input image information and instruction information respectively, yielding the visual encodings, the instruction encoding, and the state feature;
S3: through the prompt encoding module, the image sub-prompts and text sub-prompts in the action prompt set are passed through the corresponding unimodal sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multimodal prompt encoder to obtain the prompt features;
S4: the instruction encoding and the prompt encodings are concatenated to obtain the prompt-based instruction features, and the state feature is concatenated with the visual encodings to obtain the state-visual features;
S5: in the visual language navigation module with modal-aligned action prompts, the state-visual features are updated by cross-modal attention between themselves and the prompt-based instruction features; this attention is decomposed into two parts, the first weighting the instruction encoding to update the state feature and the second weighting the image and text sub-prompt features to compute the sequential consistency loss; the state-visual features are then fed into another self-attention module, and the attention scores of the state feature over the visual features give the prompt-based action prediction probabilities;
S6: in the optimized learning module, the commonly used imitation learning loss and reinforcement learning loss are combined, in a weighted sum, with the modality alignment loss and the sequential consistency loss particular to the present invention to obtain the total training objective, which is used to update and optimize the model and improve the agent's navigation performance and generalization ability.
Further, step S1 comprises the following sub-steps:
S100: construction of the action prompt library. To align images with action phrases and form action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction, and for each visual object/location the related image and text sub-prompts are obtained separately. CLIP, which has excellent zero-shot cross-modal alignment ability, is used to locate the object/location-related images. To fit CLIP's inference procedure, the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with the visual object/location whose class label is c, and the probability that an image B in the action sequence belongs to class c is computed as
P(c | B) = exp(sim(b, w_c) / τ1) / Σ_{m=1}^{M} exp(sim(b, w_m) / τ1)
where τ1 is a temperature parameter, sim is the cosine similarity, b and w_c are the image feature and phrase feature generated by CLIP, respectively, and M is the size of the vocabulary; the image with the largest similarity to the phrase is then selected as the image sub-prompt (a code sketch of this CLIP-based selection is given after sub-step S101 below). To obtain the text sub-prompt, a simple nearest-verb search is used: the verb, taken from a pre-built verb vocabulary, that appears closest before a given object/location word is found. Finally, the image and text sub-prompts that share the same visual object/location and action form one aligned action prompt;
S101: retrieval of the action prompt set. At the start of navigation, the agent retrieves the action prompts related to the instruction from the action prompt library: the sentence similarity between the instruction and each object/location-related action phrase (text sub-prompt) in the prompt library is computed and used to retrieve the instruction-related action prompt set {p_n}, where N is the size of the set.
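The following is a minimal sketch of the CLIP-based image sub-prompt selection described in sub-step S100, written against OpenAI's open-source CLIP package. The model variant, the class vocabulary, the file paths, and the temperature value are illustrative assumptions and are not specified by the patent.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed CLIP variant

def phrase_similarities(image_file, class_vocab):
    """sim(b, w_c): cosine similarity between path image B and every class phrase."""
    tokens = clip.tokenize([f"a photo of {c}" for c in class_vocab]).to(device)
    image = preprocess(Image.open(image_file)).unsqueeze(0).to(device)
    with torch.no_grad():
        b = model.encode_image(image)          # image feature b
        w = model.encode_text(tokens)          # phrase features w_c for all M classes
    b = b / b.norm(dim=-1, keepdim=True)
    w = w / w.norm(dim=-1, keepdim=True)
    return (b @ w.T).squeeze(0)                # one similarity per class

def class_probabilities(image_file, class_vocab, tau1=0.01):
    """P(c | B): softmax over the M classes with temperature tau1, as in sub-step S100."""
    return (phrase_similarities(image_file, class_vocab) / tau1).softmax(dim=0)

def select_image_subprompt(path_image_files, target_class, class_vocab):
    """Pick the ground-truth path image most similar to the phrase of the target class."""
    c = class_vocab.index(target_class)
    sims = torch.stack([phrase_similarities(f, class_vocab)[c] for f in path_image_files])
    return path_image_files[int(sims.argmax())]
```

For example, with class_vocab = ["stairs", "window", "table"] and the ground-truth path images of one instruction, select_image_subprompt(path_images, "stairs", class_vocab) would return the view whose CLIP similarity to "a photo of stairs" is highest.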
Further, step S2 comprises the following sub-steps:
S200: encoding of the visual input. At time step t, for each image view O_{t,i} among the candidate views, a pre-trained convolutional neural network (CNN) or transformer is used to extract image features v_{t,i}, and v_{t,i} is then mapped to a visual encoding by the visual encoder F_v:
V_{t,i} = F_v(v_{t,i}; θ_v)
where θ_v are the parameters of F_v and the set of V_{t,i} represents the candidate visual encodings at time t;
S201: encoding of the language input. At initialization, the instruction encoding X and the initial state feature s_0 are obtained by feeding the instruction sequence I together with the [CLS] and [SEP] tokens to the self-attention module of the transformer:
s_0, X = SelfAttn(Concat([CLS], I, [SEP]); θ_s)
where Concat(·) denotes the concatenation operation and θ_s denotes the parameters of the self-attention module; s_0 is updated to s_t at time step t.
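As a compact illustration of sub-steps S200 and S201, the sketch below implements the visual encoder F_v and a transformer self-attention text encoder in PyTorch. The feature dimensions, the number of layers, and the use of nn.TransformerEncoder as the self-attention module are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """F_v: maps pre-extracted view features v_{t,i} to visual encodings V_{t,i}."""
    def __init__(self, feat_dim=2048, hidden_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.LayerNorm(hidden_dim))

    def forward(self, v):                # v: (num_candidate_views, feat_dim)
        return self.proj(v)              # V_{t,i}: one encoding per candidate view

class TextEncoder(nn.Module):
    """Self-attention over [CLS] + instruction tokens + [SEP]; the [CLS] output is s_0."""
    def __init__(self, vocab_size=30522, hidden_dim=768, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):        # token_ids: (batch, seq_len), [CLS] first, [SEP] last
        h = self.encoder(self.embed(token_ids))
        s0, X = h[:, 0], h[:, 1:]        # initial state feature s_0 and instruction encoding X
        return s0, X
```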
Further, step S3 comprises the following steps:
The prompt encodings are obtained from the retrieved action prompt set {p_n} by the prompt encoder, which consists of two unimodal sub-prompt encoders and one multimodal prompt encoder. Denoting the image sub-prompt and text sub-prompt of p_n by p_n^i and p_n^u, the sub-prompt features f_n^i and f_n^u are first obtained through the unimodal sub-prompt encoders:
f_n^i = E_i(p_n^i; θ_i),  f_n^u = E_u(p_n^u; θ_u)
where E_i(·) with parameters θ_i and E_u(·) with parameters θ_u denote the image sub-prompt encoder and the text sub-prompt encoder, respectively; f_n^i and f_n^u are then fed into the multimodal prompt encoder E_p(·) to obtain the prompt encoding
f_n^p = E_p(Concat(f_n^i, f_n^u); θ_p)
where θ_p are the parameters of E_p(·) and Concat(·) is the concatenation operation; the encoders E_i(·), E_u(·) and E_p(·) each consist of a linear layer followed by a dropout operation to reduce overfitting.
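A minimal sketch of the prompt encoder in step S3 follows: two unimodal sub-prompt encoders and one multimodal prompt encoder, each a linear layer followed by dropout as described above. The input dimensions and the dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden_dim=768, p_drop=0.1):
        super().__init__()
        # E_i and E_u: unimodal image/text sub-prompt encoders (linear layer + dropout)
        self.E_i = nn.Sequential(nn.Linear(img_dim, hidden_dim), nn.Dropout(p_drop))
        self.E_u = nn.Sequential(nn.Linear(txt_dim, hidden_dim), nn.Dropout(p_drop))
        # E_p: multimodal prompt encoder applied to the concatenated sub-prompt features
        self.E_p = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Dropout(p_drop))

    def forward(self, img_subprompts, txt_subprompts):
        # img_subprompts: (N, img_dim), txt_subprompts: (N, txt_dim) for the N retrieved prompts
        f_i = self.E_i(img_subprompts)                  # image sub-prompt features f_n^i
        f_u = self.E_u(txt_subprompts)                  # text sub-prompt features f_n^u
        f_p = self.E_p(torch.cat([f_i, f_u], dim=-1))   # prompt encodings f_n^p
        return f_i, f_u, f_p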
Further, step S4 comprises the following sub-step:
On the basis of the prompt encodings and the instruction encoding X, the prompt-based instruction features X_p are obtained by simply concatenating X with the prompt encodings.
Further, step S5 comprises the following sub-steps:
The state-visual features K_t are updated based on the cross-modal attention between K_t and X_p. The attention weights are then decomposed into the part over the instruction encoding and the part over the sub-prompt features, yielding different attention-enhanced features: the attended instruction features are obtained by weighting X, and the attention-enhanced image and text sub-prompt features are obtained by weighting f_n^i and f_n^u. The attention-enhanced image and text sub-prompt features are used to compute the sequential consistency loss L_c, while the attended instruction features, as in the baseline agent, are used to update the state feature. Finally, the state-visual features are fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e. the prompt-based action prediction probabilities.
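The split of the cross-modal attention in step S5 can be sketched as below. The single-head scaled dot-product attention and the feature shapes are illustrative assumptions; the patent only states that the attention weights over X_p are decomposed, by position, into a part over the instruction encoding and a part over the sub-prompt features.

```python
import torch
import torch.nn.functional as F

def decomposed_cross_modal_attention(K_t, X, f_i, f_u, f_p):
    """K_t: (n_vis, d) state-visual features; X: (L, d) instruction encoding;
    f_i, f_u, f_p: (N, d) image/text sub-prompt features and prompt encodings."""
    X_p = torch.cat([X, f_p], dim=0)                                 # prompt-based instruction features
    attn = F.softmax(K_t @ X_p.T / X_p.shape[-1] ** 0.5, dim=-1)     # cross-modal attention weights
    attn_X, attn_P = attn[:, :X.shape[0]], attn[:, X.shape[0]:]      # split by position
    attended_instr = attn_X @ X    # weights over X -> attended instruction features (state update)
    attended_img = attn_P @ f_i    # weights over prompts applied to image sub-prompt features
    attended_txt = attn_P @ f_u    # ... and to text sub-prompt features (for the consistency loss)
    return attended_instr, attended_img, attended_txt
```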
Further, step S6 comprises the following sub-steps:
S600: modality alignment loss, which encourages the matched image and text sub-prompts of an action prompt to be aligned in the feature space. Following the contrastive learning paradigm used in CLIP, paired image and text features are made similar while unpaired image and text features are pushed apart; an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts of each action prompt:
L_a = -(1/N) Σ_{n=1}^{N} log [ exp(sim(f_n^i, f_n^u) / τ2) / Σ_{m=1}^{N} exp(sim(f_n^i, f_m^u) / τ2) ]
where τ2 is a temperature parameter, f_n^i and f_n^u denote the features of the paired image and text sub-prompts of action prompt p_n, and f_m^u (m ≠ n) denotes an unpaired sub-prompt feature; through the modality alignment loss, the action prompts become more discriminative, thereby guiding the learning of action-level modality alignment;
S601: sequential consistency loss. Since an instruction usually points to different visual landmarks in order, the action prompts in the retrieved set {p_n} are likewise related to different objects/locations. To encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved set, a sequential consistency loss is proposed as the sum of two unimodal consistency losses: at each time step t, the attention-enhanced text sub-prompt features and the attention-enhanced instruction features must be close, and a similar term is defined on the image side to increase the similarity between the attention-enhanced image sub-prompt features and the attention-enhanced visual features; the sequential consistency loss L_c is the sum of these two terms;
S602: total objective. Together with the navigation loss L_n, namely the imitation learning loss L_IL and the reinforcement learning loss L_RL, the total training objective is:
L = L_RL + λ1·L_IL + λ2·L_c + λ3·L_a
where λ1, λ2 and λ3 are loss weights that balance the losses.
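A minimal sketch of the losses in step S6 follows. The InfoNCE form of the modality alignment loss and the cosine-distance form of the two consistency terms are reconstructions from the description above, and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def modality_alignment_loss(f_i, f_u, tau2=0.07):
    """L_a: InfoNCE over the N retrieved prompts; paired image/text sub-prompt features
    are pulled together and unpaired ones pushed apart."""
    f_i = F.normalize(f_i, dim=-1)
    f_u = F.normalize(f_u, dim=-1)
    logits = f_i @ f_u.T / tau2                               # (N, N) similarity matrix
    targets = torch.arange(f_i.shape[0], device=f_i.device)   # diagonal pairs are the positives
    return F.cross_entropy(logits, targets)

def sequential_consistency_loss(att_txt, att_instr, att_img, att_vis):
    """L_c: sum of two unimodal terms keeping attended text sub-prompt features close to the
    attended instruction features, and attended image sub-prompt features close to the
    attended visual features (cosine distance assumed here)."""
    l_txt = 1.0 - F.cosine_similarity(att_txt, att_instr, dim=-1).mean()
    l_img = 1.0 - F.cosine_similarity(att_img, att_vis, dim=-1).mean()
    return l_txt + l_img

def total_loss(L_RL, L_IL, L_c, L_a, lam1=0.2, lam2=0.1, lam3=0.1):
    """L = L_RL + lam1 * L_IL + lam2 * L_c + lam3 * L_a (weights are placeholders)."""
    return L_RL + lam1 * L_IL + lam2 * L_c + lam3 * L_a
```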
Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows:
The present invention proposes modal-aligned action prompts that force the agent to explicitly learn cross-modal action knowledge, improving action decisions during navigation, and develops prompt-based navigation for the visual language navigation task. A modality alignment loss and a sequential consistency loss are developed for effective action prompt learning, and the contrastive language-image pre-training (CLIP) model is used to guarantee the quality of the action prompts. The method effectively improves agent navigation performance on R2R and RxR and has good interpretability and generalization ability.
Description of the Drawings
Fig. 1 is an architecture diagram of the visual language navigation system based on modal-aligned action prompts of the present invention;
Fig. 2 is a flow chart of the steps of the visual language navigation method based on modal-aligned action prompts of the present invention;
Fig. 3 is an example diagram of the visual language navigation module with modal-aligned action prompts in a specific embodiment of the present invention;
Fig. 4 is an example diagram of the construction of the action prompt library by the action prompt set generation module in a specific embodiment of the present invention;
Fig. 5 is a comparison of sample navigation results of the visual language navigation method and the baseline method in a specific embodiment of the present invention.
Detailed Description of the Embodiments
The accompanying drawings are for illustration only and should not be construed as limiting this patent;
to better illustrate the embodiments, some parts in the drawings may be omitted, enlarged, or reduced and do not represent the dimensions of the actual product;
it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The technical solutions of the present invention are further described below with reference to the accompanying drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a visual language navigation system based on modal-aligned action prompts comprises:
an action prompt collection module 10: to build a high-quality action prompt library, the recently developed contrastive language-image pre-training (CLIP) model, which has strong cross-modal object/location-level alignment ability, is used to locate object/location-related images. To better align images with action phrases and form action prompts, a two-branch scheme is designed to collect image and text sub-prompts: first, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction; then, for each visual object/location, the related image and text sub-prompts are obtained separately. When navigation starts, the instruction is input to the action prompt set generation module, and the agent retrieves the action prompts related to the instruction from the pre-built action prompt library to form the action prompt set;
a visual language navigation module 11 with modal-aligned action prompts, which obtains the prompt features through a prompt encoder and concatenates them with the instruction features output by the text encoding module to obtain the prompt-based instruction features; these features, together with the visual features output by the visual encoding module, are fed into a multi-layer transformer to make action decisions;
optimized learning modules 12, namely a modality alignment loss module and a sequential consistency loss module, for effective action prompt learning.
In a specific embodiment of the present invention, the visual language navigation module 11 with modal-aligned action prompts further comprises:
a text encoding module 110, which receives language input, encodes it with a self-supervised neural network, and obtains the corresponding text feature vectors and state feature;
a prompt encoding module 111, which consists of two unimodal sub-prompt encoders and one multimodal prompt encoder; the image sub-prompts and text sub-prompts are passed through the corresponding unimodal sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multimodal prompt encoder to obtain the prompt features;
a visual encoding module 112, which receives visual observations, encodes them with a pre-trained visual feature encoder, and obtains the corresponding feature vectors.
In a specific embodiment of the present invention, the optimized learning module 12 further comprises:
a modality alignment loss module 120: even when an action prompt already has matched image and text sub-prompts, they may not be aligned in the feature space; to solve this, the contrastive learning paradigm used in CLIP is followed, making paired image and text features similar while pushing unpaired image and text features apart, and an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts of each action prompt. Through the modality alignment loss, the action prompts become more discriminative, thereby guiding the learning of action-level modality alignment;
a sequential consistency loss module 121: since an instruction usually points to different visual landmarks in order, the action prompts in the retrieved set are likewise related to different objects/locations; to encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved set, a sequential consistency loss is proposed as the sum of two unimodal consistency losses. Taking the text modality as an example, at each time step t the text sub-prompt features and the guidance (instruction) features must be close; a similar loss is defined for the image modality to increase the similarity between the image sub-prompt features and the visual features.
Embodiment 2
As shown in Fig. 2, a visual language navigation method based on modal-aligned action prompts comprises the following steps:
Step S1: retrieve the related action prompt set according to the input instruction information.
Specifically, step S1 further comprises:
Step S100: construction of the action prompt library. To better align images with action phrases and form action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction. Then, for each visual object/location, the related image and text sub-prompts are obtained separately, as described below.
Note that the ground-truth path sequence contains a collection of single-view images, each representing the action to be taken at a specific time step. Therefore, to derive the image sub-prompt of an action prompt, only the object/location-related image, which itself contains the action information, is retrieved from the ground-truth path sequence. Rather than resorting to existing object classifiers or detectors trained on a fixed set of object categories, CLIP, with its excellent zero-shot cross-modal alignment ability, is used to locate the object/location-related images. To fit CLIP's inference procedure, the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with the visual object/location whose class label is c. The probability that an image B in the action sequence belongs to class c is computed as
P(c | B) = exp(sim(b, w_c) / τ1) / Σ_{m=1}^{M} exp(sim(b, w_m) / τ1)
where τ1 is a temperature parameter, sim is the cosine similarity, b and w_c are the image feature and phrase feature generated by CLIP, respectively, and M is the size of the vocabulary; the image with the largest similarity to the phrase is then selected as the image sub-prompt.
To obtain the text sub-prompt, a simple nearest-verb search is used: the verb (from a pre-built verb vocabulary) closest before a given object/location word is found. Finally, the image and text sub-prompts that share the same visual object/location and action form one aligned action prompt.
Step S101: retrieval of the action prompt set. At the start of navigation, the agent retrieves the action prompts related to the instruction from the action prompt library. The sentence similarity between the instruction and each object/location-related action phrase (text sub-prompt) in the prompt library is computed and used to retrieve the instruction-related action prompt set {p_n}, where N is the size of the set.
Step S2: encode the input image information and instruction information with neural networks.
Specifically, step S2 further comprises:
Step S200: encoding of the visual input. At time step t, for each image view O_{t,i} among the candidate views, a pre-trained convolutional neural network (CNN) or transformer is used to extract image features v_{t,i}, and v_{t,i} is then mapped to a visual encoding by the visual encoder F_v:
V_{t,i} = F_v(v_{t,i}; θ_v)
where θ_v are the parameters of F_v and the set of V_{t,i} represents the candidate visual encodings at time t.
Step S201: encoding of the language input. At initialization, the instruction encoding X and the initial state feature s_0 are obtained by feeding the instruction sequence I together with the [CLS] and [SEP] tokens to the self-attention module of the transformer:
s_0, X = SelfAttn(Concat([CLS], I, [SEP]); θ_s)
where Concat(·) denotes the concatenation operation and θ_s denotes the parameters of the self-attention module; s_0 is updated to s_t at time step t.
Step S3: encode the action prompt set with the prompt encoder. The prompt encodings are obtained from the retrieved action prompt set {p_n} by the prompt encoder, which consists of two unimodal sub-prompt encoders and one multimodal prompt encoder. Denoting the image sub-prompt and text sub-prompt of p_n by p_n^i and p_n^u, the sub-prompt features f_n^i and f_n^u are first obtained through the unimodal sub-prompt encoders:
f_n^i = E_i(p_n^i; θ_i),  f_n^u = E_u(p_n^u; θ_u)
where E_i(·) with parameters θ_i and E_u(·) with parameters θ_u denote the image sub-prompt encoder and the text sub-prompt encoder, respectively; f_n^i and f_n^u are then fed into the multimodal prompt encoder E_p(·) to obtain the prompt encoding
f_n^p = E_p(Concat(f_n^i, f_n^u); θ_p)
where θ_p are the parameters of E_p(·) and Concat(·) is the concatenation operation; the encoders E_i(·), E_u(·) and E_p(·) each consist of a linear layer followed by a dropout operation to reduce overfitting.
Step S4: on the basis of the prompt encodings and the instruction encoding X, the prompt-based instruction features X_p are obtained by simply concatenating X with the prompt encodings.
Step S5: the state-visual features K_t are updated based on the cross-modal attention between K_t and X_p. The attention weights are then decomposed into the part over the instruction encoding and the part over the sub-prompt features, yielding different attention-enhanced features: the attended instruction features are obtained by weighting X, and the attention-enhanced image and text sub-prompt features are obtained by weighting f_n^i and f_n^u. The attention-enhanced image and text sub-prompt features are used to compute the sequential consistency loss L_c, while the attended instruction features, as in the baseline agent, are used to update the state feature. Finally, the state-visual features are fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e. the prompt-based action prediction probabilities.
Step S6: compute the weighted sum of the losses as the total training objective, and update and optimize the model to improve the agent's navigation performance and generalization ability.
Specifically, step S6 further comprises:
Step S600: modality alignment loss, which encourages the matched image and text sub-prompts of an action prompt to be aligned in the feature space. Following the contrastive learning paradigm used in CLIP, paired image and text features are made similar while unpaired image and text features are pushed apart; an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts of each action prompt:
L_a = -(1/N) Σ_{n=1}^{N} log [ exp(sim(f_n^i, f_n^u) / τ2) / Σ_{m=1}^{N} exp(sim(f_n^i, f_m^u) / τ2) ]
where τ2 is a temperature parameter, f_n^i and f_n^u denote the features of the paired image and text sub-prompts of action prompt p_n, and f_m^u (m ≠ n) denotes an unpaired sub-prompt feature; through the modality alignment loss, the action prompts become more discriminative, thereby guiding the learning of action-level modality alignment.
Step S601: sequential consistency loss. Since an instruction usually points to different visual landmarks in order, the action prompts in the retrieved set {p_n} are likewise related to different objects/locations. To encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved set, a sequential consistency loss is proposed as the sum of two unimodal consistency losses: at each time step t, the attention-enhanced text sub-prompt features and the attention-enhanced instruction features must be close, and a similar term is defined on the image side to increase the similarity between the attention-enhanced image sub-prompt features and the attention-enhanced visual features; the sequential consistency loss L_c is the sum of these two terms.
Step S602: total objective. Together with the navigation loss L_n, namely the imitation learning loss L_IL and the reinforcement learning loss L_RL, the total training objective is:
L = L_RL + λ1·L_IL + λ2·L_c + λ3·L_a
where λ1, λ2 and λ3 are loss weights that balance the losses.
Embodiment 3
As shown in Fig. 1, a visual language navigation system based on modal-aligned action prompts comprises:
an action prompt collection module 10: to build a high-quality action prompt library, the recently developed contrastive language-image pre-training (CLIP) model, which has strong cross-modal object/location-level alignment ability, is used to locate object/location-related images. To better align images with action phrases and form action prompts, a two-branch scheme is designed to collect image and text sub-prompts: first, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction; then, for each visual object/location, the related image and text sub-prompts are obtained separately. When navigation starts, the instruction is input to the action prompt set generation module, and the agent retrieves the action prompts related to the instruction from the pre-built action prompt library to form the action prompt set;
a visual language navigation module 11 with modal-aligned action prompts, which obtains the prompt features through a prompt encoder and concatenates them with the instruction features output by the text encoding module to obtain the prompt-based instruction features; these features, together with the visual features output by the visual encoding module, are fed into a multi-layer transformer to make action decisions;
optimized learning modules 12, namely a modality alignment loss module and a sequential consistency loss module, for effective action prompt learning.
In a specific embodiment of the present invention, the visual language navigation module 11 with modal-aligned action prompts further comprises:
a text encoding module 110, which receives language input, encodes it with a self-supervised neural network, and obtains the corresponding text feature vectors and state feature;
a prompt encoding module 111, which consists of two unimodal sub-prompt encoders and one multimodal prompt encoder; the image sub-prompts and text sub-prompts are passed through the corresponding unimodal sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multimodal prompt encoder to obtain the prompt features;
a visual encoding module 112, which receives visual observations, encodes them with a pre-trained visual feature encoder, and obtains the corresponding feature vectors.
In a specific embodiment of the present invention, the optimized learning module 12 further comprises:
a modality alignment loss module 120: even when an action prompt already has matched image and text sub-prompts, they may not be aligned in the feature space; to solve this, the contrastive learning paradigm used in CLIP is followed, making paired image and text features similar while pushing unpaired image and text features apart, and an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts of each action prompt. Through the modality alignment loss, the action prompts become more discriminative, thereby guiding the learning of action-level modality alignment;
a sequential consistency loss module 121: since an instruction usually points to different visual landmarks in order, the action prompts in the retrieved set are likewise related to different objects/locations; to encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved set, a sequential consistency loss is proposed as the sum of two unimodal consistency losses. Taking the text modality as an example, at each time step t the text sub-prompt features and the guidance (instruction) features must be close; a similar loss is defined for the image modality to increase the similarity between the image sub-prompt features and the visual features.
As shown in Fig. 2, the navigation method of the above visual language navigation system based on modal-aligned action prompts comprises the following steps:
Step S1: retrieve the related action prompt set according to the input instruction information.
Specifically, step S1 further comprises:
Step S100: construction of the action prompt library. To better align images with action phrases and form action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction. Then, for each visual object/location, the related image and text sub-prompts are obtained separately, as described below.
Note that the ground-truth path sequence contains a collection of single-view images, each representing the action to be taken at a specific time step. Therefore, to derive the image sub-prompt of an action prompt, only the object/location-related image, which itself contains the action information, is retrieved from the ground-truth path sequence. Rather than resorting to existing object classifiers or detectors trained on a fixed set of object categories, CLIP, with its excellent zero-shot cross-modal alignment ability, is used to locate the object/location-related images. To fit CLIP's inference procedure, the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with the visual object/location whose class label is c. The probability that an image B in the action sequence belongs to class c is computed as
P(c | B) = exp(sim(b, w_c) / τ1) / Σ_{m=1}^{M} exp(sim(b, w_m) / τ1)
where τ1 is a temperature parameter, sim is the cosine similarity, b and w_c are the image feature and phrase feature generated by CLIP, respectively, and M is the size of the vocabulary; the image with the largest similarity to the phrase is then selected as the image sub-prompt.
To obtain the text sub-prompt, a simple nearest-verb search is used: the verb (from a pre-built verb vocabulary) closest before a given object/location word is found. Finally, the image and text sub-prompts that share the same visual object/location and action form one aligned action prompt.
Step S101: retrieval of the action prompt set. At the start of navigation, the agent retrieves the action prompts related to the instruction from the action prompt library. The sentence similarity between the instruction and each object/location-related action phrase (text sub-prompt) in the prompt library is computed and used to retrieve the instruction-related action prompt set {p_n}, where N is the size of the set.
Step S2: encode the input image information and instruction information with neural networks.
Specifically, step S2 further comprises:
Step S200: encoding of the visual input. At time step t, for each image view O_{t,i} among the candidate views, a pre-trained convolutional neural network (CNN) or transformer is used to extract image features v_{t,i}, and v_{t,i} is then mapped to a visual encoding by the visual encoder F_v:
V_{t,i} = F_v(v_{t,i}; θ_v)
where θ_v are the parameters of F_v and the set of V_{t,i} represents the candidate visual encodings at time t.
Step S201: encoding of the language input. At initialization, the instruction encoding X and the initial state feature s_0 are obtained by feeding the instruction sequence I together with the [CLS] and [SEP] tokens to the self-attention module of the transformer:
s_0, X = SelfAttn(Concat([CLS], I, [SEP]); θ_s)
where Concat(·) denotes the concatenation operation and θ_s denotes the parameters of the self-attention module; s_0 is updated to s_t at time step t.
Step S3: encode the action prompt set with the prompt encoder. The prompt encodings are obtained from the retrieved action prompt set {p_n} by the prompt encoder, which consists of two unimodal sub-prompt encoders and one multimodal prompt encoder. Denoting the image sub-prompt and text sub-prompt of p_n by p_n^i and p_n^u, the sub-prompt features f_n^i and f_n^u are first obtained through the unimodal sub-prompt encoders:
f_n^i = E_i(p_n^i; θ_i),  f_n^u = E_u(p_n^u; θ_u)
where E_i(·) with parameters θ_i and E_u(·) with parameters θ_u denote the image sub-prompt encoder and the text sub-prompt encoder, respectively; f_n^i and f_n^u are then fed into the multimodal prompt encoder E_p(·) to obtain the prompt encoding
f_n^p = E_p(Concat(f_n^i, f_n^u); θ_p)
where θ_p are the parameters of E_p(·) and Concat(·) is the concatenation operation; the encoders E_i(·), E_u(·) and E_p(·) each consist of a linear layer followed by a dropout operation to reduce overfitting.
Step S4: on the basis of the prompt encodings and the instruction encoding X, the prompt-based instruction features X_p are obtained by simply concatenating X with the prompt encodings.
Step S5: the state-visual features K_t are updated based on the cross-modal attention between K_t and X_p. The attention weights are then decomposed into the part over the instruction encoding and the part over the sub-prompt features, yielding different attention-enhanced features: the attended instruction features are obtained by weighting X, and the attention-enhanced image and text sub-prompt features are obtained by weighting f_n^i and f_n^u. The attention-enhanced image and text sub-prompt features are used to compute the sequential consistency loss L_c, while the attended instruction features, as in the baseline agent, are used to update the state feature. Finally, the state-visual features are fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e. the prompt-based action prediction probabilities.
Step S6: compute the weighted sum of the losses as the total training objective, and update and optimize the model to improve the agent's navigation performance and generalization ability.
Specifically, step S6 further comprises:
Step S600: modality alignment loss, which encourages the matched image and text sub-prompts of an action prompt to be aligned in the feature space. Following the contrastive learning paradigm used in CLIP, paired image and text features are made similar while unpaired image and text features are pushed apart; an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts of each action prompt:
L_a = -(1/N) Σ_{n=1}^{N} log [ exp(sim(f_n^i, f_n^u) / τ2) / Σ_{m=1}^{N} exp(sim(f_n^i, f_m^u) / τ2) ]
where τ2 is a temperature parameter, f_n^i and f_n^u denote the features of the paired image and text sub-prompts of action prompt p_n, and f_m^u (m ≠ n) denotes an unpaired sub-prompt feature; through the modality alignment loss, the action prompts become more discriminative, thereby guiding the learning of action-level modality alignment.
Step S601: sequential consistency loss. Since an instruction usually points to different visual landmarks in order, the action prompts in the retrieved set {p_n} are likewise related to different objects/locations. To encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved set, a sequential consistency loss is proposed as the sum of two unimodal consistency losses: at each time step t, the attention-enhanced text sub-prompt features and the attention-enhanced instruction features must be close, and a similar term is defined on the image side to increase the similarity between the attention-enhanced image sub-prompt features and the attention-enhanced visual features; the sequential consistency loss L_c is the sum of these two terms.
Step S602: total objective. Together with the navigation loss L_n, namely the imitation learning loss L_IL and the reinforcement learning loss L_RL, the total training objective is:
L = L_RL + λ1·L_IL + λ2·L_c + λ3·L_a
where λ1, λ2 and λ3 are loss weights that balance the losses.
Fig. 3 is an example diagram of the visual language navigation module with modal-aligned action prompts in a specific embodiment of the present invention.
This figure shows a comparison of the action decisions made by the baseline agent and by the present invention. With the help of the action prompt related to "walk to the stairs", the present invention selects the correct action and navigates successfully under the given observation.
Fig. 4 is an example diagram of the construction of the action prompt library by the action prompt set generation module in a specific embodiment of the present invention.
The present invention uses a two-branch scheme to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, the recently developed contrastive language-image pre-training (CLIP) model, which has strong cross-modal object/location-level alignment ability, is adopted, and the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with the visual object/location whose class label is c. The probability that an image B in the action sequence belongs to class c is computed, and the image with the largest similarity to the phrase is selected as the image sub-prompt. For the text sub-prompt, a nearest-verb search is used, i.e. the verb (from a pre-built verb vocabulary) closest before a given object/location word is found.
Fig. 5 shows a comparison of sample navigation results of the visual language navigation method of the present invention and the baseline method in a specific embodiment. By introducing action prompts, the present invention makes accurate action decisions and completes navigation successfully: with the help of the action prompt related to "walk past the window", it performs the correct "walk past the window" action in the first two navigation steps, whereas the baseline agent fails to perform this action during navigation, resulting in a wrong trajectory.
The same or similar reference numerals correspond to the same or similar parts;
the positional relationships described in the drawings are for illustration only and should not be construed as limiting this patent;
obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly and are not intended to limit its implementations. For those of ordinary skill in the art, other changes or variations in different forms can be made on the basis of the above description; it is neither necessary nor possible to enumerate all implementations here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210467461.4A CN114973402B (en) | 2022-04-29 | 2022-04-29 | A visual language navigation system and method based on action prompts of modal alignment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210467461.4A CN114973402B (en) | 2022-04-29 | 2022-04-29 | A visual language navigation system and method based on action prompts of modal alignment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114973402A true CN114973402A (en) | 2022-08-30 |
CN114973402B CN114973402B (en) | 2025-05-13 |
Family
ID=82980379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210467461.4A Active CN114973402B (en) | 2022-04-29 | 2022-04-29 | A visual language navigation system and method based on action prompts of modal alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973402B (en) |
2022-04-29: CN application CN202210467461.4A filed; granted as patent CN114973402B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324035A (en) * | 2011-08-19 | 2012-01-18 | 广东好帮手电子科技股份有限公司 | Method and system for applying mouth-shape assisted speech recognition technology in vehicle navigation |
US20200302510A1 (en) * | 2019-03-24 | 2020-09-24 | We.R Augmented Reality Cloud Ltd. | System, Device, and Method of Augmented Reality based Mapping of a Venue and Navigation within a Venue |
US20250029170A1 (en) * | 2019-03-24 | 2025-01-23 | We.R Augmented Reality Cloud Ltd. | Automatic Generation of In-Store Product Information and Navigation Guidance, Using Augmented Reality (AR) and a Vision-and-Language Model (VLM) and Multi-Modal Artificial Intelligence (AI) |
CN113792112A (en) * | 2020-07-31 | 2021-12-14 | 北京京东尚科信息技术有限公司 | Visual language task processing system, training method, device, equipment and medium |
CN113804200A (en) * | 2021-04-12 | 2021-12-17 | 之江实验室 | Visual language navigation system and method based on dynamic reinforcement command attack module |
CN113119138A (en) * | 2021-04-16 | 2021-07-16 | 中国科学技术大学 | Blind-aiding robot system and method based on Internet of things |
CN113784199A (en) * | 2021-09-10 | 2021-12-10 | 中国科学院计算技术研究所 | A system and method for generating video description text |
Non-Patent Citations (2)
Title |
---|
GHAREHBAGH, AK: "Real-time 3D Semantic Mapping based on Keyframes and Octomap for Autonomous Cobot", 2021 9TH INTERNATIONAL CONFERENCE ON CONTROL, MECHATRONICS AND AUTOMATION (ICCMA), 31 December 2021 (2021-12-31) * |
JIN JIE: "Visual language navigation algorithm based on cosine similarity", Laser & Optoelectronics Progress, 25 August 2021 (2021-08-25) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115587596A (en) * | 2022-10-10 | 2023-01-10 | 中国科学技术大学 | Visual language navigation method based on cross-modal semantic alignment pre-training and application |
CN115824213A (en) * | 2022-11-18 | 2023-03-21 | 天津大学 | A Visual Language Navigation Method Based on Follower Model |
CN117875407A (en) * | 2024-03-11 | 2024-04-12 | 中国兵器装备集团自动化研究所有限公司 | Multi-mode continuous learning method, device, equipment and storage medium |
CN117875407B (en) * | 2024-03-11 | 2024-06-04 | 中国兵器装备集团自动化研究所有限公司 | Multi-mode continuous learning method, device, equipment and storage medium |
CN119091329A (en) * | 2024-08-28 | 2024-12-06 | 中国石油大学(华东) | A fully autonomous navigation method for unmanned aerial vehicles with fine-grained environmental perception capabilities |
CN119245649A (en) * | 2024-09-24 | 2025-01-03 | 北京航空航天大学 | A visual language navigation method for unmanned aerial vehicles based on visual target reference guidance |
CN119807671A (en) * | 2025-03-13 | 2025-04-11 | 山东大学 | Visual language navigation method, system and medium based on motion feature alignment |
CN119807671B (en) * | 2025-03-13 | 2025-05-09 | 山东大学 | Visual language navigation method, system and medium based on motion feature alignment |
Also Published As
Publication number | Publication date |
---|---|
CN114973402B (en) | 2025-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114973402A (en) | A visual language navigation system and method based on modal-aligned action prompts | |
CN110188176B (en) | Deep learning neural network, and training and predicting method, system, device and medium | |
Bhunia et al. | Joint visual semantic reasoning: Multi-stage decoder for text recognition | |
WO2020140487A1 (en) | Speech recognition method for human-machine interaction of smart apparatus, and system | |
CN115964467A (en) | A Semantic-rich Dialogue Generation Method Integrating Visual Context | |
CN110990555B (en) | End-to-end retrieval type dialogue method and system and computer equipment | |
CN117571014B (en) | A visual language navigation method combining image description and text generation | |
CN112699682B (en) | Named entity identification method and device based on combinable weak authenticator | |
CN113535904B (en) | Aspect level emotion analysis method based on graph neural network | |
CN117421591A (en) | Multi-modal characterization learning method based on text-guided image block screening | |
CN115115913A (en) | A data processing method, device, electronic device and storage medium | |
CN114298158A (en) | A Multimodal Pre-training Method Based on Linear Combination of Graphics and Text | |
CN110188182A (en) | Model training method, dialogue generation method, device, equipment and medium | |
CN111967272B (en) | Visual dialogue generating system based on semantic alignment | |
WO2019235103A1 (en) | Question generation device, question generation method, and program | |
CN114398976A (en) | Machine reading comprehension method based on BERT and gated attention-enhanced network | |
CN113780059A (en) | Continuous sign language identification method based on multiple feature points | |
CN115082915B (en) | A visual-language navigation method for mobile robots based on multi-modal features | |
CN116010622A (en) | BERT knowledge graph completion method and system for fusion entity type | |
CN118820785A (en) | A visual language navigation method based on enhanced endpoint alignment to improve VLN-BERT | |
CN117669693A (en) | A knowledge distillation method and system based on multi-teacher multi-modal model | |
CN117428780A (en) | Robot motor skill learning method integrating text instruction and motion information | |
CN115953569A (en) | A One-Stage Visual Localization Model Construction Method Based on Multi-step Reasoning | |
CN113780350A (en) | Image description method based on ViLBERT and BiLSTM | |
CN116681087B (en) | An automatic question generation method based on multi-stage timing and semantic information enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||