
CN114973402A - A visual language navigation system and method based on modal-aligned action prompts - Google Patents

A visual language navigation system and method based on modal-aligned action prompts

Info

Publication number
CN114973402A
Authority
CN
China
Prior art keywords
action
prompt
visual
sub
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210467461.4A
Other languages
Chinese (zh)
Other versions
CN114973402B (en)
Inventor
梁小丹
聂云双
林冰倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University and Sun Yat Sen University Shenzhen Campus
Priority to CN202210467461.4A
Publication of CN114973402A
Application granted
Publication of CN114973402B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Multimedia (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual language navigation system and method based on modality-aligned action prompts. The system includes an action prompt set generation module: an instruction is input to this module, and before navigation begins the agent retrieves an action prompt set related to the instruction from an action prompt library. In the visual language navigation module with modality-aligned action prompts, the action prompt set passes through a prompt encoding module, and the output prompt features are concatenated with the instruction features output by the text encoding module; the prompt-based instruction features and the visual features output by the visual encoding module are fed to a multi-layer transformer for action decision-making. An optimization learning module, consisting of a modality alignment loss module and a sequential consistency loss module, enables effective action prompt learning. The invention mainly provides explicit modality-aligned action prompts to improve the accuracy of agent navigation and its generalization ability across different environments.

Description

A visual language navigation system and method based on modality-aligned action prompts

Technical Field

The present invention relates to the field of visual language navigation, and more particularly to a visual language navigation system and method based on modality-aligned action prompts.

Background

Visual language navigation is a challenging task that requires an embodied agent to navigate to a target location by following natural language instructions. For successful navigation, the agent should make correct action decisions in sequence to move through a dynamically changing scene, by understanding the intent of the given instruction and progressively grounding the instruction in its surrounding observations.

Early visual language navigation methods explored different data augmentation strategies, efficient learning paradigms, and useful model architectures to improve agent performance. Inspired by the significant progress of large-scale cross-modal pre-training models on vision-language tasks, a growing body of work has tried to introduce pre-training paradigms and models into visual language navigation. PREVALENT performs self-supervised pre-training of the model on a large number of image-language-action triplets; other work introduces a recurrent function into the pre-trained model to make the agent time-aware. Although object-level alignment ability may be significantly improved during pre-training, these agents still learn action-level modality alignment only implicitly, which largely limits the robustness of action decisions in different scenes.

The prior art discloses a patent for a navigation method with vision and language multimodal fusion, which belongs to the fields of robot navigation, natural language processing, and computer vision. In that patent, a binocular camera is first installed on a robot, and the robot is used to train a multimodal fusion neural network model. Any real scene is selected, a natural language navigation instruction is issued to the robot and converted into a corresponding semantic vector; the RGB images acquired by the robot at each moment are converted into corresponding features; the semantic vector and RGB image features are fused to obtain the action feature at the current moment; after the action feature is corrected using a prompt, the neural network model finally outputs the robot's action at the current moment, and the robot performs the action until the navigation task is completed. However, that patent says little about how to build a more robust visual language navigation model that improves accuracy and generalization ability while remaining well interpretable.

Summary of the Invention

The present invention provides a visual language navigation system based on modality-aligned action prompts, which forces the agent to explicitly learn cross-modal action knowledge in order to improve action decisions during navigation.

Another object of the present invention is to provide a navigation method for the above system.

To achieve the above technical effect, the technical solution of the present invention is as follows:

A visual language navigation system based on modality-aligned action prompts, comprising:

an action prompt set generation module: an instruction is input to the action prompt set generation module, and before navigation starts the agent retrieves an action prompt set related to the instruction from an action prompt library;

a visual language navigation module with modality-aligned action prompts: the action prompt set passes through a prompt encoding module, and the output prompt features are concatenated with the instruction features output by the text encoding module; the prompt-based instruction features and the visual features output by the visual encoding module are fed to a multi-layer transformer for action decision-making;

an optimization learning module, namely a modality alignment loss module and a sequential consistency loss module, which enables effective action prompt learning.

Further, the visual language navigation module with modality-aligned action prompts comprises:

a text encoding module, which receives language input, encodes it with a multi-layer transformer neural network, and obtains the corresponding feature vectors;

a prompt encoding module, which consists of two single-modality sub-prompt encoders and one multi-modal prompt encoder; the image sub-prompts and text sub-prompts are passed through the corresponding single-modality sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multi-modal prompt encoder to obtain the prompt features;

a visual encoding module, which receives visual observation input, encodes it with a visual encoder, and obtains the corresponding feature vectors.

Further, the optimization learning module comprises:

a modality alignment loss module: since an action prompt already has matched image and text sub-prompts, an InfoNCE loss is used to align them in the feature space, making the action prompts more discriminative;

a sequential consistency loss module, which encourages the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved prompt set.

A visual language navigation method based on modality-aligned action prompts, comprising the following steps:

S1: at the beginning of navigation, the agent obtains the instruction and, through the action prompt set generation module, retrieves an action prompt set related to the instruction from the action prompt library;

S2: through the visual encoding module and the text encoding module, neural networks encode the input image information and instruction information respectively, obtaining the visual encodings, the instruction encoding, and the state features;

S3: through the prompt encoding module, the image sub-prompts and text sub-prompts in the action prompt set are passed through the corresponding single-modality sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multi-modal prompt encoder to obtain the prompt features;

S4: concatenate the above instruction encoding and prompt encoding to obtain prompt-based instruction features, and concatenate the above state features with the visual encoding to obtain state-visual features;

S5: through the visual language navigation module with modality-aligned action prompts, the state-visual features are updated based on cross-modal attention between themselves and the prompt-based instruction features; this attention is decomposed into two parts: the first part re-weights the instruction encoding and is used to update the state features, and the second part re-weights the image and text sub-prompt features and is used to compute the sequential consistency loss; the state-visual features are then fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e. the prompt-based action prediction probabilities;

S6: through the optimization learning module, the commonly used imitation learning loss and reinforcement learning loss are combined with the modality alignment loss and sequential consistency loss specific to the present invention in a weighted sum to obtain the total training objective, and the model is updated and optimized to improve the agent's navigation performance and generalization ability.

Further, step S1 comprises the following sub-steps:

S100: construction of the action prompt library. To align images and action phrases into action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction, and for each visual object/location the related image and text sub-prompts are obtained. CLIP, which has excellent zero-shot cross-modal alignment ability, is used to locate object/location-related images. To fit CLIP's inference procedure, the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with the visual object/location whose class label is c. The probability that an image B in the action sequence belongs to class c is computed as:

P(c|B) = exp(sim(b, w_c)/τ_1) / Σ_{m=1}^{M} exp(sim(b, w_m)/τ_1)

where τ_1 is a temperature parameter, sim is the cosine similarity, b and w_c are the image features and phrase features generated by CLIP, respectively, and M is the size of the vocabulary. The image with the highest similarity to the phrase is then selected as the image sub-prompt. To obtain the text sub-prompt, a simple nearest-verb search scheme is used: find the nearest verb, from a pre-built verb vocabulary, that precedes a specific object/location word. Finally, the image and text sub-prompts that share the same visual object/location and action form an aligned action prompt;

S101: retrieval of the action prompt set. At the beginning of navigation, the agent retrieves the action prompts related to the instruction from the action prompt library. The sentence similarity between the instruction and each object/location-related action phrase (the text sub-prompts) in the prompt library is computed and used to retrieve the instruction-related action prompt set {p_n}_{n=1}^N, where N is the size of the set.

Further, step S2 comprises the following sub-steps:

S200: encoding of the visual input. At time step t, each image view O_{t,i} among the candidate views is passed through a pre-trained convolutional neural network (CNN) or transformer to extract image features v_{t,i}, and v_{t,i} is then mapped to a visual encoding by the visual encoder F_v:

V_{t,i} = F_v(v_{t,i}; θ_v)

where θ_v are the parameters of F_v, and the set {V_{t,i}} represents the candidate visual encodings at time t;

S201: encoding of the language input. At initialization, the instruction encoding X and the initial state feature s_0 are obtained by feeding the instruction sequence I together with the [CLS] and [SEP] tokens to the self-attention module of the transformer:

X, s_0 = SelfAttn(Concat([CLS], I, [SEP]); θ_s)

where Concat(·) denotes the concatenation operation and θ_s denotes the parameters of the self-attention module; s_0 will be updated to s_t at time step t.

Further, step S3 comprises the following steps:

The retrieved action prompt set {p_n}_{n=1}^N is passed through the prompt encoder to obtain the prompt encoding f_p. The prompt encoder consists of two single-modality sub-prompt encoders and one multi-modal prompt encoder; each action prompt p_n consists of an image sub-prompt p_n^i and a text sub-prompt p_n^u. The sub-prompt features f^i and f^u are first obtained through the single-modality sub-prompt encoders:

f^i = E_i(p^i; θ_i)

f^u = E_u(p^u; θ_u)

where E_i(·) with parameters θ_i and E_u(·) with parameters θ_u denote the image sub-prompt encoder and the text sub-prompt encoder, respectively. Then f^i and f^u are fed to the multi-modal prompt encoder E_p(·) to obtain the prompt encoding f_p:

f_p = E_p(Concat(f^i, f^u); θ_p)

where θ_p are the parameters of E_p(·) and Concat(·) is the concatenation operation. The encoders E_i(·), E_u(·), and E_p(·) each consist of a linear layer followed by a dropout operation to reduce overfitting.

Further, step S4 comprises the following sub-steps:

On the basis of the prompt encoding f_p and the instruction encoding X, the prompt-based instruction features X_p are obtained by simply concatenating X and f_p.

Further, step S5 comprises the following sub-steps:

The state-visual features K_t are updated based on the cross-modal attention between K_t and X_p. This attention is then decomposed into two parts to obtain different attention-enhanced features: the attended instruction feature is obtained by weighting the instruction encoding X with the first part, and the attention-enhanced image sub-prompt features and text sub-prompt features are obtained by weighting the sub-prompt features f^i and f^u with the second part. The attention-enhanced image and text sub-prompt features are used to compute the sequential consistency loss L_c, while, as in the baseline agent, the attended instruction feature is used to update the state feature. Finally, the updated state-visual features are fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e. the prompt-based action prediction probabilities.

Further, step S6 comprises the following sub-steps:

S600: modality alignment loss, which encourages the already matched image and text sub-prompts of an action prompt to be aligned in the feature space. Following the contrastive learning paradigm used in CLIP, paired image and text features are pulled together while unpaired image and text features are pushed apart; an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts in each action prompt:

L_a = -(1/N) Σ_{n=1}^{N} log( exp(sim(f_n^i, f_n^u)/τ_2) / Σ_{m=1}^{N} exp(sim(f_n^i, f_m^u)/τ_2) )

where τ_2 is a temperature parameter, f_n^i and f_n^u denote the features of the paired image and text sub-prompts of action prompt p_n, and f_m^u (m ≠ n) denotes a non-paired sub-prompt. Through the modality alignment loss, the action prompts become more discriminative, which guides the learning of action-level modality alignment;

S601: sequential consistency loss. Since an instruction usually refers to different visual landmarks in order, the action prompts in the retrieved set {p_n} are also related to different objects/locations. To encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved prompt set, a sequential consistency loss is proposed as the sum of two single-modality consistency losses. At each time step t, the attention-enhanced text sub-prompt feature and the attention-enhanced instruction feature are required to be close, giving a text-modality consistency loss L_c^u; an analogous image-modality loss L_c^i is defined to increase the similarity between the attention-enhanced image sub-prompt feature and the attention-enhanced visual feature. The sequential consistency loss L_c is then:

L_c = L_c^u + L_c^i

S602: the total objective uses the navigation losses L_n, namely the imitation learning loss L_IL and the reinforcement learning loss L_RL; the total training objective is:

L = L_RL + λ_1 L_IL + λ_2 L_c + λ_3 L_a

where λ_1, λ_2, and λ_3 are loss weights that balance the losses.

Compared with the prior art, the beneficial effects of the technical solution of the present invention are:

The present invention proposes modality-aligned action prompts that force the agent to explicitly learn cross-modal action knowledge and thereby improve action decisions during navigation, introducing prompt-based navigation to the visual language navigation task. A modality alignment loss and a sequential consistency loss are developed to enable effective learning of the action prompts, and the Contrastive Language-Image Pre-training (CLIP) model is used to guarantee the quality of the action prompts. The approach effectively improves agent navigation performance on the R2R and RxR benchmarks and has good interpretability and generalization ability.

Description of the Drawings

Fig. 1 is an architecture diagram of a visual language navigation system based on modality-aligned action prompts according to the present invention;

Fig. 2 is a flow chart of the steps of a visual language navigation method based on modality-aligned action prompts according to the present invention;

Fig. 3 is an example diagram of the visual language navigation module with modality-aligned action prompts in a specific embodiment of the present invention;

Fig. 4 is an example diagram of the construction of the action prompt library in the action prompt set generation module in a specific embodiment of the present invention;

Fig. 5 is a comparison of sample navigation results between the proposed visual language navigation method and the baseline method in a specific embodiment of the present invention.

Detailed Description

The accompanying drawings are for illustration only and should not be construed as limiting this patent;

To better illustrate the embodiments, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;

It will be understood by those skilled in the art that some well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

Example 1

As shown in Fig. 1, a visual language navigation system based on modality-aligned action prompts includes:

An action prompt collection module 10. To build a high-quality action prompt library, the recently developed Contrastive Language-Image Pre-training (CLIP) model, which has strong cross-modal object/location-level alignment ability, is adopted to locate object/location-related images. To better align images and action phrases into action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction. Then, for each visual object/location, the related image and text sub-prompts are obtained. When navigation starts, the instruction is input to the action prompt set generation module, and the agent retrieves the action prompts related to the instruction from the pre-built action prompt library to form the action prompt set.

A visual language navigation module 11 with modality-aligned action prompts, which obtains prompt features through a prompt encoder and concatenates them with the instruction features output by the text encoding module to obtain prompt-based instruction features. These features, together with the visual features output by the visual encoding module, are fed to a multi-layer transformer for action decision-making.

An optimization learning module 12, namely a modality alignment loss module and a sequential consistency loss module, which enables effective action prompt learning.

In a specific embodiment of the present invention, the visual language navigation module 11 with modality-aligned action prompts further includes:

A text encoding module 110, which receives language input, encodes it with a self-supervised neural network, and obtains the corresponding text feature vectors and state features.

A prompt encoding module 111, which consists of two single-modality sub-prompt encoders and one multi-modal prompt encoder; the image sub-prompts and text sub-prompts are passed through the corresponding single-modality sub-prompt encoders to obtain sub-prompt features, which are concatenated and fed into the multi-modal prompt encoder to obtain the prompt features.

A visual encoding module 112, which receives visual observation input, encodes it with a pre-trained visual feature encoder, and obtains the corresponding feature vectors.

In a specific embodiment of the present invention, the optimization learning module 12 further includes:

A modality alignment loss module 120. Although an action prompt already has matched image and text sub-prompts, they may not be aligned in the feature space. To address this, following the contrastive learning paradigm used in CLIP, paired image and text features are made similar while unpaired image and text features are pushed apart, and an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts in each action prompt. Through the modality alignment loss, the action prompts become more discriminative, which guides the learning of action-level modality alignment.

A sequential consistency loss module 121. Since an instruction usually refers to different visual landmarks in order, the action prompts in the retrieved action prompt set are also related to different objects/locations. To encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved prompt set, a sequential consistency loss is proposed as the sum of two single-modality consistency losses. Taking the text modality as an example, at each time step t the text sub-prompt features and the guidance features must be close; a similar loss is defined for the image modality to improve the similarity between the image sub-prompt features and the visual features.

Example 2

As shown in Fig. 2, a visual language navigation method based on modality-aligned action prompts includes the following steps:

Step S1: retrieve the related action prompt set according to the input instruction information.

Specifically, step S1 further includes:

Step S100, construction of the action prompt library. To better align images and action phrases into action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction-path instance in the training dataset, a pre-built visual object/location vocabulary is used to find the visual objects/locations mentioned in the instruction. Then, for each visual object/location, the related image and text sub-prompts are obtained, as described below.

Note that the ground-truth path sequence contains a collection of single-view images, each of which represents an action to be taken at a specific time step. Therefore, to derive the image sub-prompt of an action prompt, only object/location-related images, which themselves carry action information, are retrieved from the ground-truth path sequence. Rather than resorting to existing object classifiers or detectors trained on a fixed set of object categories, CLIP, with its excellent zero-shot cross-modal alignment ability, is used to locate object/location-related images. To fit CLIP's inference procedure, the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with the visual object/location whose class label is c. The probability that an image B in the action sequence belongs to class c is computed as:

P(c|B) = exp(sim(b, w_c)/τ_1) / Σ_{m=1}^{M} exp(sim(b, w_m)/τ_1)

where τ_1 is a temperature parameter, sim is the cosine similarity, b and w_c are the image features and phrase features generated by CLIP, respectively, and M is the size of the vocabulary. The image with the highest similarity to the phrase is then selected as the image sub-prompt.
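To make this selection step concrete, the sketch below scores each view image along a path against "a photo of {CLASS}" phrases and keeps the view most similar to the target class. It assumes the publicly released openai/clip package; the function name, variable names, and temperature value are illustrative rather than the patent's actual implementation.

```python
# Minimal sketch (not the patent's code): pick the image sub-prompt as the path view
# whose CLIP similarity to "a photo of {target_class}" is highest.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained CLIP

def select_image_subprompt(view_paths, class_labels, target_class, tau=0.01):
    """view_paths: image files along the ground-truth path; target_class: e.g. 'sofa'."""
    texts = clip.tokenize([f"a photo of {c}" for c in class_labels]).to(device)
    with torch.no_grad():
        w = model.encode_text(texts)
        w = w / w.norm(dim=-1, keepdim=True)                 # phrase features w_c
        best_score, best_view = -1.0, None
        for path in view_paths:
            img = preprocess(Image.open(path)).unsqueeze(0).to(device)
            b = model.encode_image(img)
            b = b / b.norm(dim=-1, keepdim=True)              # image feature b
            probs = (b @ w.T / tau).softmax(dim=-1)[0]        # P(c|B) over the vocabulary
            score = probs[class_labels.index(target_class)].item()
            if score > best_score:
                best_score, best_view = score, path
    return best_view, best_score
```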

To obtain the text sub-prompt, a simple nearest-verb search scheme is used: find the nearest verb (from a pre-built verb vocabulary) that precedes a specific object/location word. Finally, the image and text sub-prompts that share the same visual object/location and action form an aligned action prompt.
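The nearest-verb search itself can be illustrated in a few lines; the tokenization and the verb vocabulary below are placeholders for whatever resources are actually used.

```python
# Hypothetical sketch of the nearest-verb search: walk backwards from the
# object/location word and return the closest token found in a verb vocabulary.
def nearest_verb(tokens, object_index, verb_vocab):
    """tokens: instruction word list; object_index: position of the object/location word."""
    for i in range(object_index - 1, -1, -1):
        if tokens[i].lower() in verb_vocab:
            return tokens[i]
    return None

instruction = "walk past the sofa and stop at the door".split()
verbs = {"walk", "turn", "stop", "go", "enter", "exit"}
print(nearest_verb(instruction, instruction.index("sofa"), verbs))  # -> "walk"
```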

Step S101, retrieval of the action prompt set. At the beginning of navigation, the agent retrieves the action prompts related to the instruction from the action prompt library. The sentence similarity between the instruction and each object/location-related action phrase (the text sub-prompts) in the prompt library is computed and used to retrieve the instruction-related action prompt set {p_n}_{n=1}^N, where N is the size of the set.
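A sketch of this retrieval step, using a sentence-transformers encoder and cosine similarity as a stand-in for the sentence-similarity measure; the model name, data layout, and top-N cutoff are illustrative assumptions.

```python
# Hypothetical retrieval of the top-N action prompts whose text sub-prompts are most
# similar to the navigation instruction (sentence-transformers used as an example).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def retrieve_prompt_set(instruction, prompt_library, top_n=5):
    """prompt_library: list of dicts with a 'text_subprompt' field, e.g. 'walk past the sofa'."""
    phrases = [p["text_subprompt"] for p in prompt_library]
    inst_emb = model.encode(instruction, convert_to_tensor=True)
    phrase_emb = model.encode(phrases, convert_to_tensor=True)
    scores = util.cos_sim(inst_emb, phrase_emb)[0]            # similarity to every phrase
    top_idx = scores.topk(min(top_n, len(phrases))).indices   # indices of the N best matches
    return [prompt_library[i] for i in top_idx.tolist()]
```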

Step S2: encode the input image information and instruction information respectively through neural networks.

Specifically, step S2 further includes:

Step S200, encoding of the visual input. At time step t, each image view O_{t,i} among the candidate views is passed through a pre-trained convolutional neural network (CNN) or transformer to extract image features v_{t,i}, and v_{t,i} is then mapped to a visual encoding by the visual encoder F_v:

V_{t,i} = F_v(v_{t,i}; θ_v)

where θ_v are the parameters of F_v, and the set {V_{t,i}} represents the candidate visual encodings at time t.

Step S201, encoding of the language input. At initialization, the instruction encoding X and the initial state feature s_0 are obtained by feeding the instruction sequence I together with the [CLS] and [SEP] tokens to the self-attention module of the transformer:

X, s_0 = SelfAttn(Concat([CLS], I, [SEP]); θ_s)

where Concat(·) denotes the concatenation operation and θ_s denotes the parameters of the self-attention module; s_0 will be updated to s_t at time step t.
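To make these two encoding steps concrete, the following PyTorch sketch shows one plausible shape for the visual encoder F_v (a linear projection over pre-extracted view features) and for the text encoder (a small self-attention stack whose [CLS] output serves as s_0); all dimensions and module choices are illustrative assumptions, not the patent's exact architecture.

```python
# Illustrative-only encoders: F_v as a linear projection of pre-extracted view features,
# and a small transformer self-attention stack producing the instruction encoding X
# and the initial state feature s0 (taken from the [CLS] position).
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)   # F_v with parameters theta_v

    def forward(self, view_feats):                    # (num_views, feat_dim) from a CNN/ViT
        return self.proj(view_feats)                  # candidate visual encodings V_{t,i}

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)  # self-attention module

    def forward(self, token_ids):                     # (batch, seq_len) incl. [CLS]/[SEP]
        h = self.encoder(self.embed(token_ids))
        s0, X = h[:, 0], h[:, 1:]                     # state feature s0 and instruction encoding X
        return X, s0
```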

Step S3, encode the action prompt set with the prompt encoder. The retrieved action prompt set {p_n}_{n=1}^N is passed through the prompt encoder to obtain the prompt encoding f_p. The prompt encoder consists of two single-modality sub-prompt encoders and one multi-modal prompt encoder; each action prompt p_n consists of an image sub-prompt p_n^i and a text sub-prompt p_n^u. The sub-prompt features f^i and f^u are first obtained through the single-modality sub-prompt encoders:

f^i = E_i(p^i; θ_i)

f^u = E_u(p^u; θ_u)

where E_i(·) with parameters θ_i and E_u(·) with parameters θ_u denote the image sub-prompt encoder and the text sub-prompt encoder, respectively. Then f^i and f^u are fed to the multi-modal prompt encoder E_p(·) to obtain the prompt encoding f_p:

f_p = E_p(Concat(f^i, f^u); θ_p)

where θ_p are the parameters of E_p(·) and Concat(·) is the concatenation operation. The encoders E_i(·), E_u(·), and E_p(·) each consist of a linear layer followed by a dropout operation to reduce overfitting.
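A minimal sketch of a prompt encoder matching this description, with each of E_i, E_u, and E_p implemented as a linear layer followed by dropout; the feature dimensions and how the sub-prompts are featurized beforehand are assumptions.

```python
# Illustrative prompt encoder: two single-modality sub-prompt encoders and one
# multi-modal prompt encoder, each a linear layer followed by dropout.
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden_dim=768, p_drop=0.1):
        super().__init__()
        self.E_i = nn.Sequential(nn.Linear(img_dim, hidden_dim), nn.Dropout(p_drop))   # image sub-prompt encoder
        self.E_u = nn.Sequential(nn.Linear(txt_dim, hidden_dim), nn.Dropout(p_drop))   # text sub-prompt encoder
        self.E_p = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Dropout(p_drop))  # multi-modal encoder

    def forward(self, img_subprompts, txt_subprompts):
        # img_subprompts: (N, img_dim) features of the N retrieved image sub-prompts
        # txt_subprompts: (N, txt_dim) features of the matching action phrases
        f_i = self.E_i(img_subprompts)
        f_u = self.E_u(txt_subprompts)
        f_p = self.E_p(torch.cat([f_i, f_u], dim=-1))   # prompt encoding f_p
        return f_p, f_i, f_u
```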

Step S4: on the basis of the prompt encoding f_p and the instruction encoding X, the prompt-based instruction features X_p are obtained by simply concatenating X and f_p.

Step S5: the state-visual features K_t are updated based on the cross-modal attention between K_t and the prompt-based instruction features X_p. This attention is then decomposed into two parts to obtain different attention-enhanced features: the attended instruction feature is obtained by weighting the instruction encoding X with the first part, and the attention-enhanced image sub-prompt features and text sub-prompt features are obtained by weighting the sub-prompt features f^i and f^u with the second part. The attention-enhanced image and text sub-prompt features are used to compute the sequential consistency loss L_c, while, as in the baseline agent, the attended instruction feature is used to update the state feature. Finally, the updated state-visual features are fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e. the prompt-based action prediction probabilities.
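The decomposition of the cross-modal attention into an instruction part and a prompt part can be sketched as follows; the single-head scaled dot-product form, the way f_p is formed from the sub-prompt features, and the dimensions are illustrative assumptions.

```python
# Illustrative decomposition of cross-modal attention from state-visual features K_t
# over the prompt-based instruction features X_p = [X ; f_p]: the weights over the first
# L positions re-weight the instruction encoding X, and the weights over the remaining
# N positions re-weight the sub-prompt features (used for the consistency loss).
import torch
import torch.nn.functional as F

def decomposed_cross_attention(K_t, X, f_i, f_u):
    # K_t: (num_views, d) state-visual features; X: (L, d) instruction encoding
    # f_i, f_u: (N, d) image/text sub-prompt features (their mean stands in for f_p here)
    f_p = (f_i + f_u) / 2
    X_p = torch.cat([X, f_p], dim=0)                       # prompt-based instruction features
    attn = F.softmax(K_t @ X_p.T / K_t.shape[-1] ** 0.5, dim=-1)
    attn_X, attn_p = attn[:, : X.shape[0]], attn[:, X.shape[0]:]
    attended_instruction = attn_X @ X                      # used to update the state feature
    attended_img_prompt = attn_p @ f_i                     # used in the consistency loss
    attended_txt_prompt = attn_p @ f_u
    return attended_instruction, attended_img_prompt, attended_txt_prompt
```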

Step S6: compute the weighted sum of the losses as the total objective, and update and optimize the model to improve the agent's navigation performance and generalization ability.

Specifically, step S6 further includes:

Step S600, modality alignment loss, which encourages the already matched image and text sub-prompts of an action prompt to be aligned in the feature space. Following the contrastive learning paradigm used in CLIP, paired image and text features are pulled together while unpaired image and text features are pushed apart; an InfoNCE loss is used to promote feature alignment between the image and text sub-prompts in each action prompt:

L_a = -(1/N) Σ_{n=1}^{N} log( exp(sim(f_n^i, f_n^u)/τ_2) / Σ_{m=1}^{N} exp(sim(f_n^i, f_m^u)/τ_2) )

where τ_2 is a temperature parameter, f_n^i and f_n^u denote the features of the paired image and text sub-prompts of action prompt p_n, and f_m^u (m ≠ n) denotes a non-paired sub-prompt. Through the modality alignment loss, the action prompts become more discriminative, which guides the learning of action-level modality alignment.

Step S601, sequential consistency loss. Since an instruction usually refers to different visual landmarks in order, the action prompts in the retrieved action prompt set {p_n} are also related to different objects/locations. To encourage the agent to attend, in order and according to its observations, to the relevant action prompts in the retrieved prompt set, a sequential consistency loss is proposed as the sum of two single-modality consistency losses. At each time step t, the attention-enhanced text sub-prompt feature and the attention-enhanced instruction feature are required to be close, giving a text-modality consistency loss L_c^u; an analogous image-modality loss L_c^i is defined to increase the similarity between the attention-enhanced image sub-prompt feature and the attention-enhanced visual feature. The sequential consistency loss L_c is then:

L_c = L_c^u + L_c^i
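One plausible instantiation of the two single-modality consistency terms is a cosine-distance penalty between the attended features at each time step, summed as L_c = L_c^u + L_c^i; the exact distance used is not specified in the text, so this is an assumption.

```python
# Illustrative sequential consistency loss: the attended text sub-prompt feature should
# stay close to the attended instruction feature, and the attended image sub-prompt
# feature should stay close to the attended visual feature, averaged over time steps.
import torch
import torch.nn.functional as F

def sequential_consistency_loss(txt_prompt_feats, instr_feats, img_prompt_feats, vis_feats):
    # all arguments: (T, d) attention-enhanced features collected over T time steps
    l_text = (1 - F.cosine_similarity(txt_prompt_feats, instr_feats, dim=-1)).mean()
    l_image = (1 - F.cosine_similarity(img_prompt_feats, vis_feats, dim=-1)).mean()
    return l_text + l_image     # L_c = L_c^u + L_c^i
```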

Step S602: the total objective uses the navigation losses L_n, namely the imitation learning loss L_IL and the reinforcement learning loss L_RL; the total training objective is:

L = L_RL + λ_1 L_IL + λ_2 L_c + λ_3 L_a

where λ_1, λ_2, and λ_3 are loss weights that balance the losses.
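Combining the four terms into the training objective is then a weighted sum, as in the sketch below; the weight values are placeholders rather than the patent's tuned settings.

```python
# Illustrative training step combining the four terms into the total objective
# L = L_RL + lambda1*L_IL + lambda2*L_c + lambda3*L_a (weights are placeholders).
def training_step(optimizer, l_rl, l_il, l_c, l_a, lambda1=0.2, lambda2=0.5, lambda3=0.5):
    loss = l_rl + lambda1 * l_il + lambda2 * l_c + lambda3 * l_a
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```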

实施例3Example 3

如图1所示,一种基于模态对齐的动作提示的视觉语言导航系统,包括:As shown in Figure 1, a visual language navigation system based on modal-aligned action prompts includes:

动作提示收集模块10,为了制作高质量动作提示库,采用新近开发的具有强大的跨模态对象/位置级对齐能力的对比语言图像预训练CLIP模型,用于定位物体/位置相关的图像。为了更好的对齐图像和动作短语,形成动作提示符,设计一个两分支方案用来收集图像和文本子提示。首先,对于训练数据集中的一个指令路径实例,使用一个提前创建好的视觉物体/位置词汇表来查找指令中提及的视觉物体/位置。然后对于每个视觉物体/位置,分别获得相关的图像和文本子提示。导航开始时,输入指令到动作提示集产生模块,智能体从提前建好的动作提示库中检索与指令相关的动作提示,构成动作提示集。The action cue collection module 10, in order to produce a high-quality action cue library, adopts the newly developed contrastive language image pre-training CLIP model with powerful cross-modal object/position level alignment capability for locating object/position-related images. To better align images and action phrases to form action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction path instance in the training dataset, a pre-created visual object/location vocabulary is used to find the visual object/location mentioned in the instruction. Then for each visual object/location, the associated image and textual subcues are obtained respectively. When the navigation starts, input an instruction to the action prompt set generation module, and the agent retrieves the action prompt related to the instruction from the action prompt library built in advance to form an action prompt set.

模态对齐动作提示的视觉语言导航模块11,通过一个提示编码器来获取提示特征,与文本编码模块的输出指令特征连接在一起获得基于提示的指令特征。该特征和视觉编码模块的输出视觉特征被提供给多层transformer用来做动作决策。The visual language navigation module 11 of the modal alignment action prompt obtains the prompt feature through a prompt encoder, and is connected with the output instruction feature of the text encoding module to obtain the prompt-based instruction feature. This feature and the output visual features of the visual encoding module are provided to the multi-layer transformer for action decision-making.

优化学习模块12,即模态对齐损失模块和连续一致性损失模块,实现有效的动作提示学习。The optimization learning module 12, namely the modal alignment loss module and the continuous consistency loss module, realizes effective action cue learning.

在本发明具体实施例中,具体地,模态对齐动作提示的视觉语言导航模块11进一步包括:In a specific embodiment of the present invention, specifically, the visual language navigation module 11 of the modal alignment action prompt further includes:

文本编码模块110,模块接收语言信息的输入,利用自监督神经网络进行编码,获得相应的文本特征向量和状态特征。Text encoding module 110, the module receives the input of language information, uses self-supervised neural network for encoding, and obtains corresponding text feature vectors and state features.

提示编码模块111,该模块由两个单模态子提示编码器和一个多模态提示编码器组成,图像子提示和文本子提示分别通过对应的单模态自编码器得到子提示特征,连接以后输入进多模态提示编码器,获得提示特征。The prompt encoding module 111 is composed of two single-modality sub-prompt encoders and a multi-modal prompt encoder. The image sub-prompt and the text sub-prompt respectively obtain the sub-prompt features through the corresponding single-modal auto-encoder, and connect the Then input into the multi-modal prompt encoder to obtain prompt features.

视觉编码模块112,该模块接收视觉观察信息的输入,通过预训练的视觉特征编码器进行编码,获取对应的特征向量。Visual coding module 112, this module receives the input of visual observation information, performs coding through the pre-trained visual feature encoder, and obtains the corresponding feature vector.

在本发明具体实施例中,具体地,优化学习模块12进一步包括:In a specific embodiment of the present invention, specifically, the optimization learning module 12 further includes:

模态对齐损失模块120,当动作提示已经有匹配的图像和文本子提示,它们可能不会在特征空间中对齐。要解决这个问题,遵循CLIP中使用的对比学习范式,使成对的图像和文本特征相似,而不成对的图像和文本特征疏远,使用infoNCE损失以促进每个动作提示中图像和文本子提示的特征对齐。通过模态对齐损失,动作提示可以变得更加具有识别力,从而知道学习动作级别的模态对齐。Modal alignment loss module 120, when action cues already have matching image and text subcues, they may not be aligned in the feature space. To address this issue, following the contrastive learning paradigm used in CLIP to make pairs of image and text features similar, while unpaired image and text features are dissimilar, an infoNCE loss is used to facilitate the comprehension of image and text sub-cues in each action cue Feature alignment. With the modality alignment loss, action cues can be made more discriminative, knowing to learn action-level modal alignment.

连续一致性损失模块121,由于指令通常顺序地指向不同的视觉标志,因此检索到的动作提示集中的动作提示也与不同的物体/位置相关。为了促使智能体根据其观察,按顺序关注检索到的提示集中的相关动作提示,提出顺序一致性损失,即两个单模态一致性损失之和。以文本模态为例,在每个时间步骤t上,文本子提示特征以及指导特征必须接近;类似的损失定义在图像模态,用于提高图像子提示特征和视觉特征之间的相似性。In the continuous consistency loss module 121, since the instructions usually point to different visual landmarks sequentially, the action cues in the retrieved action cue set are also related to different objects/locations. To motivate the agent to sequentially focus on relevant action cues in the retrieved cue set according to its observations, a sequential consistency loss is proposed, which is the sum of two unimodal consistency losses. Taking the text modality as an example, at each time step t, the text sub-cue features as well as the guidance features must be close; a similar loss is defined in the image modality to improve the similarity between the image sub-cue features and visual features.

如图2所示,上述基于模态对齐的动作提示的视觉语言导航系统的导航方法,包括如下步骤:As shown in FIG. 2, the above-mentioned navigation method of the visual language navigation system based on the action prompt of modal alignment includes the following steps:

步骤S1,根据输入的指令信息检索相关动作提示集。Step S1: Retrieve a relevant action prompt set according to the input instruction information.

具体地,步骤S1进一步包括:Specifically, step S1 further includes:

步骤S100,动作提示库的建设。为了更好的对齐图像和动作短语,形成动作提示符,设计了一个两分支方案来收集图像和文本子提示。首先,对于训练数据集中的一个指令路径实例,使用一个提前创建好的视觉物体/位置词汇表来查找指令中提及的视觉物体/位置。然后对于每个视觉物体/位置,分别获得相关的图像和文本子提示,如下所述。Step S100, construction of an action prompt library. To better align images and action phrases to form action prompts, a two-branch scheme is designed to collect image and text sub-prompts. First, for an instruction path instance in the training dataset, a pre-created visual object/location vocabulary is used to find the visual object/location mentioned in the instruction. Then for each visual object/location, the associated image and textual subcues are obtained separately, as described below.

请注意,ground-truth路径序列包含一个单视图图像的集合,每一个都表示一个需要在特定的时间步骤进行的动作。因此,为了派生动作提示中的图像子提示,只从ground-truth路径序列中检索与物体/位置相关的图像,它本身包含行动信息。相比诉诸于现有的物体分类器或在固定的物品类别集合上训练的检测器,使用具有优秀的0-shot跨模态对齐能力的CLIP,用于定位locate物体/位置相关的图像。为了适应CLIP的推理过程,将短语“a photo of{CLASS}”中的标记{CLASS}token替换为类别标签是c的可视物体/位置。在动作序列中一个图像B属于c类的概率由以下方法计算:Note that the ground-truth path sequence contains a collection of single-view images, each representing an action that needs to be performed at a specific time step. Therefore, to derive image sub-cues in action cues, only object/location-related images, which themselves contain action information, are retrieved from the ground-truth path sequence. Compared to resorting to existing object classifiers or detectors trained on a fixed set of item categories, we use CLIP with excellent 0-shot cross-modal alignment capability for locating objects/position-related images. To accommodate the inference process of CLIP, the token {CLASS} token in the phrase "a photo of {CLASS}" is replaced with a visual object/location whose class label is c. The probability that an image B belongs to class c in the action sequence is calculated by:

Figure BDA0003624946130000131
Figure BDA0003624946130000131

其中τ1为温度temperature参数,sim为余弦相似度,b,wc分别为CLIP生成的图像特征和短语特征,M为词汇表的尺寸,然后选择与该短语相似度最大的图像作为图像子提示。where τ1 is the temperature parameter, sim is the cosine similarity, b , wc are the image features and phrase features generated by CLIP, respectively, M is the size of the vocabulary, and then the image with the greatest similarity to the phrase is selected as the image sub-prompt.

为了获得文本子提示,使用一个简单的最近动词搜索方案,即找到一个特定的物体/位置词之前最近的动词(在预先构建的动词词汇中)。最后,拥有相同的视觉物体/位置和动作的图像和文本子提示形成一个对齐的动作提示。To obtain textual subcues, a simple nearest-verb search scheme is used to find the nearest verb (in a pre-built verb vocabulary) before a specific object/location word. Finally, image and text subcues with the same visual object/position and action form an aligned action cue.

步骤S101,动作提示集的检索。在导航的开始,智能体从动作提示库中检索与指令相关的动作提示。计算提示库中每个与对象/位置相关的动作短语与文本子提示之间的句子相似度,用于检索与指令相关的动作提示集

Figure BDA0003624946130000132
其中N为该集合的大小。Step S101, retrieval of the action prompt set. At the beginning of navigation, the agent retrieves the action cues associated with the instruction from the action cue library. Calculates the sentence similarity between each object/location-related action phrase in the cue library and the textual sub-cues for retrieving the set of instruction-related action cues
Figure BDA0003624946130000132
where N is the size of the set.

步骤S2,通过神经网络分别对输入的图像信息和指令信息进行编码。Step S2, respectively encode the input image information and instruction information through a neural network.

具体地,步骤S2进一步包括:Specifically, step S2 further includes:

步骤S200,视觉输入的编码,对于时间步长t时,候选视图中的每个图像视图Ot,i,都将使用一个预先训练的卷积神经网络CNN或transformer提取图像特征vt,i,然后vt,i被视觉编码器Fv映射为视觉编码:Step S200, coding of visual input, for time step t, each image view O t,i in the candidate view will use a pre-trained convolutional neural network CNN or transformer to extract image features v t,i , Then v t, i are mapped to visual encoding by visual encoder F v :

Vt,i=Fv(vt,i;θv)V t,i =F v (v t,i ; θ v )

其中θv为Fv的参数,一组

Figure BDA0003624946130000133
代表时间t下的候选视觉编码。where θ v is the parameter of F v , a set of
Figure BDA0003624946130000133
represents the candidate visual encoding at time t.

步骤S201,语言输入的编码,初始化时,指令编码X和初始化后的状态特征s0通过输入指令序列I和[CLS]和[SEP]tokens给transformer中的self-attention模块获得:Step S201, the encoding of language input, during initialization, the instruction encoding X and the initialized state feature s 0 are obtained by inputting the instruction sequence I and [CLS] and [SEP] tokens to the self-attention module in the transformer:

Figure BDA0003624946130000134
Figure BDA0003624946130000134

其中Concat(·)代表连接concatenation操作,

Figure BDA0003624946130000135
表示self-attention模组的参数,s0将会在时间步骤t被更新为st。Where Concat( ) represents the connection concatenation operation,
Figure BDA0003624946130000135
Represents the parameters of the self-attention module, s 0 will be updated to s t at time step t .

Step S3, the action prompt set is encoded by the modal prompt encoder. The retrieved action prompt set {p_n}_{n=1}^{N} is fed to the prompt encoder to obtain the prompt encodings {f_n^p}_{n=1}^{N}. The prompt encoder consists of two single-modal sub-prompt encoders and one multi-modal prompt encoder. Each action prompt p_n contains an image sub-prompt p_n^i and a text sub-prompt p_n^u. The sub-prompt features f_n^i and f_n^u are first obtained through the single-modal sub-prompt encoders:

f_n^i = E_i(p_n^i; θ_i)

f_n^u = E_u(p_n^u; θ_u)

where E_i(·) with parameters θ_i and E_u(·) with parameters θ_u denote the image sub-prompt encoder and the text sub-prompt encoder, respectively. Then f_n^i and f_n^u are fed to the multi-modal prompt encoder E_p(·) to obtain the prompt encoding f_n^p:

f_n^p = E_p(Concat(f_n^i, f_n^u); θ_p)

其中θp为Ep(·)的参数,Concat(·)为连接运算,编码器Ei(·),Eu(·)和Ep(·)由一个线性层组成,后接dropout操作,以减少过拟合。where θ p is the parameter of E p ( ), Concat ( ) is the concatenation operation, the encoder E i ( ), E u ( ) and E p ( ) consist of a linear layer followed by a dropout operation, to reduce overfitting.
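A compact sketch of this prompt encoder, with each encoder realised as a linear layer followed by dropout as stated above, is given below; the feature dimensions and the dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """E_i / E_u encode the image / text sub-prompts; E_p fuses their concatenation."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=768, p_drop=0.1):
        super().__init__()
        self.E_i = nn.Sequential(nn.Linear(img_dim, hidden), nn.Dropout(p_drop))      # theta_i
        self.E_u = nn.Sequential(nn.Linear(txt_dim, hidden), nn.Dropout(p_drop))      # theta_u
        self.E_p = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Dropout(p_drop))   # theta_p

    def forward(self, p_img, p_txt):                     # (N, img_dim), (N, txt_dim)
        f_i, f_u = self.E_i(p_img), self.E_u(p_txt)      # sub-prompt features
        f_p = self.E_p(torch.cat([f_i, f_u], dim=-1))    # fused prompt encodings
        return f_i, f_u, f_p

enc = PromptEncoder()
f_i, f_u, f_p = enc(torch.randn(5, 512), torch.randn(5, 512))
print(f_p.shape)    # torch.Size([5, 768])
```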

Step S4, on the basis of the prompt encodings {f_n^p}_{n=1}^{N} and the instruction encoding X, the prompt-based instruction feature X_p is obtained by simply concatenating X and {f_n^p}_{n=1}^{N}.

Step S5, the state-visual feature K_t is updated based on the cross-modal attention α_t between K_t and the prompt-based instruction feature X_p:

α_t = softmax(K_t X_p^T),    K_t' = α_t X_p

The attention α_t is then decomposed into α_t^X, the part over the instruction encoding X, and α_t^p, the part over the prompt encodings, to obtain different attention-enhanced features. The attended instruction feature X̂_t is obtained by weighting X with α_t^X; the attention-enhanced image sub-prompt feature f̂_t^i and the attention-enhanced text sub-prompt feature f̂_t^u are obtained by weighting the sub-prompt features {f_n^i} and {f_n^u} with α_t^p. f̂_t^i and f̂_t^u are used to compute the sequential consistency loss L_c. As in the baseline agent, X̂_t is used to update the state feature. Finally, the updated state-visual feature K_t' is fed into another self-attention module to obtain the attention scores of the state feature over the visual features, i.e., the prompt-based action prediction probability p_t.
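The following single-head sketch illustrates the flow of step S5: attention over the concatenated instruction and prompt encodings, decomposition of the attention weights, and action scores from the state row over the visual rows. The single-head dot-product form, the shapes and the temperature are assumptions for illustration; per claim 1, the actual decision module is a multi-layer transformer.

```python
import torch
import torch.nn.functional as F

def prompt_based_decision(K_t, X, f_i, f_u, f_p, tau=1.0):
    """K_t: (1+V, D) state + candidate visual tokens; X: (L, D) instruction encoding;
    f_i, f_u, f_p: (N, D) image / text sub-prompt features and fused prompt encodings."""
    X_p = torch.cat([X, f_p], dim=0)                         # prompt-based instruction feature
    attn = F.softmax(K_t @ X_p.t() / tau, dim=-1)            # cross-modal attention, (1+V, L+N)
    attn_X, attn_p = attn[:, :X.size(0)], attn[:, X.size(0):]

    K_upd = attn @ X_p                                       # updated state-visual features
    X_hat = attn_X[0] @ X                                    # attended instruction feature (state row)
    f_i_hat = attn_p[0] @ f_i                                 # attention-enhanced image sub-prompt feature
    f_u_hat = attn_p[0] @ f_u                                 # attention-enhanced text sub-prompt feature

    state, visual = K_upd[0], K_upd[1:]
    action_probs = F.softmax(visual @ state / tau, dim=-1)    # prompt-based action prediction
    return action_probs, X_hat, f_i_hat, f_u_hat

torch.manual_seed(0)
p, *_ = prompt_based_decision(torch.randn(9, 768), torch.randn(20, 768),
                              torch.randn(5, 768), torch.randn(5, 768), torch.randn(5, 768))
print(p.shape, float(p.sum()))
```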

Step S6, the total objective is computed as a weighted sum of the individual losses, and the model is updated and optimized accordingly to improve the navigation performance and generalization ability of the agent.

具体地,步骤S6进一步包括:Specifically, step S6 further includes:

Step S600, modality alignment loss. Since each action prompt already has a matched pair of image and text sub-prompts, this loss encourages the pair to be aligned in the feature space. Following the contrastive learning paradigm used in CLIP, paired image and text features are pulled together while unpaired ones are pushed apart; an InfoNCE loss is used to promote the feature alignment of the image and text sub-prompts within each action prompt:

L_a = -(1/N) Σ_{n=1}^{N} log [ exp(sim(f_n^i, f_n^u)/τ_2) / Σ_{m=1}^{N} exp(sim(f_n^i, f_m^u)/τ_2) ]

where τ_2 is the temperature parameter, (f_n^i, f_n^u) denote the paired image and text sub-prompt features of action prompt p_n, and f_m^u (m ≠ n) denotes an unpaired sub-prompt. Through the modality alignment loss, the action prompts become more discriminative, which guides the learning of action-level modality alignment.

Step S601, sequential consistency loss. Since the instruction usually refers to different visual landmarks in order, the action prompts in the retrieved action prompt set {p_n} are also related to different objects/locations. To encourage the agent to attend, according to its observations, to the relevant action prompts in the retrieved set in order, a sequential consistency loss is proposed, defined as the sum of two single-modal consistency losses. At each time step t, the attention-enhanced text sub-prompt feature f̂_t^u and the attended instruction feature X̂_t must be close:

L_c^u = 1 - sim(f̂_t^u, X̂_t)

Similarly, an image-side term L_c^i is defined to raise the similarity between the attention-enhanced image sub-prompt feature f̂_t^i and the attention-enhanced visual feature; the sequential consistency loss L_c is then:

L_c = Σ_t (L_c^u + L_c^i)
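The sketch below uses 1 minus cosine similarity for each of the two unimodal terms, matching the form given above; the symbol names, the averaging and the single-time-step usage are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sequential_consistency_loss(f_u_hat, X_hat, f_i_hat, V_hat):
    """Sum of a text-side term (attention-enhanced text sub-prompt feature vs. attended
    instruction feature) and an image-side term (attention-enhanced image sub-prompt
    feature vs. attention-enhanced visual feature)."""
    loss_text = 1.0 - F.cosine_similarity(f_u_hat, X_hat, dim=-1).mean()
    loss_image = 1.0 - F.cosine_similarity(f_i_hat, V_hat, dim=-1).mean()
    return loss_text + loss_image

torch.manual_seed(0)
d = 768
print(float(sequential_consistency_loss(torch.randn(d), torch.randn(d),
                                        torch.randn(d), torch.randn(d))))
```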

Step S602, total objective. Together with the navigation losses L_n, namely the imitation learning loss L_IL and the reinforcement learning loss L_RL, the total training objective is:

L = L_RL + λ_1 L_IL + λ_2 L_c + λ_3 L_a

其中λ1,λ2和λ3是平衡损失的损失权重。where λ 1 , λ 2 and λ 3 are the loss weights to balance the losses.
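For completeness, combining the terms is a plain weighted sum, as sketched below; the weight values shown are placeholders, not the ones used in the patent.

```python
def total_objective(loss_rl, loss_il, loss_c, loss_a,
                    lambda1=1.0, lambda2=0.5, lambda3=0.5):
    """L = L_RL + lambda1 * L_IL + lambda2 * L_c + lambda3 * L_a"""
    return loss_rl + lambda1 * loss_il + lambda2 * loss_c + lambda3 * loss_a

print(total_objective(1.2, 0.8, 0.3, 0.4))
```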

图3为本发明具体实施例中模态对齐动作提示的视觉语言导航模块示例图。FIG. 3 is an example diagram of a visual language navigation module for modal alignment action prompts in a specific embodiment of the present invention.

This figure shows a comparison between the action decisions of the baseline agent and those of the present invention. With the help of the action prompt related to "going to the stairs", the present invention selects the correct action and navigates successfully under the given observation.

图4为本发明具体实施例中动作提示集产生模块的动作提示库建造的示例图。FIG. 4 is an example diagram of the construction of an action prompt library of an action prompt set generation module in a specific embodiment of the present invention.

The present invention uses a two-branch scheme to collect the image and text sub-prompts. First, for an instruction-path instance in the training dataset, the recently developed contrastive language-image pre-training (CLIP) model, which has strong cross-modal object/location-level alignment ability, is adopted: the {CLASS} token in the phrase "a photo of {CLASS}" is replaced with a visual object/location whose class label is c, the probability that an image B in the action sequence belongs to class c is computed, and the image with the greatest similarity to the phrase is selected as the image sub-prompt. For the text sub-prompt, the nearest-verb search scheme is used, i.e., the nearest verb (in a pre-built verb vocabulary) before a specific object/location word is found.

图5为本发明具体实施例中应用视觉语言导航方法与baseline方法结果导航的结果样例对比展示。本发明通过引入动作提示,可以准确地做出动作决策,完成成功的导航。在与“走过窗户”相关的动作提示的帮助下,本发明在前两个导航步骤中执行正确的“走过窗户”动作。然而,baseline智能体在导航过程中未能执行“走过窗户”的动作,从而导致错误的轨迹。FIG. 5 is a comparative display of a result sample of the result navigation using the visual language navigation method and the baseline method in the specific embodiment of the present invention. By introducing action prompts, the present invention can accurately make action decisions and complete successful navigation. The present invention performs the correct "walking through the window" action in the first two navigation steps with the help of the action cues related to "walking through the window". However, the baseline agent failed to perform the "walking through the window" action during navigation, resulting in an incorrect trajectory.

相同或相似的标号对应相同或相似的部件;The same or similar reference numbers correspond to the same or similar parts;

附图中描述位置关系的用于仅用于示例性说明,不能理解为对本专利的限制;The positional relationship described in the accompanying drawings is only for exemplary illustration, and should not be construed as a limitation on this patent;

显然,本发明的上述实施例仅仅是为清楚地说明本发明所作的举例,而并非是对本发明的实施方式的限定。对于所属领域的普通技术人员来说,在上述说明的基础上还可以做出其它不同形式的变化或变动。这里无需也无法对所有的实施方式予以穷举。凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等,均应包含在本发明权利要求的保护范围之内。Obviously, the above-mentioned embodiments of the present invention are only examples for clearly illustrating the present invention, rather than limiting the embodiments of the present invention. For those of ordinary skill in the art, changes or modifications in other different forms can also be made on the basis of the above description. There is no need and cannot be exhaustive of all implementations here. Any modifications, equivalent replacements and improvements made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.

Claims (10)

1.一种基于模态对齐的动作提示的视觉语言导航系统,其特征在于,包括:1. a visual language navigation system based on the action prompt of modal alignment, is characterized in that, comprises: 动作提示集产生模块,输入指令到动作提示集产生模块,智能体在导航开始前从动作提示库中检索与指令相关的动作提示集;Action prompt set generation module, inputting an instruction to the action prompt set generation module, the agent retrieves the action prompt set related to the instruction from the action prompt library before navigation starts; 模态对齐动作提示的视觉语言导航模块,动作提示集通过提示编码模块,输出提示特征与文本编码模块的输出指令特征连接在一起;基于提示的指令特征和视觉编码模块的输出视觉特征被提供给多层transformer用来做动作决策;The visual language navigation module of the modal alignment action cue, the action cue set is connected through the cue coding module, and the output cue feature is connected with the output instruction feature of the text coding module; the cue-based instruction feature and the output visual feature of the visual coding module are provided to Multi-layer transformers are used to make action decisions; 优化学习模块,即模态对齐损失模块和连续一致性损失模块,实现有效的动作提示学习。The optimized learning modules, namely the modal alignment loss module and the continuous consistency loss module, enable effective action cue learning. 2.根据权利要求1所述的基于模态对齐的动作提示的视觉语言导航系统,其特征在于,所述模态对齐动作提示的视觉语言导航模块包括:2. The visual language navigation system based on the action prompt of modal alignment according to claim 1, wherein the visual language navigation module of the modal alignment action prompt comprises: 文本编码模块该模块接收语言信息的输入,利用多层transformer神经网络分别进行编码,获得相应的特征向量;Text encoding module This module receives the input of language information, uses the multi-layer transformer neural network to encode separately, and obtains the corresponding feature vector; 提示解码模块,该模块由两个单模态子提示编码器和一个多模态提示编码器组成,图像子提示和文本子提示分别通过对应的单模态自编码器得到子提示特征,连接以后输入进多模态提示编码器,获得提示特征;The prompt decoding module is composed of two single-modality sub-prompt encoders and a multi-modal prompt encoder. The image sub-prompt and text sub-prompt respectively obtain the sub-prompt features through the corresponding single-modal auto-encoder. Input into the multi-modal prompt encoder to obtain prompt features; 视觉编码模块,该模块接收视觉观察信息的输入,通过视觉编码器进行编码,获取对应的特征向量。Visual coding module, this module receives the input of visual observation information, encodes it through the visual encoder, and obtains the corresponding feature vector. 3.根据权利要求2所述的基于模态对齐的动作提示的视觉语言导航系统,其特征在于,所述优化学习模块包括:3. The visual language navigation system based on the action prompt of modal alignment according to claim 2, is characterized in that, described optimization learning module comprises: 模态对齐损失模块,当动作提示已经有匹配的图像和文本子提示,利用InfoNCE损失使得它们在在特征空间中对齐,动作提示可以变得更加具有识别力;Modal alignment loss module, when the action prompt already has matching image and text sub-prompts, using the InfoNCE loss to align them in the feature space, the action prompt can become more discriminative; 连续一致性损失模块,促使智能体根据其观察,按顺序关注检索到的提示集中的相关动作提示。A continuous consistency loss module that prompts the agent to sequentially focus on relevant action cues in the retrieved cue set based on its observations. 4.一种应用权利要求3所述系统的视觉语言导航方法,其特征在于,包括以下步骤:4. 
a visual language navigation method applying the described system of claim 3, is characterized in that, comprises the following steps: S1:在导航的开始,智能体获取指令,通过动作提示产生模块从动作提示库中检索与指令相关的动作提示集;S1: At the beginning of navigation, the agent obtains the instruction, and retrieves the action prompt set related to the instruction from the action prompt library through the action prompt generation module; S2:通过视觉编码模块和文本编码模块,对神经网络分别对输入的图像信息和指令信息进行编码,分别获得视觉编码,指令编码,状态特征;S2: Through the visual encoding module and the text encoding module, the neural network respectively encodes the input image information and instruction information, and obtains visual encoding, instruction encoding, and state features respectively; S3:通过提示编码模,动作提示集中图像子提示和文本子提示分别通过对应的单模态自编码器得到子提示特征,连接以后输入进多模态提示编码器,获得提示特征;S3: Through the prompt coding mode, the image sub-prompt and the text sub-prompt in the action prompt set respectively obtain the sub-prompt feature through the corresponding single-modal auto-encoder, and then input into the multi-modal prompt encoder after connection to obtain the prompt feature; S4:将上述指令编码和提示编码连接起来获得基于提示的指令特征,将上述状态特征与视觉编码连接起来,得到状态视觉特征;S4: connect the above-mentioned instruction code and the prompt code to obtain the prompt-based instruction feature, and connect the above-mentioned state feature with the visual code to obtain the state visual feature; S5:通过模态对齐动作提示的视觉语言导航模块,状态视觉特征基于自身和基于提示的指令特征之间的跨模态注意力更新,将该注意力分解为两部分,第一部分对指令编码加权更新,用于更新状态特征,第二部分对图像和文本子提示特征进行加权更新,用于计算顺序一致性损失,将状态视觉特征输入另一个自注意力模块,以获得状态特征关于视觉特征的注意力分数,即基于提示的动作预测概率;S5: By modally aligning the visual language navigation module of action cues, the state visual feature is updated based on the cross-modal attention between itself and cue-based instruction features, and the attention is decomposed into two parts, the first part weights the instruction encoding Update, which is used to update the state features, the second part performs weighted updates on the image and text sub-cue features, which are used to calculate the sequential consistency loss, and feed the state visual features into another self-attention module to obtain the state features with respect to the visual features. Attention score, which is the probability of action prediction based on cues; S6:通过优化学习模,结合常用的模仿学习损失和强化学习损失,以及本发明特有的模态对齐损失和连续一致性损失,进行加权求和,获得总训练目标,对模型进行更新优化,提高智能体导航性能和泛化能力。S6: By optimizing the learning model, combining the commonly used imitation learning loss and reinforcement learning loss, as well as the unique modal alignment loss and continuous consistency loss of the present invention, weighted summation is performed to obtain the total training target, and the model is updated and optimized to improve Agent Navigation Performance and Generalization. 5.根据权利要求4所述的基于模态对齐的动作提示的视觉语言导航方法,其特征在于,所述步骤S1包括以下子步骤:5. 
The visual language navigation method based on the action prompt of modal alignment according to claim 4, wherein the step S1 comprises the following sub-steps: S100:动作提示库的建设,为了对齐图像和动作短语,形成动作提示符,设计两分支方案来收集图像和文本子提示:首先,对于训练数据集中的一个指令路径实例,使用一个提前创建好的视觉物体visual object/位置location词汇表来查找指令中提及的视觉物体/位置,对于每个视觉物体/位置,分别获得相关的图像和文本子提示,使用具有优秀的0-shot跨模态对齐能力的CLIP,用于定位物体/位置相关的图像,为了适应CLIP的推理过程,将短语“a photo of{CLASS}”中的标记{CLASS}token替换为类别标签是c的可视物体/位置,在动作序列中一个图像B属于c类的概率由以下方法计算:S100: Construction of action prompt library, in order to align images and action phrases to form action prompts, a two-branch scheme is designed to collect image and text sub-prompts: First, for an instruction path instance in the training dataset, use a pre-created Visual object visual object/location location vocabulary to find the visual object/location mentioned in the instruction, for each visual object/location, get the relevant image and text sub-cues respectively, using cross-modal alignment with excellent 0-shot Capable CLIP for locating object/position related images, to accommodate the inference process of CLIP, replace the token {CLASS} token in the phrase "a photo of {CLASS}" with a visual object/position whose class label is c , the probability that an image B belongs to class c in the action sequence is calculated by:
P(c | B) = exp(sim(b, w_c)/τ_1) / Σ_{m=1}^{M} exp(sim(b, w_m)/τ_1)
其中τ1为温度temperature参数,sim为余弦相似度,b,wc分别为CLIP生成的图像特征和短语特征,M为词汇表的尺寸,然后选择与该短语相似度最大的图像作为图像子提示,为了获得文本子提示,使用简单的最近动词搜索方案,即找到一个特定的物体/位置词之前最近的动词,该动词在预先构建的动词词汇中,最后,拥有相同的视觉物体/位置和动作的图像和文本子提示形成一个对齐的动作提示;where τ1 is the temperature parameter, sim is the cosine similarity, b , wc are the image features and phrase features generated by CLIP, respectively, M is the size of the vocabulary, and then the image with the greatest similarity to the phrase is selected as the image sub-prompt, To obtain textual subcues, a simple nearest verb search scheme is used, i.e. to find the nearest verb before a specific object/position word, which is in a pre-built verb vocabulary, and finally, has the same visual object/position and action Image and text subprompts form an aligned action prompt; S101:动作提示集的检索,在导航的开始,智能体从动作提示库中检索与指令相关的动作提示,计算提示库中每个与对象/位置相关的动作短语与文本子提示之间的句子相似度,用于检索与指令相关的动作提示集
{p_n}_{n=1}^{N}
其中N为该集合的大小。
S101: Retrieval of action prompt set, at the beginning of navigation, the agent retrieves the action prompt related to the instruction from the action prompt library, and calculates the sentence between each object/position related action phrase in the prompt library and the text sub-prompt Similarity for retrieving the set of action cues associated with the instruction
{p_n}_{n=1}^{N}
where N is the size of the set.
6.根据权利要求4所述的基于模态对齐的动作提示的视觉语言导航方法,其特征在于,所述步骤S2包括以下子步骤:6. The visual language navigation method based on the action prompt of modal alignment according to claim 4, wherein the step S2 comprises the following sub-steps: S200:视觉输入的编码,对于时间步长t时,候选视图中的每个图像视图Ot,i,都将使用一个预先训练的卷积神经网络CNN或transformer提取图像特征vt,i,然后vt,i被视觉编码器Fv映射为视觉编码:S200: Encoding of visual input, for time step t, each image view O t,i in the candidate view will use a pre-trained convolutional neural network CNN or transformer to extract image features v t,i , and then v t,i are mapped to visual encoding by visual encoder F v : Vt,i=Fv(vt,i;θv)V t,i =F v (v t,i ; θ v ) 其中θv为Fv的参数,一组
Figure FDA0003624946120000032
代表时间t下的候选视觉编码;
where θ v is the parameter of F v , a set of
Figure FDA0003624946120000032
represents the candidate visual code at time t;
S201:语言输入的编码,初始化时,指令编码X和初始化后的状态特征s0通过输入指令序列I和[CLS]和[SEP]tokens给transformer中的self-attention模块获得:S201: The encoding of language input. During initialization, the instruction code X and the initialized state feature s 0 are obtained by inputting the instruction sequence I and [CLS] and [SEP] tokens to the self-attention module in the transformer:
Figure FDA0003624946120000033
Figure FDA0003624946120000033
其中Concat(·)代表连接concatenation操作,
Figure FDA0003624946120000034
表示self-attention模组的参数,s0将会在时间步骤t被更新为st
Where Concat( ) represents the connection concatenation operation,
Figure FDA0003624946120000034
Represents the parameters of the self-attention module, s 0 will be updated to s t at time step t .
7.根据权利要求4所述的基于模态对齐的动作提示的视觉语言导航方法,其特征在于,所述步骤S3包括以步骤:7. The visual language navigation method based on the action prompt of modal alignment according to claim 4, wherein the step S3 comprises the steps of: 使用
Figure FDA0003624946120000035
通过提示编码器得到提示编码
Figure FDA0003624946120000036
该提示编码器由两个单模态子提示编码器和一个多模态提示编码器组成,
Figure FDA0003624946120000037
其中图像子提示和文本子提示分别为
Figure FDA0003624946120000038
Figure FDA0003624946120000039
Figure FDA00036249461200000310
Figure FDA00036249461200000311
首先通过单模态子提示编码器得到子提示特征
Figure FDA00036249461200000312
Figure FDA00036249461200000313
use
Figure FDA0003624946120000035
Get hint code from hint encoder
Figure FDA0003624946120000036
The cue encoder consists of two unimodal sub-cue encoders and a multi-modal cue encoder,
Figure FDA0003624946120000037
The image sub-prompt and text sub-prompt are respectively
Figure FDA0003624946120000038
and
Figure FDA0003624946120000039
Figure FDA00036249461200000310
and
Figure FDA00036249461200000311
First obtain the sub-cue features through a unimodal sub-cue encoder
Figure FDA00036249461200000312
and
Figure FDA00036249461200000313
Figure FDA00036249461200000314
Figure FDA00036249461200000314
Figure FDA00036249461200000315
Figure FDA00036249461200000315
其中Ei(·)使用参数θi,Eu(·)使用参数θu,分别表示图像子提示编码器和文本子提示编码器,然后将
Figure FDA00036249461200000316
Figure FDA00036249461200000317
输送到多模态提示编码器Ep(·),得到提示编码
Figure FDA0003624946120000041
where E i (·) uses the parameter θ i and E u (·) uses the parameter θ u , representing the image sub-hint encoder and the text sub-hint encoder, respectively, and then the
Figure FDA00036249461200000316
and
Figure FDA00036249461200000317
Send to the multimodal prompt encoder E p ( ), get the prompt code
Figure FDA0003624946120000041
Figure FDA0003624946120000042
Figure FDA0003624946120000042
其中θp为Ep(·)的参数,Concat(·)为连接运算,编码器Ei(·),Eu(·)和Ep(·)由一个线性层组成,后接dropout操作,以减少过拟合。where θ p is the parameter of E p ( ), Concat ( ) is the concatenation operation, the encoder E i ( ), E u ( ) and E p ( ) consist of a linear layer followed by a dropout operation, to reduce overfitting.
8.根据权利要求4所述的基于模态对齐的动作提示的视觉语言导航方法,其特征在于,所述步骤S4包括以下子步骤:8. The visual language navigation method based on the action prompt of modal alignment according to claim 4, wherein the step S4 comprises the following sub-steps: 在提示编码
Figure FDA0003624946120000043
和指令编码X的基础上,通过简单地将X和
Figure FDA0003624946120000044
连接起来,得到基于提示的指令特征Xp
coding at the prompt
Figure FDA0003624946120000043
and instruction encoding X, by simply combining X and
Figure FDA0003624946120000044
concatenated to obtain hint-based instruction features X p .
9.根据权利要求4所述的基于模态对齐的动作提示的视觉语言导航方法,其特征在于,所述步骤S5包括以下子步骤:9. The visual language navigation method based on the action prompt of modal alignment according to claim 4, wherein the step S5 comprises the following sub-steps: 状态视觉特征Kt基于Kt和Xp之间的跨模态注意力
Figure FDA0003624946120000045
更新:
The state visual feature Kt is based on cross-modal attention between Kt and Xp
Figure FDA0003624946120000045
renew:
Figure FDA0003624946120000046
Figure FDA0003624946120000046
然后将
Figure FDA0003624946120000047
分解为
Figure FDA0003624946120000048
Figure FDA0003624946120000049
获得不同的基于注意力机制增强的特征,参与指令特征
Figure FDA00036249461200000410
是通过对X进行
Figure FDA00036249461200000411
加权得到的,基于注意力机制增强的图像子提示特征
Figure FDA00036249461200000412
和基于注意力机制增强的文本子提示特征
Figure FDA00036249461200000413
通过对
Figure FDA00036249461200000414
Figure FDA00036249461200000415
进行
Figure FDA00036249461200000416
加权获得,
Figure FDA00036249461200000417
Figure FDA00036249461200000418
用于计算顺序一致性损失Lc,和baseline智能体一样,
Figure FDA00036249461200000419
用于更新状态特性,最后,将
Figure FDA00036249461200000420
输入
Figure FDA00036249461200000421
得到基于提示的动作预测概率
Figure FDA00036249461200000422
followed by
Figure FDA0003624946120000047
Decomposed into
Figure FDA0003624946120000048
and
Figure FDA0003624946120000049
Obtain different attention-based enhanced features, participating in instruction features
Figure FDA00036249461200000410
is performed on X by
Figure FDA00036249461200000411
Weighted, enhanced image sub-cue features based on attention mechanism
Figure FDA00036249461200000412
and enhanced text sub-cue features based on attention mechanism
Figure FDA00036249461200000413
through the pair
Figure FDA00036249461200000414
and
Figure FDA00036249461200000415
conduct
Figure FDA00036249461200000416
weighted gain,
Figure FDA00036249461200000417
and
Figure FDA00036249461200000418
Used to calculate the sequential consistency loss L c , like the baseline agent,
Figure FDA00036249461200000419
is used to update the state property, and finally, the
Figure FDA00036249461200000420
enter
Figure FDA00036249461200000421
Get cue-based action prediction probabilities
Figure FDA00036249461200000422
10.根据权利要求4所述的基于模态对齐的动作提示的视觉语言导航方法,其特征在于,所述步骤S6包括以下子步骤:10. The visual language navigation method based on the action prompt of modal alignment according to claim 4, wherein the step S6 comprises the following sub-steps: S600:模态对齐损失,促使动作提示已经有匹配的图像和文本子提示在特征空间中对齐,遵循CLIP中使用的对比学习范式,使成对的图像和文本特征相似,而不成对的图像和文本特征疏远,使用infoNCE损失以促进每个动作提示中图像和文本子提示的特征对齐:S600: Modal alignment loss, prompting action cues that already have matching image and text sub-cues to align in feature space, following the contrastive learning paradigm used in CLIP, making pairs of image and text features similar, while unpaired images and text Text feature alienation, using infoNCE loss to facilitate feature alignment of image and text sub-cues in each action cue:
Figure FDA00036249461200000423
Figure FDA00036249461200000423
其中τ2是温度参数,
Figure FDA00036249461200000424
表示动作提示pn的成对的图像和文本子提示的特征,
Figure FDA00036249461200000425
表示非配对的子提示,通过模态对齐损失,动作提示可以变得更加具有识别力,从而知道学习动作级别的模态对齐;
where τ2 is the temperature parameter,
Figure FDA00036249461200000424
The features representing pairs of image and text sub-prompts of action cue p n ,
Figure FDA00036249461200000425
Represents unpaired sub-prompts, through the modal alignment loss, the action cue can become more discriminative, so as to know the modal alignment of the learned action level;
S601:顺序一致性损失,由于指令通常顺序地指向不同的视觉标志,因此检索到的动作提示集{pn}中的动作提示也与不同的物体/位置相关,为了促使智能体根据其观察,按顺序关注检索到的提示集中的相关动作提示,提出顺序一致性损失,即两个单模态一致性损失之和;在每个时间步骤t上,基于注意力机制增强的文本子提示特征
Figure FDA0003624946120000051
以及基于注意力机制增强的指导特征
Figure FDA0003624946120000052
必须接近:
S601: Sequential consistency loss, since instructions usually point to different visual landmarks sequentially, the action cues in the retrieved action cue set {p n } are also related to different objects/positions. Focusing on the relevant action cues in the retrieved cue set in order, a sequential consistency loss is proposed, that is, the sum of the two unimodal consistency losses; at each time step t, the text sub-cue feature enhanced by the attention mechanism
Figure FDA0003624946120000051
and guidance features enhanced by attention mechanism
Figure FDA0003624946120000052
must be close to:
Figure FDA0003624946120000053
Figure FDA0003624946120000053
定义,
Figure FDA0003624946120000054
用于提高基于注意力机制增强的图像子提示特征
Figure FDA0003624946120000055
和基于注意力机制增强的视觉特征之间的相似性,则顺序一致性损失Lc为:
definition,
Figure FDA0003624946120000054
Image sub-cue features for improving attention-based enhancement
Figure FDA0003624946120000055
and the similarity between visual features enhanced by the attention mechanism, then the sequential consistency loss L c is:
Figure FDA0003624946120000056
Figure FDA0003624946120000056
S602:总目标使用导航损失Ln,即模仿损失LIL和强化学习损失LRL,总训练目标为:S602: The total objective uses the navigation loss L n , namely the imitation loss L IL and the reinforcement learning loss L RL , and the total training objective is: L=LRL1LIL2Lc3La L=L RL1 L IL2 L c3 L a 其中λ1,λ2和λ3是平衡损失的损失权重。where λ 1 , λ 2 and λ 3 are the loss weights to balance the losses.
CN202210467461.4A 2022-04-29 2022-04-29 A visual language navigation system and method based on action prompts of modal alignment Active CN114973402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210467461.4A CN114973402B (en) 2022-04-29 2022-04-29 A visual language navigation system and method based on action prompts of modal alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210467461.4A CN114973402B (en) 2022-04-29 2022-04-29 A visual language navigation system and method based on action prompts of modal alignment

Publications (2)

Publication Number Publication Date
CN114973402A true CN114973402A (en) 2022-08-30
CN114973402B CN114973402B (en) 2025-05-13

Family

ID=82980379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210467461.4A Active CN114973402B (en) 2022-04-29 2022-04-29 A visual language navigation system and method based on action prompts of modal alignment

Country Status (1)

Country Link
CN (1) CN114973402B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587596A (en) * 2022-10-10 2023-01-10 中国科学技术大学 Visual language navigation method based on cross-modal semantic alignment pre-training and application
CN115824213A (en) * 2022-11-18 2023-03-21 天津大学 A Visual Language Navigation Method Based on Follower Model
CN117875407A (en) * 2024-03-11 2024-04-12 中国兵器装备集团自动化研究所有限公司 Multi-mode continuous learning method, device, equipment and storage medium
CN119091329A (en) * 2024-08-28 2024-12-06 中国石油大学(华东) A fully autonomous navigation method for unmanned aerial vehicles with fine-grained environmental perception capabilities
CN119245649A (en) * 2024-09-24 2025-01-03 北京航空航天大学 A visual language navigation method for unmanned aerial vehicles based on visual target reference guidance
CN119807671A (en) * 2025-03-13 2025-04-11 山东大学 Visual language navigation method, system and medium based on motion feature alignment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324035A (en) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system for applying mouth-shape assisted speech recognition technology in vehicle navigation
US20200302510A1 (en) * 2019-03-24 2020-09-24 We.R Augmented Reality Cloud Ltd. System, Device, and Method of Augmented Reality based Mapping of a Venue and Navigation within a Venue
CN113119138A (en) * 2021-04-16 2021-07-16 中国科学技术大学 Blind-aiding robot system and method based on Internet of things
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 A system and method for generating video description text
CN113792112A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language task processing system, training method, device, equipment and medium
CN113804200A (en) * 2021-04-12 2021-12-17 之江实验室 Visual language navigation system and method based on dynamic reinforcement command attack module
US20250029170A1 (en) * 2019-03-24 2025-01-23 We.R Augmented Reality Cloud Ltd. Automatic Generation of In-Store Product Information and Navigation Guidance, Using Augmented Reality (AR) and a Vision-and-Language Model (VLM) and Multi-Modal Artificial Intelligence (AI)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324035A (en) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system for applying mouth-shape assisted speech recognition technology in vehicle navigation
US20200302510A1 (en) * 2019-03-24 2020-09-24 We.R Augmented Reality Cloud Ltd. System, Device, and Method of Augmented Reality based Mapping of a Venue and Navigation within a Venue
US20250029170A1 (en) * 2019-03-24 2025-01-23 We.R Augmented Reality Cloud Ltd. Automatic Generation of In-Store Product Information and Navigation Guidance, Using Augmented Reality (AR) and a Vision-and-Language Model (VLM) and Multi-Modal Artificial Intelligence (AI)
CN113792112A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language task processing system, training method, device, equipment and medium
CN113804200A (en) * 2021-04-12 2021-12-17 之江实验室 Visual language navigation system and method based on dynamic reinforcement command attack module
CN113119138A (en) * 2021-04-16 2021-07-16 中国科学技术大学 Blind-aiding robot system and method based on Internet of things
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 A system and method for generating video description text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GHAREHBAGH, AK: "Real-time 3D Semantic Mapping based on Keyframes and Octomap for Autonomous Cobot", 《2021 9TH INTERNATIONAL CONFERENCE ON CONTROL, MECHATRONICS AND AUTOMATION (ICCMA)》, 31 December 2021 (2021-12-31) *
金杰: "基于余弦相似的视觉语言导航算法", 《激光与光电子学进展》, 25 August 2021 (2021-08-25) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587596A (en) * 2022-10-10 2023-01-10 中国科学技术大学 Visual language navigation method based on cross-modal semantic alignment pre-training and application
CN115824213A (en) * 2022-11-18 2023-03-21 天津大学 A Visual Language Navigation Method Based on Follower Model
CN117875407A (en) * 2024-03-11 2024-04-12 中国兵器装备集团自动化研究所有限公司 Multi-mode continuous learning method, device, equipment and storage medium
CN117875407B (en) * 2024-03-11 2024-06-04 中国兵器装备集团自动化研究所有限公司 Multi-mode continuous learning method, device, equipment and storage medium
CN119091329A (en) * 2024-08-28 2024-12-06 中国石油大学(华东) A fully autonomous navigation method for unmanned aerial vehicles with fine-grained environmental perception capabilities
CN119245649A (en) * 2024-09-24 2025-01-03 北京航空航天大学 A visual language navigation method for unmanned aerial vehicles based on visual target reference guidance
CN119807671A (en) * 2025-03-13 2025-04-11 山东大学 Visual language navigation method, system and medium based on motion feature alignment
CN119807671B (en) * 2025-03-13 2025-05-09 山东大学 Visual language navigation method, system and medium based on motion feature alignment

Also Published As

Publication number Publication date
CN114973402B (en) 2025-05-13

Similar Documents

Publication Publication Date Title
CN114973402A (en) A visual language navigation system and method based on modal-aligned action prompts
CN110188176B (en) Deep learning neural network, and training and predicting method, system, device and medium
Bhunia et al. Joint visual semantic reasoning: Multi-stage decoder for text recognition
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
CN115964467A (en) A Semantic-rich Dialogue Generation Method Integrating Visual Context
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN117571014B (en) A visual language navigation method combining image description and text generation
CN112699682B (en) Named entity identification method and device based on combinable weak authenticator
CN113535904B (en) Aspect level emotion analysis method based on graph neural network
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN115115913A (en) A data processing method, device, electronic device and storage medium
CN114298158A (en) A Multimodal Pre-training Method Based on Linear Combination of Graphics and Text
CN110188182A (en) Model training method, dialogue generation method, device, equipment and medium
CN111967272B (en) Visual dialogue generating system based on semantic alignment
WO2019235103A1 (en) Question generation device, question generation method, and program
CN114398976A (en) Machine reading comprehension method based on BERT and gated attention-enhanced network
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN115082915B (en) A visual-language navigation method for mobile robots based on multi-modal features
CN116010622A (en) BERT knowledge graph completion method and system for fusion entity type
CN118820785A (en) A visual language navigation method based on enhanced endpoint alignment to improve VLN-BERT
CN117669693A (en) A knowledge distillation method and system based on multi-teacher multi-modal model
CN117428780A (en) Robot motor skill learning method integrating text instruction and motion information
CN115953569A (en) A One-Stage Visual Localization Model Construction Method Based on Multi-step Reasoning
CN113780350A (en) Image description method based on ViLBERT and BilSTM
CN116681087B (en) An automatic question generation method based on multi-stage timing and semantic information enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant