
CN113792207A - Cross-modal retrieval method based on multi-level feature representation alignment - Google Patents

Cross-modal retrieval method based on multi-level feature representation alignment

Info

Publication number
CN113792207A
CN113792207A (application CN202111149240.4A; granted as CN113792207B)
Authority
CN
China
Prior art keywords
text
image
data
target
formula
Prior art date
Legal status
Granted
Application number
CN202111149240.4A
Other languages
Chinese (zh)
Other versions
CN113792207B (en)
Inventor
张卫锋
周俊峰
王小江
Current Assignee
Jiaxing University
Original Assignee
Jiaxing University
Priority date
Filing date
Publication date
Application filed by Jiaxing University
Priority to CN202111149240.4A
Publication of CN113792207A
Application granted
Publication of CN113792207B
Legal status: Active
Anticipated expiration

Classifications

    • G06F16/953: Information retrieval from the web; querying, e.g. by the use of web search engines
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N3/044: Neural network architectures; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval method based on multi-level feature representation alignment, and relates to the technical field of cross-modal retrieval. In the cross-modal fine-grained alignment stage, the method computes the global similarity, local similarity and relation similarity between image data and text data and fuses them into a comprehensive image-text similarity. In the neural network training stage, corresponding loss functions are designed to mine cross-modal structural constraint information and to constrain and supervise the parameter learning of the retrieval model from multiple perspectives. Finally, the retrieval results for a test query sample are obtained according to the comprehensive image-text similarity. By introducing fine-grained correlations between the two different modalities of image and text, the method effectively improves the accuracy of cross-modal retrieval and has broad market demand and application prospects in image-text retrieval, pattern recognition and related fields.

Description

A Cross-Modal Retrieval Method Based on Multi-Level Feature Representation Alignment

Technical Field

The present invention relates to the technical field of cross-modal retrieval, and in particular to a cross-modal retrieval method based on multi-level feature representation alignment.

Background Art

With the rapid development of new-generation Internet technologies such as the mobile Internet and social networks, multi-modal data such as text, images and videos have grown explosively. Cross-modal retrieval technology aims to achieve retrieval across different modalities by mining and exploiting the correlation information between data of different modalities, and its core is the similarity measurement between cross-modal data. In recent years, cross-modal retrieval has become a research hotspot at home and abroad and has received extensive attention from academia and industry; it is one of the important research fields of cross-modal intelligence and an important direction for the future development of information retrieval.

Cross-modal retrieval involves data of multiple modalities at the same time, and a "heterogeneity gap" exists between these data: they are related to each other in high-level semantics but heterogeneous in their underlying features. The retrieval algorithm therefore needs to mine the correlation information between data of different modalities in depth and align data of one modality with data of another modality.

At present, subspace learning is the mainstream approach to cross-modal retrieval; such methods can be subdivided into retrieval models based on traditional statistical correlation analysis and retrieval models based on deep learning. Cross-modal retrieval methods based on traditional statistical correlation analysis map data of different modalities into a subspace through linear mapping matrices so as to maximize the correlation between the data of different modalities. Cross-modal retrieval methods based on deep learning use the feature extraction ability of deep neural networks to obtain effective representations of the data of each modality, and use the complex nonlinear mapping ability of neural networks to mine the complex correlations between cross-modal data.

In the process of realizing the present invention, the applicant found that the prior art has the following technical problems:

The cross-modal retrieval methods provided by the prior art focus on the representation learning, correlation analysis and alignment of the global and local features of images and texts, but they lack reasoning about the relations between visual objects and alignment of the relation information, and they cannot fully and effectively use the structural constraint information contained in the training data to supervise model training, so their cross-modal retrieval accuracy for images and texts is low.

Summary of the Invention

In order to solve the above problems of the prior art, the present invention provides a cross-modal retrieval method based on multi-level feature representation alignment, which accurately measures the similarity between images and texts through cross-modal multi-level representation correlation and effectively improves retrieval accuracy, thereby solving the technical problems that the representations of existing cross-modal retrieval methods are not fine-grained enough and their cross-modal correlations are not sufficient; at the same time, cross-modal structural constraint information is used to supervise the training of the retrieval model. The technical solution of the present invention is as follows:

According to one aspect of the embodiments of the present invention, a cross-modal retrieval method based on multi-level feature representation alignment is provided, wherein the method comprises:

acquiring a training data set, wherein each data pair in the training data set comprises image data, text data, and a semantic label jointly corresponding to the image data and the text data;

for each data pair in the training data set, separately extracting the image global feature, image local features and image relation features corresponding to the image data in the data pair, and the text global feature, text local features and text relation features corresponding to the text data in the data pair;

for a target data pair composed of any image data and any text data in the training data set, calculating the comprehensive image-text similarity of the target data pair from the image global feature and text global feature corresponding to the target data pair, the image local features and text local features corresponding to the target data pair, and the image relation features and text relation features corresponding to the target data pair;

based on the comprehensive image-text similarity of each target data pair, designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function, and training the model with the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
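The four steps above can be read as one training iteration. The outline below is a minimal sketch of that loop in Python/PyTorch; the model methods named here (extract_image_features, extract_text_features, fuse_similarity, inter_modal_loss, intra_modal_loss) are hypothetical placeholders for the components detailed in the embodiments that follow, not names defined by the patent.

```python
# Illustrative outline only; the model methods are hypothetical placeholders.
def train_step(model, batch, optimizer, lam=1.0):
    images, texts, labels = batch                                # matched image-text pairs with shared semantic labels
    v_glb, v_loc, v_rel = model.extract_image_features(images)   # image global / local / relation features
    t_glb, t_loc, t_rel = model.extract_text_features(texts)     # text global / local / relation features
    sim = model.fuse_similarity(v_glb, v_loc, v_rel,
                                t_glb, t_loc, t_rel)              # comprehensive image-text similarity
    loss = model.inter_modal_loss(sim) + lam * model.intra_modal_loss(v_glb, t_glb, labels)
    optimizer.zero_grad()
    loss.backward()                                               # back-propagation updates the network parameters
    optimizer.step()
    return loss
```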

In a preferred embodiment, the step of separately extracting, for each data pair in the training data set, the image global feature, image local features and image relation features corresponding to the image data in the data pair and the text global feature, text local features and text relation features corresponding to the text data in the data pair comprises:

For each data pair in the training data set, a convolutional neural network (CNN) is used to extract the image global feature $v^{g}$ of the image data of the pair; a visual object detector is then used to detect the visual objects contained in the image data and to extract the image local feature of each visual object, $\{v_{i}\}_{i=1}^{M}$, where $M$ is the number of visual objects contained in the image data and $v_{i}$ is the feature vector of visual object $i$; the image relation features between the visual objects, $\{r^{v}_{ij}\}$, are then extracted through an image visual relation encoding network, where $r^{v}_{ij}$ is the image relation feature between visual object $i$ and visual object $j$.

For each data pair in the training data set, a word embedding model is used to convert each word of the text data of the pair into a word vector, giving $\{w_{i}\}_{i=1}^{N}$, where $N$ is the number of words contained in the text data; the word vectors are then fed in sequence into a recurrent neural network to obtain the text global feature $t^{g}$ of the text data; the word vectors are also fed into a feed-forward neural network to obtain the text local feature $\{t_{i}\}_{i=1}^{N}$ of each word; at the same time, the word vectors are fed into a text relation encoding network to extract the text relation features between the words, $\{r^{t}_{ij}\}$, where $r^{t}_{ij}$ is the text relation feature between word $i$ and word $j$.

In a preferred embodiment, the step of calculating, for a target data pair composed of any image data and any text data in the training data set, the comprehensive image-text similarity of the target data pair from the corresponding image global feature and text global feature, the corresponding image local features and text local features, and the corresponding image relation features and text relation features comprises:

For a target data pair composed of any image data and any text data in the training data set, the image-text global similarity $S_{glb}(I,T)$ of the target data pair is calculated from the cosine distance between the image global feature $v^{g}$ corresponding to the image data and the text global feature $t^{g}$ corresponding to the text data in the target data pair; the image-text global similarity $S_{glb}(I,T)$ is computed as in formula (1):

$S_{glb}(I,T)=\dfrac{v^{g}\cdot t^{g}}{\lVert v^{g}\rVert\,\lVert t^{g}\rVert}$    Formula (1)

A text-guided attention mechanism is used to compute the weight of each visual object contained in the image data of the target data pair; after the image local features $\{v_{i}\}$ of the visual objects are weighted accordingly, the new image local representations $\{\hat{v}_{i}\}$ are obtained through a feed-forward neural network mapping. A visually-guided attention mechanism is then used to compute the weight of each word contained in the text data of the target data pair; after the text local features $\{t_{i}\}$ of the words are weighted accordingly, the new text local representations $\{\hat{t}_{i}\}$ are obtained through a feed-forward neural network mapping. The cosine similarities of all visual objects and words are computed from the image local representations $\{\hat{v}_{i}\}$ and the text local representations $\{\hat{t}_{j}\}$, and their mean gives the image-text local similarity $S_{loc}(I,T)$ of the target data pair; the image-text local similarity $S_{loc}(I,T)$ is computed as in formula (2), where $M$ is the number of visual objects and $N$ is the number of words:

$S_{loc}(I,T)=\dfrac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\dfrac{\hat{v}_{i}\cdot \hat{t}_{j}}{\lVert \hat{v}_{i}\rVert\,\lVert \hat{t}_{j}\rVert}$    Formula (2)

The image-text relation similarity $S_{rel}(I,T)$ of the target data pair is calculated from the mean cosine similarity of the image relation features and the text relation features of the target data pair; the image-text relation similarity $S_{rel}(I,T)$ is computed as in formula (3), where $P$ denotes the number of relations of the image data and the text data:

$S_{rel}(I,T)=\dfrac{1}{P}\sum_{p=1}^{P}\dfrac{r^{v}_{p}\cdot r^{t}_{p}}{\lVert r^{v}_{p}\rVert\,\lVert r^{t}_{p}\rVert}$    Formula (3)

The comprehensive image-text similarity $S(I,T)$ of the target data pair is calculated from its image-text global similarity $S_{glb}(I,T)$, image-text local similarity $S_{loc}(I,T)$ and image-text relation similarity $S_{rel}(I,T)$; the three similarities are fused into $S(I,T)$ as in formula (4).

In a preferred embodiment, the inter-modal structural constraint loss function is computed as in formula (5), where $B$ is the number of samples, $\alpha$ is a model hyperparameter, $(I,T)$ is a matched target data pair, and $(I,T^{-})$ and $(I^{-},T)$ are non-matched target data pairs.

The intra-modal structural constraint loss function is computed as in formula (6), where $(I,I^{+},I^{-})$ is an image triplet in which $I^{+}$ shares more common semantic labels with $I$ than $I^{-}$ does, and $(T,T^{+},T^{-})$ is a text triplet in which $T^{+}$ shares more common semantic labels with $T$ than $T^{-}$ does.

In a preferred embodiment, the step of training the neural network model with the inter-modal structural constraint loss function and the intra-modal structural constraint loss function comprises:

obtaining matched target data pairs, non-matched target data pairs, image triplets and text triplets by random sampling from the training data set; computing the inter-modal structural constraint loss value according to the inter-modal structural constraint loss function and the intra-modal structural constraint loss value according to the intra-modal structural constraint loss function; fusing the two according to formula (7), in which $\lambda$ is a hyperparameter; and optimizing the network parameters with the back-propagation algorithm.

In a preferred embodiment, the step of extracting the image relation features $\{r^{v}_{ij}\}$ between the visual objects through the image visual relation encoding network comprises:

obtaining, via the image visual object detector, the features $v_{i}$ and $v_{j}$ of visual objects $i$ and $j$ in the image and the feature $v^{u}_{ij}$ of the joint region of the two objects, and fusing these features according to formula (8) to compute each relation feature:

$r^{v}_{ij}=\sigma\left(W_{r}\,[\,v_{i};\,v_{j};\,v^{u}_{ij}\,]\right)$    Formula (8)

where $[\,;\,]$ denotes the vector concatenation operation, $\sigma$ is the neuron activation function and $W_{r}$ is a model parameter.

In a preferred embodiment, the step of feeding the word vectors into the text relation encoding network to extract the text relation features $\{r^{t}_{ij}\}$ between the words comprises:

computing, in the text relation encoding network, the text relation feature $r^{t}_{ij}$ between word $i$ and word $j$ according to formula (9):

$r^{t}_{ij}=\sigma\left(W_{t}\,[\,w_{i};\,w_{j}\,]\right)$    Formula (9)

where $\sigma$ denotes the neuron activation function and $W_{t}$ is a model parameter.

In a preferred embodiment, the step of computing, with the text-guided attention mechanism, the weight of each visual object contained in the image data of the target data pair, weighting the image local features $\{v_{i}\}$ of the visual objects accordingly, and obtaining the new image local representations $\{\hat{v}_{i}\}$ through a feed-forward neural network mapping comprises:

computing the weight of each visual object in the image with the text-guided attention mechanism according to formula (10), whose attention parameters are model parameters; and

weighting each visual object according to formula (11) and mapping it through the feed-forward neural network, whose weights are model parameters, to obtain the new image local representation $\hat{v}_{i}$.

In a preferred embodiment, the step of computing, with the visually-guided attention mechanism, the weight of each word contained in the text data of the target data pair, weighting the text local features $\{t_{i}\}$ of the words accordingly, and obtaining the new text local representations $\{\hat{t}_{i}\}$ through a feed-forward neural network mapping comprises:

computing the weight of each word in the text with the visually-guided attention mechanism according to formula (12), whose attention parameters are model parameters; and

weighting the text local feature $t_{i}$ of each word according to formula (13) and mapping it through the feed-forward neural network, whose weights are model parameters, to obtain the new text local representation $\hat{t}_{i}$.

In a preferred embodiment, the training data set is obtained from Wikipedia, MS COCO and Pascal VOC.

Compared with the prior art, the cross-modal retrieval method based on multi-level feature representation alignment provided by the present invention has the following advantages:

In the cross-modal fine-grained alignment stage, the method computes the global similarity, local similarity and relation similarity between image data and text data and fuses them into a comprehensive image-text similarity; in the network training stage, corresponding loss functions are designed to mine cross-modal structural constraint information and to constrain and supervise the parameter learning of the retrieval model from multiple perspectives; finally, the retrieval results for a test query sample are obtained according to the comprehensive image-text similarity. By introducing fine-grained correlations between the two different modalities of image and text, the method effectively improves the accuracy of cross-modal retrieval and has broad market demand and application prospects in image-text retrieval, pattern recognition and related fields.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.

Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention.

Fig. 2 is a flowchart of a cross-modal retrieval method based on multi-level feature representation alignment according to an exemplary embodiment.

Fig. 3 is a schematic diagram of the inter-modal structural constraint loss according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of the intra-modal structural constraint loss according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of the results of retrieving images with text according to an embodiment of the present invention.

Fig. 6 is a block diagram of an apparatus for implementing the cross-modal retrieval method based on multi-level feature representation alignment according to an exemplary embodiment.

Fig. 7 is a block diagram of an apparatus for implementing the cross-modal retrieval method based on multi-level feature representation alignment according to an exemplary embodiment.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to specific embodiments (but not limited to the illustrated embodiments) and the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The embodiments of the present invention are applicable to various scenarios, and the implementation environment involved may be the input-output scenario of a single server or an interaction scenario between a terminal and a server. When the implementation environment is the input-output scenario of a single server, the server both acquires and stores the image data and the text data; when the implementation environment is an interaction scenario between a terminal and a server, a schematic diagram of the implementation environment involved in the embodiment may be as shown in Fig. 1. In the schematic diagram of the implementation environment shown in Fig. 1, the implementation environment includes a terminal 101 and a server 102.

The terminal 101 is an electronic device running at least one client, where a client is the client of an application program, also called an APP (Application). The terminal 101 may be a smartphone, a tablet computer, or the like.

The terminal 101 and the server 102 are connected through a wireless or wired network. The terminal 101 is used to send data to the server 102, or to receive data sent by the server 102. In one possible implementation, the terminal 101 may send at least one of image data and text data to the server 102.

The server 102 is used to receive the data sent by the terminal 101, or to send data to the terminal 101. The server 102 may analyze and process the data sent by the terminal 101, match the image data or text data with the highest similarity from a database, and send it to the terminal 101.

Fig. 2 is a flowchart of a cross-modal retrieval method based on multi-level feature representation alignment according to an exemplary embodiment. As shown in Fig. 2, the method includes:

Step 100: acquire a training data set, where each data pair in the training data set includes image data, text data, and a semantic label jointly corresponding to the image data and the text data.

It should be noted that the text data may be text content in any language, such as English, Chinese, Japanese or German, and the image data may be image content of any color type, such as a color image or a grayscale image.

Step 200: for each data pair in the training data set, separately extract the image global feature, image local features and image relation features corresponding to the image data in the data pair, and the text global feature, text local features and text relation features corresponding to the text data in the data pair.

In a preferred embodiment, step 200 specifically includes:

Step 210: for each data pair in the training data set, use a convolutional neural network (CNN) to extract the image global feature $v^{g}$ of the image data of the pair; then use a visual object detector to detect the visual objects contained in the image data and extract the image local feature of each visual object, $\{v_{i}\}_{i=1}^{M}$, where $M$ is the number of visual objects contained in the image data and $v_{i}$ is the feature vector of visual object $i$; then extract the image relation features between the visual objects, $\{r^{v}_{ij}\}$, through the image visual relation encoding network, where $r^{v}_{ij}$ is the image relation feature between visual object $i$ and visual object $j$.

Step 220: for each data pair in the training data set, use a word embedding model to convert each word of the text data of the pair into a word vector, giving $\{w_{i}\}_{i=1}^{N}$, where $N$ is the number of words contained in the text data; then feed the word vectors in sequence into a recurrent neural network to obtain the text global feature $t^{g}$ of the text data; feed the word vectors into a feed-forward neural network to obtain the text local feature $\{t_{i}\}_{i=1}^{N}$ of each word; and at the same time feed the word vectors into the text relation encoding network to extract the text relation features between the words, $\{r^{t}_{ij}\}$, where $r^{t}_{ij}$ is the text relation feature between word $i$ and word $j$.

Through the implementation of the above step 200, a cross-modal multi-level refined representation is obtained.
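As an illustration of step 200, the following is a minimal sketch of the two encoding branches, assuming PyTorch, a GRU as the recurrent network, and detector region features that are supplied precomputed; all layer sizes are assumptions rather than values given by the patent. The relation encoding networks are sketched separately further below.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Text branch of step 220: word embeddings -> recurrent network for the global
    feature t^g, feed-forward layer for per-word local features t_i (assumed sizes)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # word embedding model
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # recurrent neural network
        self.local_ffn = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())

    def forward(self, token_ids):              # token_ids: (batch, N)
        w = self.embed(token_ids)              # word vectors w_1..w_N: (batch, N, embed_dim)
        _, h_n = self.rnn(w)                   # final hidden state used as text global feature
        t_glb = h_n.squeeze(0)                 # t^g: (batch, hidden_dim)
        t_loc = self.local_ffn(w)              # text local features t_i: (batch, N, hidden_dim)
        return t_glb, t_loc

class ImageEncoder(nn.Module):
    """Image branch of step 210. The CNN global descriptor and the detector region
    features are assumed to be precomputed and passed in; only the projections are shown."""
    def __init__(self, cnn_dim=2048, region_dim=2048, hidden_dim=1024):
        super().__init__()
        self.glb_proj = nn.Linear(cnn_dim, hidden_dim)     # image global feature v^g
        self.loc_proj = nn.Linear(region_dim, hidden_dim)  # image local features v_i (one per object)

    def forward(self, cnn_global, region_feats):           # (batch, cnn_dim), (batch, M, region_dim)
        return self.glb_proj(cnn_global), self.loc_proj(region_feats)
```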

Step 300: for a target data pair composed of any image data and any text data in the training data set, calculate the comprehensive image-text similarity of the target data pair from the image global feature and text global feature corresponding to the target data pair, the image local features and text local features corresponding to the target data pair, and the image relation features and text relation features corresponding to the target data pair.

In a preferred embodiment, step 300 specifically includes:

Step 310: for a target data pair composed of any image data and any text data in the training data set, calculate the image-text global similarity $S_{glb}(I,T)$ of the target data pair from the cosine distance between the image global feature $v^{g}$ of the image data and the text global feature $t^{g}$ of the text data in the pair.

The image-text global similarity $S_{glb}(I,T)$ is computed as in formula (1):

$S_{glb}(I,T)=\dfrac{v^{g}\cdot t^{g}}{\lVert v^{g}\rVert\,\lVert t^{g}\rVert}$    Formula (1)

Step 320: use a text-guided attention mechanism to compute the weight of each visual object contained in the image data of the target data pair; after weighting the image local features $\{v_{i}\}$ of the visual objects accordingly, obtain the new image local representations $\{\hat{v}_{i}\}$ through a feed-forward neural network mapping; then use a visually-guided attention mechanism to compute the weight of each word contained in the text data of the target data pair; after weighting the text local features $\{t_{i}\}$ of the words accordingly, obtain the new text local representations $\{\hat{t}_{i}\}$ through a feed-forward neural network mapping; compute the cosine similarities of all visual objects and words from the image local representations $\{\hat{v}_{i}\}$ and the text local representations $\{\hat{t}_{j}\}$, and take their mean as the image-text local similarity $S_{loc}(I,T)$ of the target data pair.

The image-text local similarity $S_{loc}(I,T)$ is computed as in formula (2), where $M$ is the number of visual objects and $N$ is the number of words:

$S_{loc}(I,T)=\dfrac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\dfrac{\hat{v}_{i}\cdot \hat{t}_{j}}{\lVert \hat{v}_{i}\rVert\,\lVert \hat{t}_{j}\rVert}$    Formula (2)

Step 330: calculate the image-text relation similarity $S_{rel}(I,T)$ of the target data pair from the mean cosine similarity of the image relation features and the text relation features of the target data pair. The image-text relation similarity $S_{rel}(I,T)$ is computed as in formula (3), where $P$ denotes the number of relations of the image data and the text data:

$S_{rel}(I,T)=\dfrac{1}{P}\sum_{p=1}^{P}\dfrac{r^{v}_{p}\cdot r^{t}_{p}}{\lVert r^{v}_{p}\rVert\,\lVert r^{t}_{p}\rVert}$    Formula (3)

Step 340: calculate the comprehensive image-text similarity $S(I,T)$ of the target data pair from its image-text global similarity $S_{glb}(I,T)$, image-text local similarity $S_{loc}(I,T)$ and image-text relation similarity $S_{rel}(I,T)$.

The comprehensive image-text similarity $S(I,T)$ is obtained by fusing the three similarities as in formula (4).

Through the implementation of the above step 300, fine-grained and precise cross-modal alignment is achieved.
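To make step 300 concrete, the sketch below computes the three similarities and fuses them for a single image-text pair, assuming PyTorch tensors; the attention weighting of the local features (formulas (10)-(13)) is assumed to have been applied already, and the unweighted sum used for the fusion of formula (4) is an assumption.

```python
import torch
import torch.nn.functional as F

def comprehensive_similarity(v_glb, t_glb, v_loc, t_loc, r_img, r_txt):
    """Formulas (1)-(4) for one image-text pair.
    v_glb, t_glb: (d,) global features; v_loc: (M, d) and t_loc: (N, d) attention-weighted
    local representations; r_img, r_txt: (P, d) aligned relation features."""
    s_glb = F.cosine_similarity(v_glb, t_glb, dim=-1)          # formula (1): global cosine similarity

    v_n = F.normalize(v_loc, dim=-1)
    t_n = F.normalize(t_loc, dim=-1)
    s_loc = (v_n @ t_n.t()).mean()                             # formula (2): mean over all M x N object-word pairs

    s_rel = F.cosine_similarity(r_img, r_txt, dim=-1).mean()   # formula (3): mean relation similarity

    return s_glb + s_loc + s_rel                               # formula (4): fusion (assumed unweighted)
```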

Step 400: based on the comprehensive image-text similarity of each target data pair, design an inter-modal structural constraint loss function and an intra-modal structural constraint loss function, and train the neural network model with the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.

In a preferred embodiment, the inter-modal structural constraint loss function is computed as in formula (5), where $B$ is the number of samples, $\alpha$ is a model hyperparameter, $(I,T)$ is a matched target data pair, and $(I,T^{-})$ and $(I^{-},T)$ are non-matched target data pairs.

The intra-modal structural constraint loss function is computed as in formula (6), where $(I,I^{+},I^{-})$ is an image triplet in which $I^{+}$ shares more common semantic labels with $I$ than $I^{-}$ does, and $(T,T^{+},T^{-})$ is a text triplet in which $T^{+}$ shares more common semantic labels with $T$ than $T^{-}$ does.

Fig. 3 is a schematic diagram of the inter-modal structural constraint loss according to an embodiment of the present invention.

In a preferred embodiment, the step of training the neural network model with the inter-modal structural constraint loss function and the intra-modal structural constraint loss function includes:

obtaining matched target data pairs, non-matched target data pairs, image triplets and text triplets by random sampling from the training data set; computing the inter-modal structural constraint loss value according to the inter-modal structural constraint loss function and the intra-modal structural constraint loss value according to the intra-modal structural constraint loss function; fusing the two according to formula (7), in which $\lambda$ is a hyperparameter; and optimizing the network parameters with the back-propagation algorithm.

Fig. 4 is a schematic diagram of the intra-modal structural constraint loss according to an embodiment of the present invention.

Through the implementation of the above step 400, cross-modal structural constraint information is used to supervise the training of the retrieval model, so that network training proceeds in the direction of raising the similarity between matched target data pairs and lowering the similarity between non-matched target data pairs, and the trained network learns more discriminative image and text representations.
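The following sketch illustrates one common way to realize such structural constraint losses, assuming PyTorch. The exact forms of formulas (5)-(7) are given as images in the original document, so the margin-based ranking form used here is an assumption consistent with the variables described above (margin hyperparameter α, matched pair, non-matched pairs, label-based triplets, fusion hyperparameter λ).

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(sim, margin=0.2):
    """Bidirectional margin-based ranking loss in the spirit of formula (5).
    sim: (B, B) comprehensive similarity matrix whose diagonal holds the matched pairs;
    off-diagonal entries are the non-matched pairs (I, T-) and (I-, T)."""
    B = sim.size(0)
    pos = sim.diag().view(B, 1)
    cost_txt = (margin + sim - pos).clamp(min=0)        # image query vs. non-matching texts
    cost_img = (margin + sim - pos.t()).clamp(min=0)    # text query vs. non-matching images
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    return (cost_txt[off_diag].sum() + cost_img[off_diag].sum()) / B

def intra_modal_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet constraint in the spirit of formula (6): the sample sharing more semantic
    labels with the anchor should stay closer (in cosine distance) than the one sharing fewer."""
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)
    return (margin + d_pos - d_neg).clamp(min=0).mean()

# Formula (7): fuse the two losses with the hyperparameter lambda and back-propagate, e.g.
#   total = inter_modal_loss(sim) + lam * (intra_image_loss + intra_text_loss)
#   total.backward(); optimizer.step()
```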

In a preferred embodiment, the step of extracting the image relation features $\{r^{v}_{ij}\}$ between the visual objects through the image visual relation encoding network includes:

obtaining, via the image visual object detector, the features $v_{i}$ and $v_{j}$ of visual objects $i$ and $j$ in the image and the feature $v^{u}_{ij}$ of the joint region of the two objects, and fusing these features according to formula (8) to compute each relation feature:

$r^{v}_{ij}=\sigma\left(W_{r}\,[\,v_{i};\,v_{j};\,v^{u}_{ij}\,]\right)$    Formula (8)

where $[\,;\,]$ denotes the vector concatenation operation, $\sigma$ is the neuron activation function and $W_{r}$ is a model parameter.
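A minimal sketch of the image visual relation encoding network of formula (8), assuming PyTorch; the hidden size and the ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class ImageRelationEncoder(nn.Module):
    """Formula (8): the features of visual objects i and j and of their joint region
    are concatenated and projected through a learned layer and a nonlinearity."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.proj = nn.Linear(3 * feat_dim, feat_dim)   # W_r in formula (8)

    def forward(self, v_i, v_j, v_union):
        fused = torch.cat([v_i, v_j, v_union], dim=-1)  # [.;.;.] vector concatenation
        return torch.relu(self.proj(fused))             # sigma: neuron activation function
```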

In a preferred embodiment, the step of feeding the word vectors into the text relation encoding network to extract the text relation features $\{r^{t}_{ij}\}$ between the words includes:

computing, in the text relation encoding network, the text relation feature $r^{t}_{ij}$ between word $i$ and word $j$ according to formula (9):

$r^{t}_{ij}=\sigma\left(W_{t}\,[\,w_{i};\,w_{j}\,]\right)$    Formula (9)

where $\sigma$ denotes the neuron activation function and $W_{t}$ is a model parameter.
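A corresponding sketch of the text relation encoding network of formula (9), under the same assumptions about layer sizes and activation.

```python
import torch
import torch.nn as nn

class TextRelationEncoder(nn.Module):
    """Formula (9): the relation feature between word i and word j is computed from the
    concatenation of their word vectors."""
    def __init__(self, embed_dim=300, feat_dim=1024):
        super().__init__()
        self.proj = nn.Linear(2 * embed_dim, feat_dim)  # W_t in formula (9)

    def forward(self, w_i, w_j):
        return torch.relu(self.proj(torch.cat([w_i, w_j], dim=-1)))
```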

In a preferred embodiment, the step of computing, with the text-guided attention mechanism, the weight of each visual object contained in the image data of the target data pair, weighting the image local features $\{v_{i}\}$ of the visual objects accordingly, and obtaining the new image local representations $\{\hat{v}_{i}\}$ through a feed-forward neural network mapping includes:

computing the weight of each visual object in the image with the text-guided attention mechanism according to formula (10), whose attention parameters are model parameters; and

weighting each visual object according to formula (11) and mapping it through the feed-forward neural network, whose weights are model parameters, to obtain the new image local representation $\hat{v}_{i}$.
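A minimal sketch of the text-guided attention of formulas (10)-(11), assuming PyTorch; the concatenation-based scoring function and the softmax normalization are assumptions about the unspecified attention form. The visually-guided attention over words (formulas (12)-(13)) is symmetric, with the image global feature guiding the word weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAttention(nn.Module):
    """Formulas (10)-(11): each visual object is weighted by its relevance to the text
    global feature t^g, then mapped through a feed-forward layer to the new local
    representation (assumed scoring form)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # attention parameters of formula (10) (assumed form)
        self.ffn = nn.Linear(dim, dim)       # feed-forward mapping of formula (11)

    def forward(self, v_loc, t_glb):         # v_loc: (M, dim), t_glb: (dim,)
        guide = t_glb.unsqueeze(0).expand_as(v_loc)
        weights = F.softmax(self.score(torch.cat([v_loc, guide], dim=-1)).squeeze(-1), dim=0)  # formula (10)
        return torch.relu(self.ffn(weights.unsqueeze(-1) * v_loc))                             # formula (11)
```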

In a preferred embodiment, the step of computing, with the visually-guided attention mechanism, the weight of each word contained in the text data of the target data pair, weighting the text local features $\{t_{i}\}$ of the words accordingly, and obtaining the new text local representations $\{\hat{t}_{i}\}$ through a feed-forward neural network mapping includes:

computing the weight of each word in the text with the visually-guided attention mechanism according to formula (12), whose attention parameters are model parameters; and

weighting the text local feature $t_{i}$ of each word according to formula (13) and mapping it through the feed-forward neural network, whose weights are model parameters, to obtain the new text local representation $\hat{t}_{i}$.

In a preferred embodiment, the training data set is obtained from Wikipedia, MS COCO and Pascal VOC.

It should be noted that after the neural network model has been trained through the above steps 100-400, data of different modalities can be passed through the model to accurately output the similarity between them. Any one modality type in the test data set is used as the query modality and the other modality type as the target modality; each item of the query modality is used as a query sample to retrieve data of the target modality, and the similarity between the query sample and each retrieval target is computed with the comprehensive image-text similarity of formula (4). In one possible implementation, the neural network model may output the target-modality data with the highest similarity as the matching result, or sort the similarities in descending order and return a result list containing a preset number of target-modality data items, thereby realizing cross-modal retrieval between data of different modalities.
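The retrieval stage described above can be sketched as a simple scoring-and-ranking loop; similarity_fn is a placeholder for the comprehensive similarity of formula (4) applied to the extracted features of a query and a candidate.

```python
import torch

def retrieve(query_feats, gallery_feats, similarity_fn, top_k=5):
    """Score one query (e.g. the features of a text) against every candidate of the other
    modality and return the indices of the top-k results; similarity_fn should return a
    scalar tensor for one query-candidate pair."""
    scores = torch.stack([similarity_fn(query_feats, g) for g in gallery_feats])
    return torch.topk(scores, k=top_k).indices.tolist()
```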

This embodiment uses the MS COCO cross-modal data set for the experiments. The data set was first proposed in the literature (T. Lin, et al., Microsoft COCO: Common objects in context, ECCV 2014, pp. 740-755) and has become one of the most commonly used experimental data sets in the field of cross-modal retrieval. Each image in the data set carries five text annotations; 82,783 images and their text annotations are used as the training sample set, and 5,000 images and their text annotations are randomly selected from the remaining samples as the test sample set. To better illustrate the beneficial effects of the cross-modal retrieval method based on multi-level feature representation alignment provided by the embodiments of the present invention, the method is compared experimentally with the following three existing cross-modal retrieval methods:

Existing method 1: the Order-embedding method described in the literature (I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, Order-embeddings of images and language, ICLR, 2016.).

Existing method 2: the VSE++ method described in the literature (F. Faghri, D. Fleet, R. Kiros, and S. Fidler, VSE++: Improving visual-semantic embeddings with hard negatives, BMVC, 2018.).

Existing method 3: the c-VRANet method described in the literature (J. Yu, W. Zhang, Y. Lu, Z. Qin, et al. Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval, IEEE Transactions on Multimedia, 22(12):3196-3209, 2020.).

The experiments adopt the R@n metric, which is commonly used in the field of cross-modal retrieval, to evaluate retrieval accuracy. This metric denotes the percentage of queries for which a correct item appears among the top n retrieved samples; the higher the value, the better the retrieval result. In this experiment, n is set to 1, 5 and 10.
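A minimal sketch of how R@n can be computed from ranked retrieval results follows. The function name and the representation of ground truth as a set of correct item ids per query are illustrative assumptions.

```python
def recall_at_n(ranked_lists, ground_truth, n):
    """ranked_lists[q] is the ranked list of retrieved item ids for query q;
    ground_truth[q] is the set of correct item ids for query q."""
    hits = sum(1 for q, ranked in enumerate(ranked_lists)
               if any(item in ground_truth[q] for item in ranked[:n]))
    return 100.0 * hits / len(ranked_lists)

# Example: report R@1, R@5 and R@10 as in Table 1
# for n in (1, 5, 10):
#     print(n, recall_at_n(ranked_lists, ground_truth, n))
```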

[Table 1: R@1, R@5 and R@10 retrieval accuracy of the compared methods on MS COCO for image-to-text and text-to-image retrieval]

Table 1

As the data in Table 1 show, compared with existing cross-modal retrieval methods, the cross-modal retrieval method based on multi-level feature representation alignment provided by the present invention achieves a clear improvement in retrieval accuracy on both tasks, retrieving text data with image queries and retrieving image data with text queries, which fully demonstrates the effectiveness of the refined alignment of the global-local-relation multi-level feature representations of images and texts proposed by the present invention. For ease of understanding, a schematic diagram of text-to-image retrieval results obtained with an embodiment of the present invention is also shown in Fig. 5, in which the first column is the query text, the second column is the matching image given by the dataset, and the third to seventh columns are the five retrieval results with the highest similarity.

The above experimental results show that, compared with existing methods, the cross-modal retrieval method based on multi-level feature representation alignment of the present invention achieves higher retrieval accuracy.

In summary, the present invention provides a cross-modal retrieval method based on multi-level feature representation alignment. In the cross-modal fine-grained alignment stage, the global similarity, local similarity and relation similarity between image and text data are computed separately and fused into an image-text comprehensive similarity. In the network training stage, corresponding loss functions are designed to mine cross-modal structural constraint information and to constrain and supervise the parameter learning of the retrieval model from multiple perspectives. Finally, the retrieval results of test query examples are obtained according to the image-text comprehensive similarity. By introducing fine-grained correlations between the two different modalities of image and text, the method effectively improves the accuracy of cross-modal retrieval, and has broad market demand and application prospects in fields such as image-text retrieval and pattern recognition.

Fig. 6 is a block diagram of an apparatus for implementing the cross-modal retrieval method based on multi-level feature representation alignment, according to an exemplary embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.

Referring to Fig. 6, the apparatus 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.

The processing component 602 generally controls the overall operation of the apparatus 600, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 602 may include one or more processors 620 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 602 may include one or more modules to facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.

The memory 604 is configured to store various types of data to support operation of the apparatus 600. Examples of such data include instructions for any application program or method operated on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disk.

The power supply component 606 provides power for the various components of the apparatus 600. The power supply component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the apparatus 600.

The multimedia component 608 includes a screen providing an output interface between the apparatus 600 and the target user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the target user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the apparatus 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC), which is configured to receive external audio signals when the apparatus 600 is in an operation mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signals may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.

The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.

The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the apparatus 600. For example, the sensor component 614 may detect the open/closed state of the apparatus 600 and the relative positioning of components, such as the display and the keypad of the apparatus 600. The sensor component 614 may also detect a change in position of the apparatus 600 or of a component of the apparatus 600, the presence or absence of contact between the target user and the apparatus 600, the orientation or acceleration/deceleration of the apparatus 600, and a change in the temperature of the apparatus 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.

The communication component 616 is configured to facilitate wired or wireless communication between the apparatus 600 and other devices. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the apparatus 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the above method.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 604 including instructions, which can be executed by the processor 620 of the apparatus 600 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

A non-transitory computer-readable storage medium is provided, wherein, when the instructions in the storage medium are executed by the processor of the apparatus 600, the apparatus 600 is enabled to perform a cross-modal retrieval method based on multi-level feature representation alignment, the method including:

acquiring a training data set, wherein, for each group of data pairs in the training data set, the data pair includes image data, text data, and a semantic label jointly corresponding to the image data and the text data;

for each group of data pairs in the training data set, respectively extracting the image global feature, image local features and image relation features corresponding to the image data in the data pair, and the text global feature, text local features and text relation features corresponding to the text data in the data pair;

for a target data pair composed of any image data and any text data in the training data set, calculating the image-text comprehensive similarity corresponding to the target data pair according to the image global feature and text global feature corresponding to the target data pair, the image local features and text local features corresponding to the target data pair, and the image relation features and text relation features corresponding to the target data pair;

based on the image-text comprehensive similarity corresponding to each group of target data pairs, designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function, and training the neural network model by using the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
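For illustration only, the following is a minimal sketch of how the two structural constraint losses of this training step can be combined. The hinge-style ranking form, the margin value, and the fusion weight `lam` are assumptions made for demonstration, in the spirit of formulas (5)-(7), whose exact forms appear only as images in this publication.

```python
import torch.nn.functional as F

def inter_modal_loss(s_pos, s_neg_text, s_neg_img, margin=0.2):
    """Ranking-style inter-modal structural constraint (formula (5) analogue):
    a matched image-text pair should score higher than non-matched pairs."""
    return (F.relu(margin - s_pos + s_neg_text) +
            F.relu(margin - s_pos + s_neg_img)).mean()

def intra_modal_loss(anchor, pos, neg, margin=0.2):
    """Triplet-style intra-modal structural constraint (formula (6) analogue):
    samples sharing more semantic labels should lie closer in feature space."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    return F.relu(margin - sim(anchor, pos) + sim(anchor, neg)).mean()

def total_loss(l_inter, l_intra_img, l_intra_txt, lam=1.0):
    """Fusion of the two constraints (formula (7) analogue) with weight lam."""
    return l_inter + lam * (l_intra_img + l_intra_txt)
```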

Fig. 7 is a block diagram of an apparatus for implementing the cross-modal retrieval method based on multi-level feature representation alignment, according to an exemplary embodiment. For example, the apparatus 700 may be provided as a server. Referring to Fig. 7, the apparatus 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as application programs. The application programs stored in the memory 732 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 722 is configured to execute the instructions so as to perform the above cross-modal retrieval method.

The apparatus 700 may also include a power supply component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input/output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

Although the present invention has been described in detail above by means of a general description, specific embodiments and tests, it is obvious to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Therefore, such modifications or improvements made without departing from the spirit of the present invention all fall within the scope of protection claimed by the present invention.

Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present invention is intended to cover any variations, uses or adaptations of the present invention that follow the general principles of the present invention and include common knowledge or customary technical means in the technical field not disclosed by the present invention. It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope.

Claims (10)

1. A cross-modal retrieval method based on multi-level feature representation alignment is characterized by comprising the following steps:
acquiring a training data set, wherein for each group of data pairs in the training data set, the data pairs comprise image data, text data and semantic labels corresponding to the image data and the text data together;
for each group of data pairs in the training data set, respectively extracting image global features, image local features and image relation features corresponding to image data in the data pairs, and text global features, text local features and text relation features corresponding to text data in the data pairs;
for a target data pair consisting of any image data and any text data in the training data set, calculating to obtain image-text comprehensive similarity corresponding to the target data pair according to image global features and text global features corresponding to the target data pair, image local features and text local features corresponding to the target data pair, and image relation features and text relation features corresponding to the target data pair;
designing an inter-modal structural constraint loss function and an intra-modal structural constraint loss function based on the corresponding image-text comprehensive similarity of each group of target data, and training a neural network model by adopting the inter-modal structural constraint loss function and the intra-modal structural constraint loss function.
2. The method according to claim 1, wherein the step of extracting, for each group of data pairs in the training data set, an image global feature, an image local feature and an image relation feature corresponding to image data in the data pair, and a text global feature, a text local feature and a text relation feature corresponding to text data in the data pair, respectively, comprises:
for each group of data pairs in the training data set, extracting the image global characteristics of the image data corresponding to the data pairs by adopting a Convolutional Neural Network (CNN)
Figure 453193DEST_PATH_IMAGE001
Then, a visual target detector is used to detect the visual targets included in the image data and extract the image local features of each visual target
Figure 504325DEST_PATH_IMAGE002
wherein M is the number of visual targets comprised by the image data,
Figure 429556DEST_PATH_IMAGE003
is the image local feature of the visual target
Figure 653864DEST_PATH_IMAGE004
; extracting image relation features among the visual targets through an image visual relation coding network
Figure 918623DEST_PATH_IMAGE005
Wherein
Figure 202974DEST_PATH_IMAGE006
is the image relation feature between the visual target
Figure 287605DEST_PATH_IMAGE004
and the visual target
Figure 315603DEST_PATH_IMAGE007
;
for each group of data pairs in the training data set, converting each word in the text data corresponding to the data pair into a word vector using a word embedding model
Figure 933404DEST_PATH_IMAGE008
wherein N is the number of words included in the text data; inputting each word vector in sequence into a recurrent neural network to obtain the text global feature corresponding to the text data
Figure 388656DEST_PATH_IMAGE009
Then, each word vector is input to a feedforward neural network to obtain the local text characteristics corresponding to each word
Figure 226162DEST_PATH_IMAGE010
Simultaneously, each word vector is input into a text relation coding network to extract text relation characteristics among words
Figure 792273DEST_PATH_IMAGE011
Wherein
Figure 766045DEST_PATH_IMAGE012
is the text relation feature between the word
Figure 392199DEST_PATH_IMAGE004
and the word
Figure 513738DEST_PATH_IMAGE007
.
3. The method according to claim 2, wherein the step of calculating, for a target data pair composed of any image data and any text data in the training data set, an image-text comprehensive similarity corresponding to the target data pair according to an image global feature and a text global feature corresponding to the target data pair, an image local feature and a text local feature corresponding to the target data pair, and an image relationship feature and a text relationship feature corresponding to the target data pair includes:
for a target data pair consisting of any image data and any text data in the training data set, based on image global features corresponding to the image data in the target data pair
Figure 821223DEST_PATH_IMAGE013
and the text global feature corresponding to the text data
Figure 977398DEST_PATH_IMAGE009
, calculating the cosine distance between them to obtain the image-text global similarity corresponding to the target data pair
Figure 774453DEST_PATH_IMAGE014
; wherein the image-text global similarity
Figure 837085DEST_PATH_IMAGE015
is as in formula (1):
Figure 744998DEST_PATH_IMAGE016
Formula (1)
calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, and weighting the image local feature of each visual target
Figure 693362DEST_PATH_IMAGE017
by its corresponding weight to obtain a new image local representation through feedforward neural network mapping
Figure 661318DEST_PATH_IMAGE018
then, a visual-guided attention mechanism is adopted to calculate the weight of each word included in the text data in the target data pair, and the text local feature of each word
Figure 695134DEST_PATH_IMAGE019
is weighted by its corresponding weight to obtain a new text local representation through feedforward neural network mapping
Figure 406738DEST_PATH_IMAGE020
from the respective image local representations
Figure 475188DEST_PATH_IMAGE018
and the respective text local representations, calculating the cosine similarities between all visual targets and words, and calculating the image-text local similarity corresponding to the target data pair according to the mean value of the cosine similarities
Figure 614045DEST_PATH_IMAGE021
; wherein the image-text local similarity
Figure 931894DEST_PATH_IMAGE021
is as in formula (2), where M is the number of visual targets and N is the number of words:
Figure 883407DEST_PATH_IMAGE022
formula (2)
Calculating to obtain image-text relation similarity corresponding to the target data pair according to the cosine similarity mean value of each image relation feature and each text relation feature in the target data pair
Figure 806364DEST_PATH_IMAGE023
; wherein the image-text relation similarity
Figure 053805DEST_PATH_IMAGE023
is as in formula (3), where P is the number of relations of the image data and the text data:
Figure 858950DEST_PATH_IMAGE024
formula (3)
according to the image-text global similarity corresponding to the target data pair
Figure 912357DEST_PATH_IMAGE014
, the image-text local similarity
Figure 689820DEST_PATH_IMAGE021
and the image-text relation similarity
Figure 170480DEST_PATH_IMAGE023
, calculating to obtain the image-text comprehensive similarity corresponding to the target data pair
Figure 728500DEST_PATH_IMAGE025
; wherein the image-text comprehensive similarity
Figure 756237DEST_PATH_IMAGE025
is calculated as in formula (4):
Figure 716103DEST_PATH_IMAGE026
equation (4).
4. The method according to claim 3, wherein the inter-modal structural constraint loss function is calculated as in formula (5), where B is the number of samples,
Figure 305347DEST_PATH_IMAGE027
is a model hyper-parameter,
Figure 085084DEST_PATH_IMAGE028
is a matched target data pair, and
Figure 417977DEST_PATH_IMAGE029
and
Figure 232349DEST_PATH_IMAGE030
are non-matched target data pairs:
Figure 992494DEST_PATH_IMAGE031
formula (5)
The calculation formula of the intra-modal structure constraint loss function is shown as formula (6), wherein,
Figure 259528DEST_PATH_IMAGE032
is an image triplet in which, compared with
Figure 894646DEST_PATH_IMAGE033
,
Figure 563525DEST_PATH_IMAGE034
and
Figure 494572DEST_PATH_IMAGE035
share more common semantic labels,
Figure 248901DEST_PATH_IMAGE036
is a text triplet in which, compared with
Figure 251492DEST_PATH_IMAGE037
,
Figure 712561DEST_PATH_IMAGE038
and
Figure 876826DEST_PATH_IMAGE039
share more common semantic labels:
Figure 56134DEST_PATH_IMAGE040
equation (6).
5. The method of claim 4, wherein the step of training a neural network model using the inter-modal and intra-modal structural constraint loss functions comprises:
randomly sampling from the training data set to obtain a matched target data pair, a non-matched target data pair, an image triple and a text triple, respectively calculating an inter-modal structure constraint loss function value according to the inter-modal structure constraint loss function, calculating an intra-modal structure constraint loss function value according to the intra-modal structure constraint loss function, fusing according to a formula (7), and optimizing network parameters by using a back propagation algorithm:
Figure 596837DEST_PATH_IMAGE041
formula (7)
Wherein
Figure 416806DEST_PATH_IMAGE042
Is a hyper-parameter.
6. The method according to claim 2, wherein the extracting of the image relation features between the visual targets through the image visual relation coding network
Figure 17552DEST_PATH_IMAGE043
The method comprises the following steps:
obtaining, via the image visual target detector, the features of the visual target
Figure 418577DEST_PATH_IMAGE004
and the visual target
Figure 762971DEST_PATH_IMAGE007
in the image, namely
Figure 933052DEST_PATH_IMAGE044
and
Figure 704699DEST_PATH_IMAGE045
, and the feature of the union region of the two targets
Figure 593021DEST_PATH_IMAGE046
, and fusing these features by formula (8) to calculate the relation feature:
Figure 475526DEST_PATH_IMAGE047
formula (8)
wherein [ ] denotes the vector concatenation operation,
Figure 264229DEST_PATH_IMAGE048
denotes the neuron activation function, and
Figure 206777DEST_PATH_IMAGE049
are model parameters.
7. The method of claim 2, wherein inputting the word vectors into the text-relation coding network extracts text-relation features between words
Figure 582394DEST_PATH_IMAGE050
The method comprises the following steps:
in the text relation coding network, formula (9) is used to calculate the text relation feature between the word
Figure 268591DEST_PATH_IMAGE004
and the word
Figure 475581DEST_PATH_IMAGE007
, namely
Figure 526714DEST_PATH_IMAGE051
Figure 451944DEST_PATH_IMAGE052
Formula (9)
Wherein,
Figure 613935DEST_PATH_IMAGE048
denotes the neuron activation function, and the remaining terms are model parameters.
8. The method of claim 3, wherein the step of calculating the weight of each visual target included in the image data in the target data pair by adopting a text-guided attention mechanism, weighting the image local feature of each visual target
Figure 941012DEST_PATH_IMAGE053
by the corresponding weight, and obtaining a new image local representation through feedforward neural network mapping comprises the following steps:
using the text-guided attention mechanism, the weight of each visual object in the image is calculated by equation (10):
Figure 959783DEST_PATH_IMAGE054
formula (10)
Wherein,
Figure 808528DEST_PATH_IMAGE055
Figure 836527DEST_PATH_IMAGE056
are model parameters;
each visual target is weighted by formula (11) and a new image local representation is obtained through feed-forward neural network mapping
Figure 18110DEST_PATH_IMAGE057
Figure 411045DEST_PATH_IMAGE058
Formula (11)
Wherein,
Figure 45289DEST_PATH_IMAGE059
are model parameters.
9. The method of claim 3, wherein the step of calculating the weight of each word included in the text data in the target data pair by adopting a visual-guided attention mechanism, weighting the text local feature of each word
Figure 549082DEST_PATH_IMAGE019
by the corresponding weight, and obtaining a new text local representation
Figure 850751DEST_PATH_IMAGE020
through feedforward neural network mapping comprises the following steps:
using the visual guidance attention mechanism, the weight of each word in the text is calculated by equation (12):
Figure 476904DEST_PATH_IMAGE060
formula (12)
Wherein,
Figure 536127DEST_PATH_IMAGE061
Figure 905928DEST_PATH_IMAGE062
are model parameters;
local text features for individual words by means of formula (13)
Figure 62103DEST_PATH_IMAGE019
Weighting corresponding weight, and obtaining new text local representation through feedforward neural network mapping
Figure 295376DEST_PATH_IMAGE020
Figure 904212DEST_PATH_IMAGE063
Formula (13)
Wherein,
Figure 749808DEST_PATH_IMAGE064
are model parameters.
10. The method of claim 1, wherein the training data set is obtained via Wikipedia, MS COCO, Pascal Voc.
CN202111149240.4A 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment Active CN113792207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111149240.4A CN113792207B (en) 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111149240.4A CN113792207B (en) 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment

Publications (2)

Publication Number Publication Date
CN113792207A true CN113792207A (en) 2021-12-14
CN113792207B CN113792207B (en) 2023-11-17

Family

ID=78877521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111149240.4A Active CN113792207B (en) 2021-09-29 2021-09-29 Cross-modal retrieval method based on multi-level feature representation alignment

Country Status (1)

Country Link
CN (1) CN113792207B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110122A (en) * 2018-06-22 2019-08-09 北京交通大学 Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval
CN110490946A (en) * 2019-07-15 2019-11-22 同济大学 Text generation image method based on cross-module state similarity and generation confrontation network
CN111026894A (en) * 2019-12-12 2020-04-17 清华大学 A Cross-modal Image Text Retrieval Method Based on Credibility Adaptive Matching Network
CN112148916A (en) * 2020-09-28 2020-12-29 华中科技大学 Cross-modal retrieval method, device, equipment and medium based on supervision
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN113157974A (en) * 2021-03-24 2021-07-23 西安维塑智能科技有限公司 Pedestrian retrieval method based on character expression

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230162490A1 (en) * 2021-11-19 2023-05-25 Salesforce.Com, Inc. Systems and methods for vision-language distribution alignment
US12112523B2 (en) * 2021-11-19 2024-10-08 Salesforce, Inc. Systems and methods for vision-language distribution alignment
CN114239730A (en) * 2021-12-20 2022-03-25 华侨大学 A Cross-modal Retrieval Method Based on Neighbor Ranking Relation
CN114550302A (en) * 2022-02-25 2022-05-27 北京京东尚科信息技术有限公司 Method and device for generating action sequence and method and device for training correlation model
CN115129917A (en) * 2022-06-06 2022-09-30 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common features
CN115129917B (en) * 2022-06-06 2024-04-09 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics
CN114880441A (en) * 2022-07-06 2022-08-09 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamically weighted cross-modal fusion network retrieval method, system, and electronic device
CN116402063A (en) * 2023-06-09 2023-07-07 华南师范大学 Multimodal satire recognition method, device, equipment and storage medium
CN116402063B (en) * 2023-06-09 2023-08-15 华南师范大学 Multimodal satire recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113792207B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
US11120078B2 (en) Method and device for video processing, electronic device, and storage medium
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
TWI754855B (en) Method and device, electronic equipment for face image recognition and storage medium thereof
CN107491541B (en) Text classification method and device
CN111259148B (en) Information processing method, device and storage medium
TWI766286B (en) Image processing method and image processing device, electronic device and computer-readable storage medium
CN110008401B (en) Keyword extraction method, keyword extraction device, and computer-readable storage medium
WO2022011892A1 (en) Network training method and apparatus, target detection method and apparatus, and electronic device
CN109145213B (en) Method and device for query recommendation based on historical information
CN111368541B (en) Named entity identification method and device
CN109800325A (en) Video recommendation method, device and computer readable storage medium
CN111931844B (en) Image processing method and device, electronic equipment and storage medium
CN110781305A (en) Text classification method and device based on classification model and model training method
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111259967A (en) Image classification and neural network training method, device, equipment and storage medium
KR20210091076A (en) Method and apparatus for processing video, electronic device, medium and computer program
CN111611490A (en) Resource searching method, device, equipment and storage medium
CN114332503A (en) Object re-identification method and device, electronic device and storage medium
CN113705210B (en) A method and device for generating article outline and a device for generating article outline
CN116166843B (en) Text video cross-modal retrieval method and device based on fine granularity perception
CN112926310B (en) Keyword extraction method and device
CN115294327A (en) A small target detection method, device and storage medium based on knowledge graph
WO2024179519A1 (en) Semantic recognition method and apparatus
CN113869063A (en) Data recommendation method and device, electronic equipment and storage medium
CN111538998B (en) Text encryption method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: 314000 No. 899, guangqiong Road, Nanhu District, Jiaxing City, Zhejiang Province

Patentee after: Jiaxing University

Country or region after: China

Address before: No. 899 Guangqiong Road, Nanhu District, Jiaxing City, Zhejiang Province

Patentee before: JIAXING University

Country or region before: China