CN117390213A

CN117390213A - Training method of image and text retrieval model based on OSCAR and method of implementing image and text retrieval

Info

Publication number: CN117390213A
Application number: CN202311395517.0A
Authority: CN
Inventors: 武芳宇; 邱文婷; 刘净心; 林永义
Original assignee: Xian Jiaotong Liverpool University
Current assignee: Xian Jiaotong Liverpool University
Priority date: 2023-10-26
Filing date: 2023-10-26
Publication date: 2024-01-12

Abstract

The present invention provides a training method for an OSCAR-based image-text retrieval model and a method for performing image-text retrieval. The training method includes: obtaining a training set; inputting multiple image-text sample pairs from the training set into OSCAR, a model pre-trained for vision-language tasks, and performing feature extraction to obtain image feature representations and text feature representations; taking each sample in the training set as an anchor sample and generating, based on the image feature representations and text feature representations, multiple negative samples of different difficulty for that anchor sample; calculating the positive similarity between image and text in the positive sample pairs, and the negative similarity between image and text in the negative sample pairs and the generated negative sample pairs; and calculating a loss function from the positive and negative similarities and using it to fine-tune the pre-trained OSCAR model, yielding the trained OSCAR image-text retrieval model. This improves the generalization ability of the model and the accuracy and efficiency of its image-text retrieval.

Description

Training method for an OSCAR-based image-text retrieval model and method for performing image-text retrieval

Technical field

The invention relates to the technical field of information retrieval, and in particular to a training method for an OSCAR-based image-text retrieval model and a method for performing image-text retrieval.

Background

The purpose of image-text retrieval is to associate a given image with its corresponding text description, thereby matching images and text. Image-text retrieval plays a key role in many important cross-modal tasks, such as semantic image retrieval, image description, and visual quality assurance. However, image-text matching faces two main challenges: the heterogeneity gap and the semantic gap. The heterogeneity gap refers to the inconsistent feature representations of image and text data from different modalities, while the semantic gap refers to the misalignment that arises when capturing cross-modal correspondences between images and text.

Currently, many studies bridge the heterogeneity gap by extracting image and text features with pre-trained modules such as convolutional neural networks and recurrent neural networks. However, the feature extractors in these pre-trained modules have not been specifically trained on, or adapted to, image-text pair data, so they cannot produce good image or text embeddings. Another common image-text matching approach uses a triplet loss to encourage the model to score positive image-text pairs higher than negative image-text pairs. However, existing cost function designs do not fully account for the difficulty of negative samples, which is one of the main causes of inaccurate image-text matching. Some studies have shown that increasing the batch size to obtain more negative samples sharply increases computational complexity while yielding diminishing performance gains.

OSCAR currently performs very strongly on vision-language tasks. It has been pre-trained on several million image-text pairs and processes images and text jointly to obtain meaningful feature representations, which lets it capture the intricate associations between text and images and learn more discriminative image-text embeddings. The OSCAR model learns and understands the feature representations of images and text well, but its generalization ability is still weak.

Therefore, this application builds a new image-text retrieval model on top of OSCAR to improve the model's generalization ability and the accuracy and efficiency of its image-text retrieval.

Summary of the invention

The purpose of the present invention is to provide a training method for an OSCAR-based image-text retrieval model and a method for performing image-text retrieval, which improve the generalization ability of the model and the accuracy and efficiency of its image-text retrieval.

To achieve the above purpose, the present invention provides the following technical solutions:

In a first aspect, the present invention provides a training method for an OSCAR-based image-text retrieval model, the method including:

obtaining a training set, the training set including a plurality of image-text sample pairs;

inputting the plurality of image-text sample pairs in the training set into OSCAR, a model pre-trained for vision-language tasks, and performing feature extraction to generate image feature representations and text feature representations;

taking each sample in the training set as an anchor sample and generating, based on the image feature representations and the text feature representations, multiple negative samples of different difficulty corresponding to the anchor sample, each generated negative sample forming a generated negative sample pair with the anchor sample;

calculating the positive similarity between image and text in the positive sample pairs, and the negative similarity between image and text in the negative sample pairs and the generated negative sample pairs;

calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-trained OSCAR model with the loss function to obtain the trained OSCAR image-text retrieval model.

Further, taking each sample in the training set as an anchor sample and generating, based on the image feature representations and the text feature representations, multiple negative samples of different difficulty corresponding to the anchor sample includes:

selecting a sample as the anchor sample $q$, the sample being an image sample or a text sample;

based on the anchor sample $q$, performing global semantic clustering on the samples in the training set to obtain a set of negative-sample clusters $G = \{g_1, g_2, \ldots, g_M\}$, where $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a set of $N$ negative samples with similar semantics, $x_{ij}$ is the $j$-th negative sample in the set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$;

calculating, based on a kernel function, the similarity between each negative sample and the anchor sample $q$ and the corresponding weight, and taking a weighted average to obtain multiple negative samples of different difficulty.

Further, calculating the similarity and the corresponding weight based on the kernel function and taking a weighted average to obtain multiple negative samples of different difficulty includes:

calculating the similarity between each negative sample and the anchor sample with a Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

where $k$ is the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\lVert\cdot\rVert$ is the norm distance, and $\sigma$ is the width parameter;

calculating the weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample according to:

$$J(W) = \min \lVert X - W_n \rVert$$

where $J(W)$ is the least-squares cost function representing the error; $W$ is the weight matrix to be optimized; $X$ is the input data matrix, in which each row is a negative sample and each column is a feature; $W_n$ is a weight value of the weight matrix $W$; and $\lVert\cdot\rVert$ measures the error;

calculating the generated negative sample as the weighted average of the negative samples using these weights, the result being the generated negative sample corresponding to the anchor sample.

Further, the loss function can be expressed by a formula in which $v$ denotes the image feature representation and $c$ denotes the text feature representation; $s^{vc+}$ is the positive similarity when the anchor sample is an image sample, and $s^{cv+}$ is the positive similarity when the anchor sample is a text sample; $S^{vc}$ is the set of positive and negative similarities when the anchor sample is an image sample, and $S^{cv}$ is the set of positive and negative similarities when the anchor sample is a text sample; two additional terms are penalty terms; $\tau$ is a hyperparameter; and $\lVert\cdot\rVert$ denotes set size.

Further, inputting the plurality of image-text sample pairs in the training set into the pre-trained OSCAR image-text retrieval model and performing feature extraction to generate image feature representations and text feature representations includes:

obtaining the image samples in the training set, extracting the regional visual features and regional position features of each image sample, and linearly combining the regional visual features and the regional position features to obtain an image embedding, each image sample containing n object regions;

obtaining the text samples in the training set, splitting each text sample into multiple tokens by tokenization, and obtaining the text embedding of each token from the OSCAR-base model;

generating a joint feature representation from the image embedding and the text embedding with an attention mechanism, and generating the image feature representation and the text feature representation by average pooling.

In a second aspect, the present invention also provides a method for performing image-text retrieval with an OSCAR image-text retrieval model, the model being trained by the training method of any one of claims 1 to 5, the method including:

obtaining the target text and target image to be retrieved;

performing feature extraction on the target text with the text encoder of the image-text retrieval model to obtain a text feature representation;

performing feature extraction on the target image with the image encoder of the image-text retrieval model to obtain an image feature representation;

determining, based on the text feature representation and the image feature representation, an image retrieval result for the target text among the target images, and/or a text retrieval result for the target image among the target texts.
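The retrieval step above amounts to ranking candidates by their similarity to a query in the shared feature space. The following minimal Python sketch illustrates this under stated assumptions: feature vectors are plain lists of floats, similarity is cosine similarity (the text does not name the similarity measure), and `cosine_similarity` and `rank_candidates` are hypothetical helper names that are not part of OSCAR.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_feature, candidate_features, top_k=5):
    """Return candidate ids sorted by decreasing similarity to the query.

    candidate_features maps a candidate id to its feature vector. For
    text-to-image retrieval the query is a text feature and the candidates
    are image features; image-to-text retrieval swaps the roles.
    """
    scored = [
        (cosine_similarity(query_feature, feature), candidate_id)
        for candidate_id, feature in candidate_features.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate_id for _, candidate_id in scored[:top_k]]


# Toy example: one text query against three image features.
text_query = [1.0, 0.0]
images = {"img_a": [0.9, 0.1], "img_b": [0.0, 1.0], "img_c": [0.5, 0.5]}
ranking = rank_candidates(text_query, images, top_k=2)
```

Here `ranking` holds the ids of the two images whose features point most nearly in the same direction as the query. In practice the features would come from the trained model's encoders rather than hand-written vectors.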

In a third aspect, the present invention also provides an OSCAR-based image-text retrieval model training device, the device including:

a data acquisition module, configured to obtain a training set, the training set including a plurality of image-text sample pairs;

a feature extraction module, configured to input the plurality of image-text sample pairs in the training set into OSCAR, a model pre-trained for vision-language tasks, and perform feature extraction to generate image feature representations and text feature representations;

a negative sample synthesis module, configured to take each sample in the training set as an anchor sample and generate, based on the image feature representations and the text feature representations, multiple negative samples of different difficulty corresponding to the anchor sample, each generated negative sample forming a generated negative sample pair with the anchor sample;

a similarity calculation module, configured to calculate the positive similarity between image and text in the positive sample pairs, and the negative similarity between image and text in the negative sample pairs and the generated negative sample pairs;

a contrastive loss calculation module, configured to calculate a loss function based on the positive similarity and the negative similarity, and to fine-tune the pre-trained OSCAR model with the loss function to obtain the trained OSCAR image-text retrieval model.

In a fourth aspect, the present invention also provides a computer device including a processor and a memory; the memory stores at least one instruction, and the at least one instruction is executed by the processor to implement any of the methods described above.

In a fifth aspect, the present invention also provides a computer-readable storage medium that stores at least one instruction, the at least one instruction being executed by a processor to implement any of the methods described above.

The beneficial effects of the present invention are as follows. The training method for an OSCAR-based image-text retrieval model provided by the embodiments uses the vision-language pre-trained model OSCAR to extract features from image samples and text samples, and generates challenging negative samples through a negative sample synthesis module, which raises the difficulty of matching images with text during training. A loss function is designed from the positive similarity between image and text in the positive sample pairs and the negative similarity between image and text in the negative sample pairs and the generated negative sample pairs. Training with this new loss function yields the target OSCAR model and improves the generalization ability of the image-text retrieval model, which in turn improves the efficiency and accuracy of image-text retrieval.

The above is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, preferred embodiments are described in detail below with reference to the accompanying drawings.

Description of the drawings

Figure 1 is a schematic flowchart of an OSCAR-based image-text retrieval model training method provided by an embodiment of the present invention;

Figure 2 is a schematic flowchart of a method for performing image-text retrieval provided by an embodiment of the present invention;

Figure 3 is a structural block diagram of an OSCAR-based image-text retrieval model training device provided by an embodiment of the present invention;

Figure 4 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In addition, the term "and/or" in this document merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" in this document generally indicates an "or" relationship between the objects before and after it.

The embodiments of this application provide an OSCAR-based image-text retrieval model training method. The method may be executed by, including but not limited to, at least one of a server, a terminal, or another electronic device that can be configured to execute the method provided by the embodiments of this application.

Figure 1 is a schematic flowchart of an OSCAR-based image-text retrieval model training method provided by an embodiment of the present invention. In this embodiment, the training method includes:

Step S101: obtain a training set, which includes multiple image-text sample pairs.

In this embodiment, the data set can be obtained from a designated open-source natural language learning model corpus, or a large number of image-text pairs can be obtained from a designated website using a Python script with data-scraping capability. This embodiment does not specifically limit how the training set is obtained.

Step S102: input multiple image-text sample pairs from the training set into OSCAR, a model pre-trained for vision-language tasks, and perform feature extraction to obtain image feature representations and text feature representations.

The pre-trained OSCAR vision-language model has already been pre-trained on several million image-text pairs. It processes images and text jointly to obtain meaningful feature representations, capturing the intricate associations between text and images and learning more discriminative image-text embeddings. In other words, the OSCAR model learns and understands the feature representations of images and text well and can extract richer feature information from both.

Specifically, generating the image feature representation and the text feature representation with the pre-trained OSCAR model includes the following steps:

1) Obtain the image samples in the training set, extract the regional visual features and regional position features of each image sample, and linearly combine the regional visual features and regional position features to obtain the image embedding.

Each image sample is divided into n object regions.

In one example, a Faster R-CNN model pre-trained on the Visual Genome dataset extracts the regional visual features and regional position features of the image, and a linear projection combines the two linearly, yielding the image embedding for each image sample.
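As an illustration of the linear combination in step 1), the sketch below projects a region's visual feature and its position feature into a common space and sums them. It is a simplified assumption, not the patented implementation: the projection weights are toy values rather than learned parameters, the 3-dimensional visual feature and 4-dimensional box position are placeholders for the much larger Faster R-CNN features, and `linear` and `region_embedding` are hypothetical names.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, where weight is given as a list of rows."""
    return [
        sum(w * x_j for w, x_j in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def region_embedding(visual, position, w_vis, b_vis, w_pos, b_pos):
    """Embed one object region as the sum of two linear projections."""
    projected_visual = linear(visual, w_vis, b_vis)
    projected_position = linear(position, w_pos, b_pos)
    return [v + p for v, p in zip(projected_visual, projected_position)]


# Toy parameters: project a 3-d visual feature and a 4-d box position
# (x1, y1, x2, y2, normalised to [0, 1]) into a 2-d embedding space.
w_vis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_vis = [0.0, 0.0]
w_pos = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0]]
b_pos = [0.0, 0.0]

# An image with n = 2 object regions yields n region embeddings.
regions = [
    ([2.0, 4.0, 6.0], [0.0, 0.0, 1.0, 1.0]),
    ([1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0]),
]
image_embedding = [
    region_embedding(vis, pos, w_vis, b_vis, w_pos, b_pos)
    for vis, pos in regions
]
```

Summing two projections is one way to realise a linear combination; concatenating the two features and applying a single projection is mathematically equivalent and may be closer to how a real model is wired.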

2) Obtain the text samples in the training set, split each text sample into multiple tokens by tokenization, and obtain the text embedding of each token from the OSCAR-base model.

In this embodiment, a given text sample $c$ is first split by tokenization into $z$ tokens, $c = \{o_1, o_2, \ldots, o_z\}$, where $o_i$ is the $i$-th token of the text sample. The OSCAR-base model then yields the text embedding $E_{tok}$ of each token, and the text embedding of the text sample is formed from these per-token embeddings.

3) Based on the image embedding and the text embedding, generate a joint feature representation with an attention mechanism, and generate the image feature representation and the text feature representation by average pooling.

In this embodiment, the image embedding and the text embedding are fed into the single Transformer model of the OSCAR vision-language model to obtain a joint feature representation of image and text. Average pooling then maps the local image features and local text features to lower-dimensional global features while retaining their average information, producing the image feature representation and the text feature representation. The Transformer model relies on an attention mechanism to capture the complex relationships between image and text elements and derives the joint feature representation of the image-text pair from these mutual relationships.
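The average pooling described above can be sketched as follows. This is a minimal illustration under assumptions the text does not state: the Transformer output is given as a list of per-position vectors, text token positions come first and image region positions follow, and `mean_pool` and `pool_joint_representation` are hypothetical names.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one global vector."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


def pool_joint_representation(joint, num_text_tokens):
    """Split the joint sequence by modality and pool each part.

    Assumption: the first num_text_tokens positions belong to the text
    and the remaining positions belong to image regions.
    """
    text_part = joint[:num_text_tokens]
    image_part = joint[num_text_tokens:]
    return mean_pool(image_part), mean_pool(text_part)


# Toy joint representation: two text positions, then one region position.
joint = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
image_feature, text_feature = pool_joint_representation(joint, 2)
```

Pooling collapses a variable number of local vectors into one fixed-size vector per modality, which is what allows an image and a caption to be compared with a single similarity score.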

Step S103: take each sample in the training set as an anchor sample and generate, based on the image feature representations and text feature representations, multiple negative samples of different difficulty corresponding to the anchor sample. Each generated negative sample forms a generated negative sample pair with the anchor sample.

During training of an image-text retrieval model, the diversity of the samples affects the model's retrieval performance. This embodiment therefore designs a negative sample synthesis module that generates negative samples of different difficulty; training the model on challenging negative samples improves its generalization ability.

Generating the multiple negative samples of different difficulty corresponding to the anchor sample includes the following steps:

1) Select a sample as the anchor sample $q$; the sample is an image sample or a text sample.

The embodiments below take the case where the anchor sample $q$ is an image sample as the example.

2) Based on the anchor sample $q$, perform global semantic clustering on the samples in the training set to obtain a set of negative-sample clusters $G = \{g_1, g_2, \ldots, g_M\}$, where $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a set of $N$ negative samples with similar semantics, $x_{ij}$ is the $j$-th negative sample in the set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$.

Specifically, the negative samples that do not match the anchor sample are selected from the mini-batch of the training set, and the k-means algorithm is run on them to partition them by semantics into several negative sample sets. These sets make up the final set of negative-sample clusters $G = \{g_1, g_2, \ldots, g_M\}$, each element of which is a group of semantically similar negative samples. The number of negative sample sets is determined by the k-means parameter k, which is usually specified before the algorithm is run.
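The clustering of in-batch negatives can be illustrated with a small pure-Python k-means. This is a sketch under assumptions: the embodiment names k-means but specifies neither the distance nor the initialisation, so squared Euclidean distance and seeded random initialisation are used here, and `kmeans` and its helpers are hypothetical names.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def kmeans(points, k, iterations=20, seed=0):
    """Partition points into at most k clusters with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid when a cluster is empty.
        centroids = [
            mean_vector(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return [cluster for cluster in clusters if cluster]


# Toy negatives for one anchor: two clearly separated semantic groups.
negatives = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters = kmeans(negatives, k=2)
```

Each returned cluster plays the role of one set $g_i$, so the number of non-empty clusters corresponds to $M$. A production pipeline would more likely call an existing k-means implementation on the feature matrix.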

3) Based on a kernel function, calculate the similarity between each negative sample and the anchor sample $q$ and the corresponding weight, and take a weighted average to obtain multiple negative samples of different difficulty.

In this embodiment, the kernel function is a Gaussian radial basis function. Specifically, calculating the similarity and the corresponding weight based on the kernel function and taking a weighted average to obtain multiple negative samples of different difficulty includes the following steps:

1) Calculate the similarity between each negative sample and the anchor sample with the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

where $k$ is the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\lVert\cdot\rVert$ is the norm distance, and $\sigma$ is the width parameter;

2) Calculate the weight $w_n$ corresponding to the similarity between each negative sample and the anchor sample according to:

$$J(W) = \min \lVert X - W \rVert$$

where $J(W)$ is the least-squares cost function representing the error; $W$ is the weight matrix to be optimized; $X$ is the input data matrix, in which each row is a sample and each column is a feature; and $\lVert\cdot\rVert$ is the norm distance;

This embodiment optimizes the weight matrix by the least squares method, whose goal is to adjust the parameter matrix $W$ so that the error $J(W)$ is minimized.

3) Calculate the generated negative sample as the weighted average of the negative samples using these weights.

If the anchor sample is an image sample $v$, the generated negative sample is a text negative sample; if the anchor sample is a text sample $c$, the generated negative sample is an image negative sample.
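The kernel similarity and weighted-average synthesis can be sketched as below. Two assumptions are made loudly: the Gaussian radial basis function is taken in its common form $\exp(-\lVert q - x \rVert^2 / (2\sigma^2))$, and the weights are simply the normalised kernel similarities, which stand in for the least-squares weights described in the text. The names `rbf_similarity` and `synthesize_negative` are hypothetical.

```python
import math


def rbf_similarity(q, x, sigma=1.0):
    """Gaussian RBF similarity exp(-||q - x||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(q, x))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def synthesize_negative(anchor, cluster, sigma=1.0):
    """Weighted average of the negatives in one cluster.

    Assumption: weights are the normalised RBF similarities to the anchor,
    a stand-in for the least-squares weights described in the text.
    """
    similarities = [rbf_similarity(anchor, x, sigma) for x in cluster]
    total = sum(similarities)
    weights = [s / total for s in similarities]
    dim = len(anchor)
    return [
        sum(w * x[d] for w, x in zip(weights, cluster))
        for d in range(dim)
    ]


anchor = [0.0, 0.0]
clusters = [
    [[1.0, 0.0], [0.0, 1.0]],  # close to the anchor: a harder negative
    [[4.0, 4.0], [6.0, 6.0]],  # far from the anchor: an easier negative
]
generated = [synthesize_negative(anchor, g, sigma=2.0) for g in clusters]
```

One synthesized negative is produced per cluster, so clusters at different distances from the anchor give negatives of different difficulty. Within a cluster, the kernel weights pull the synthesized point toward the members nearest the anchor.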

Step S104: calculate the positive similarity between image and text in the positive sample pairs, and the negative similarity between image and text in the negative sample pairs and the generated negative sample pairs.

A sample that matches the anchor sample is a positive sample; it forms a positive sample pair with the anchor sample, and the positive similarity between the image and the text in that pair is calculated. A sample that does not match the anchor sample is a negative sample; it forms a negative sample pair with the anchor sample, and the negative similarity between image and text is calculated for both the negative sample pairs and the generated negative sample pairs. The positive and negative similarities are then combined to form a third similarity.
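A minimal sketch of the similarity computation in step S104 follows. The similarity measure is an assumption, since the text does not name one; cosine similarity is used here, and `similarity_set` is a hypothetical helper that returns the positive similarity, the negative similarities, and their combination.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_set(anchor, positive, negatives, generated_negatives):
    """Compute similarities of one anchor to its positive and negatives.

    negatives are the non-matching samples from the batch, and
    generated_negatives are the synthesized ones; both kinds contribute
    negative similarities.
    """
    positive_similarity = cosine_similarity(anchor, positive)
    negative_similarities = [
        cosine_similarity(anchor, x)
        for x in list(negatives) + list(generated_negatives)
    ]
    combined = [positive_similarity] + negative_similarities
    return positive_similarity, negative_similarities, combined


# Image anchor with its matching caption, one in-batch negative caption,
# and one synthesized negative caption (all toy feature vectors).
image_anchor = [1.0, 0.0]
pos, negs, combined = similarity_set(
    image_anchor, [2.0, 0.0], [[0.0, 1.0]], [[-1.0, 0.0]]
)
```

The `combined` list corresponds to the set that merges positive and negative similarities for one anchor direction; running the same function with a text anchor gives the set for the other direction.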

步骤S105，基于正相似度和负相似度计算损失函数，通过损失函数对预训练模型OSCAR进行微调，得到完成训练的OSCAR图文检索模型。Step S105: Calculate a loss function based on positive similarity and negative similarity, fine-tune the pre-trained model OSCAR through the loss function, and obtain the OSCAR image and text retrieval model that has completed training.

本发明实施例基于InfoCMR提出一种全新的损失函数，用于对比不同来源的正负样本，损失函数可以用公式表示：The embodiment of the present invention proposes a brand new loss function based on InfoCMR for comparing positive and negative samples from different sources. The loss function can be expressed by the formula:

其中，v表示图像特征表示，c表示文本特征表示；s^vc+表示锚点样本为图像样本时的正相似度，s^cv+表示锚点样本为文本样本时的正相似度；S^vc表示锚点样本为图像样本时正相似度和负相似度的集合，S^cv表示锚点样本为文本样本时正相似度和负相似度的集合；和/>表示惩罚项；τ是一个超参数；||•||表示集合大小。Among them, v represents the image feature representation, c represents the text feature representation; s ^vc+ represents the positive similarity when the anchor sample is an image sample, s ^cv+ represents the positive similarity when the anchor sample is a text sample; S ^vc represents the anchor sample. is the set of positive similarity and negative similarity when the anchor sample is an image sample, S ^cv represents the set of positive similarity and negative similarity when the anchor sample is a text sample; and/> represents the penalty term; τ is a hyperparameter; ||•|| represents the set size.

To reduce the risk of model overfitting, an additional penalty term is introduced. Z Gaussian noise vectors are randomly sampled from a Gaussian distribution, each with the same dimension as the anchor vector that corresponds to the anchor sample in the embedding space. These Gaussian noise vectors form high-confidence negative sample pairs with every sample in the batch, which helps to smooth the representation space. Note that the Gaussian noise vectors never take part in forming positive sample pairs.
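The noise-negative idea can be sketched as follows. This is only a minimal illustration: the standard normal distribution (mean 0, standard deviation 1), the dot-product similarity, the fixed seed, and all function names are assumptions not stated in the text.

```python
import random

def sample_noise_negatives(anchor_dim, z, seed=0):
    """Draw z standard-normal noise vectors with the anchor's dimensionality."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(anchor_dim)] for _ in range(z)]

def noise_similarities(batch, noise_vectors):
    """Dot-product similarity of every batch embedding with every noise vector.

    The noise vectors are only ever used as negatives, never as positives.
    """
    return [
        [sum(a * b for a, b in zip(sample, noise)) for noise in noise_vectors]
        for sample in batch
    ]

# Hypothetical batch of two 3-dimensional embeddings and Z = 4 noise vectors.
batch = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.9]]
noise = sample_noise_negatives(anchor_dim=3, z=4)
sims = noise_similarities(batch, noise)  # 2 samples x 4 noise negatives
```

Each row of `sims` would then be added to the negative similarities of the corresponding anchor.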

The loss function designed in the present invention integrates information from positive samples, negative samples, and generated negative samples, further narrowing the heterogeneity gap in image-text matching. In addition, an extra penalty term is added to the loss function: randomly sampled Gaussian noise vectors form high-confidence negative sample pairs, which reduces the risk of overfitting, helps to smooth the representation space, and improves the generalization ability of the model.
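Because the formula itself is not reproduced in this text, the sketch below substitutes a generic InfoNCE-style contrastive loss to show how positive, negative, generated-negative, and noise-negative similarities could enter one objective. It is not the patent's formula; the symmetric averaging, the default temperature, and the function names are all assumptions.

```python
import math

def contrastive_term(pos_sim, neg_sims, tau=0.07):
    """Negative log-softmax of the positive similarity against all similarities."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_denominator - logits[0]

def symmetric_loss(pos_vc, negs_vc, pos_cv, negs_cv, tau=0.07):
    """Average of the image-anchored and text-anchored terms."""
    return 0.5 * (
        contrastive_term(pos_vc, negs_vc, tau)
        + contrastive_term(pos_cv, negs_cv, tau)
    )

# The negative lists may mix in-batch, generated, and Gaussian-noise similarities.
easy = symmetric_loss(0.9, [0.1, 0.0, -0.2], 0.9, [0.2, 0.1], tau=0.1)
hard = symmetric_loss(0.9, [0.8, 0.85, 0.7], 0.9, [0.88, 0.8], tau=0.1)
```

In this form, negatives whose similarity approaches that of the positive raise the loss more than easy ones, so `hard` is larger than `easy`.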

The embodiment of the present invention provides a training method for an OSCAR-based image-text retrieval model. The method uses the vision-language pre-trained model OSCAR to extract features from image samples and text samples, and generates negative samples of different difficulties through a negative sample synthesis module, which makes the image-text matching task harder. The loss function is designed from the positive similarity between the image and the text in the positive sample pairs and the negative similarity between the image and the text in the negative sample pairs and the generated negative sample pairs. Training with this new loss function yields the target OSCAR model, which improves the generalization ability of the image-text retrieval model and, in turn, the efficiency and accuracy of image-text retrieval.

Refer to Figure 2, which is a schematic flow chart of an image-text retrieval method, provided by an embodiment of the present invention, that is implemented with an image-text retrieval model trained by the method described above. The method includes:

performing feature extraction on the target text with the text encoder in the image-text retrieval model to obtain a text feature representation;

performing feature extraction on the target image with the image encoder in the image-text retrieval model to obtain an image feature representation;

determining, based on the text feature representation and the image feature representation, an image retrieval result for the target text among the target images, and/or a text retrieval result for the target image among the target texts.
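The retrieval step above can be sketched as a nearest-neighbor ranking over precomputed feature representations. Cosine similarity, the `retrieve` function, and the toy embeddings are assumptions for illustration; in practice the embeddings would come from the trained encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, candidate_embs, top_k=1):
    """Return the indices of the top_k candidates, most similar first."""
    ranked = sorted(
        range(len(candidate_embs)),
        key=lambda i: cosine(query_emb, candidate_embs[i]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical text-to-image retrieval over three candidate images.
text_emb = [0.1, 0.9]
image_embs = [[1.0, 0.0], [0.2, 0.8], [0.7, 0.7]]
ranking = retrieve(text_emb, image_embs, top_k=2)
```

Image-to-text retrieval uses the same function with an image embedding as the query and text embeddings as the candidates.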

The image-text retrieval method provided by the embodiment of the present invention can improve the efficiency and accuracy of image-text retrieval.

Refer to Figure 3, which is a structural block diagram of a training device for an OSCAR-based image-text retrieval model provided by an embodiment of the present invention. The device includes:

a data acquisition module 310, configured to acquire a training set that includes multiple image-text sample pairs;

a feature extraction module 320, configured to input the multiple image-text sample pairs in the training set into the vision-language pre-trained model OSCAR and perform feature extraction to generate image feature representations and text feature representations;

a negative sample synthesis module 330, configured to use each sample in the training set as an anchor sample and, based on the image feature representation and the text feature representation, generate multiple negative samples of different difficulties corresponding to the anchor sample, where each generated negative sample and the anchor sample form a generated negative sample pair;

a similarity calculation module 340, configured to calculate the positive similarity between the image and the text in the positive sample pairs, and the negative similarity between the image and the text in the negative sample pairs and the generated negative sample pairs;

a contrastive loss calculation module 350, configured to calculate a loss function based on the positive similarity and the negative similarity, and to fine-tune the pre-trained model OSCAR with the loss function to obtain the trained OSCAR image-text retrieval model.

Refer to Figure 4, which is a schematic structural diagram of a computer device provided by an embodiment of the present invention. The computer device may include a processor 20, a memory 21, and a bus, and may also include a computer program that is stored in the memory 21 and can run on the processor 20.

The memory 21 includes at least one type of readable storage medium, such as flash memory, a mobile hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 21 may be an internal storage unit of the computer device, such as a mobile hard disk of the computer device. In other embodiments, the memory 21 may be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device. Further, the memory 21 may include both an internal storage unit of the computer device and an external storage device. The memory 21 can be used not only to store application software installed on the computer device and various types of data, but also to temporarily store data that has been output or is to be output.

In some embodiments, the processor 20 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 20 is the control core (Control Unit) of the computer device. It connects the components of the entire electronic device through various interfaces and lines, and performs the functions of the computer device and processes data by running or executing the programs or modules stored in the memory 21 and calling the data stored in the memory 21.

The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to provide connection and communication between the memory 21 and the at least one processor 20, among other components.

Figure 4 shows only a computer device with certain components. Those skilled in the art will understand that the structure shown in Figure 4 does not limit the computer device, which may include fewer or more components than shown, combine certain components, or use a different arrangement of components.

For example, although not shown, the computer device may also include a power supply (such as a battery) that powers the components. Preferably, the power supply is logically connected to the at least one processor 20 through a power management device, so that charging management, discharging management, and power consumption management are carried out through the power management device. The power supply may also include any of one or more DC or AC power sources, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and similar components. The computer device may also include various sensors, a Bluetooth module, a Wi-Fi module, and so on, which are not described further here.

Further, the computer device may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is typically used to establish a communication connection between the computer device and other computer devices.

Optionally, the computer device may also include a user interface, which may be a display or an input unit (such as a keyboard). Optionally, the user interface may also be a standard wired or wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display information processed in the computer device and to present a visual user interface.

It should be understood that the above embodiments are for illustration only, and the scope of the patent application is not limited by this structure.

If the modules/units integrated in the computer device are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, computer memory, or read-only memory (ROM).

The present invention also provides a computer-readable storage medium that stores a computer program; when executed by a processor of an electronic device, the computer program implements the methods described above.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or may be integrated two or more to a unit. The integrated unit can be implemented in hardware, or in hardware together with software functional modules.

It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics.

Therefore, the embodiments should be regarded in every respect as illustrative and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be included in the present invention. Any reference signs in the claims shall not be construed as limiting the claims concerned.

The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.

Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in the system claims may also be implemented by a single unit or device in software or hardware. Words such as "second" are used to denote names and do not indicate any particular order.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or replaced with equivalents without departing from the spirit and scope of those technical solutions.

Claims

1. A training method for an OSCAR-based image and text retrieval model, characterized in that the method includes:

Obtain a training set, where the training set includes a plurality of image-text sample pairs;

Input multiple image-text sample pairs in the training set into the pre-trained model OSCAR for visual language tasks, and perform feature extraction to obtain image feature representation and text feature representation;

Use each sample in the training set as an anchor sample and, based on the image feature representation and the text feature representation, generate multiple negative samples of different difficulties corresponding to the anchor sample; each generated negative sample and the anchor sample form a generated negative sample pair;

Calculate the positive similarity between the image and the text in the positive sample pairs, and the negative similarity between the image and the text in the negative sample pairs and the generated negative sample pairs;

A loss function is calculated based on the positive similarity and negative similarity, and the visual language pre-training model OSCAR is fine-tuned through the loss function to obtain the OSCAR image and text retrieval model that has completed training.

2. The training method according to claim 1, characterized in that using each sample in the training set as an anchor sample and generating, based on the image feature representation and the text feature representation, multiple negative samples of different difficulties corresponding to the anchor sample includes:

Select a sample as the anchor point sample q, where the sample is an image sample or a text sample;

Based on the anchor sample q, global semantic clustering is performed on the samples in the training set to obtain a negative sample cluster set G = {g_1, g_2, ..., g_M}, where g_i = {x_i1, x_i2, ..., x_iN} represents a set of N negative samples with similar semantics, x_ij represents the j-th negative sample in the negative sample set g_i, i takes any integer from 1 to M, and j takes any integer from 1 to N;

Based on the kernel function, the similarity and the corresponding weight between each negative sample and the anchor sample q are calculated, and a weighted average is performed to obtain multiple negative samples of different difficulties.

3. The training method according to claim 2, characterized in that calculating, based on the kernel function, the similarity and the corresponding weight between each negative sample and the anchor sample q, and performing a weighted average to obtain multiple negative samples of different difficulties includes:

Calculate the similarity between each negative sample and the anchor sample based on the Gaussian radial basis function:

Here, k represents the similarity between the anchor sample q and the negative sample x_jn, ||·|| represents the norm distance, and σ is the width parameter;

Calculate the weight W_n corresponding to the similarity between each negative sample and the anchor sample according to the following formula:

J(W) = min ||X W_n||

Here, J(W) is the cost function representing the error in the least squares method, and W is the weight matrix to be optimized; X is the input data matrix, in which each row represents a negative sample and each column represents a feature; W_n is a weight value in the weight matrix W; and ||·|| represents the calculation error;

The generated negative samples are calculated by weighted average:

Here, the result of the weighted average (its symbol is missing from this text) represents the generated negative sample corresponding to the anchor sample.
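The synthesis described in claims 2 and 3 can be sketched roughly as below. The formulas are not reproduced in this text, so the sketch assumes the standard Gaussian radial basis function exp(-||q - x||^2 / (2σ^2)) and, in place of the least-squares weights of the claim, simply normalizes the kernel similarities so they sum to 1. All names are hypothetical.

```python
import math

def rbf_similarity(q, x, sigma=1.0):
    """Gaussian RBF similarity exp(-||q - x||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, x))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def synthesize_negative(anchor, cluster, sigma=1.0):
    """Weighted average of the negatives in one semantic cluster.

    Weights are the RBF similarities normalized to sum to 1 (an assumption;
    the claim derives its weights from a least-squares cost instead).
    """
    sims = [rbf_similarity(anchor, x, sigma) for x in cluster]
    total = sum(sims)
    weights = [s / total for s in sims]
    dim = len(anchor)
    return [sum(w * x[d] for w, x in zip(weights, cluster)) for d in range(dim)]

# Hypothetical anchor q and one cluster g_i containing two negatives.
anchor = [0.0, 0.0]
cluster = [[1.0, 0.0], [0.0, 3.0]]
hard_negative = synthesize_negative(anchor, cluster, sigma=1.0)
```

With a small σ the weights concentrate on the negative closest to the anchor, while a large σ pulls the result toward the plain cluster mean, so varying σ is one possible way to obtain negatives of different difficulty.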

4. The training method according to claim 1, characterized in that the loss function can be expressed as:

Here, v denotes the image feature representation and c denotes the text feature representation; s^vc+ denotes the positive similarity when the anchor sample is an image sample, and s^cv+ denotes the positive similarity when the anchor sample is a text sample; S^vc denotes the set of positive and negative similarities when the anchor sample is an image sample, and S^cv denotes the set of positive and negative similarities when the anchor sample is a text sample; the two penalty terms (whose symbols are missing from this text) are added to the loss; τ is a hyperparameter; and ||·|| denotes the size of a set.

5. The training method according to claim 1, characterized in that inputting the multiple image-text sample pairs in the training set into the pre-trained OSCAR image and text retrieval model and performing feature extraction to generate the image feature representation and the text feature representation includes:

Obtain the image samples in the training set, extract the regional visual features and regional position features of each image sample, and linearly combine the regional visual features and the regional position features to obtain image embeddings, where each image sample includes n object regions;

Obtain the text samples in the training set, use tokenization to divide each text sample into multiple tokens, and obtain the text embedding corresponding to each token based on the OSCAR-base model;

Based on the image embedding and the text embedding, an attention mechanism is used to generate a joint feature representation, and the image feature representation and the text feature representation are generated through average pooling.
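The image branch of claim 5 can be illustrated with a minimal sketch: each region embedding is a linear combination of its visual and position features, and average pooling collapses the per-region vectors into one image feature. The projection matrices, dimensions, and function names are toy assumptions, and the attention step between embedding and pooling is omitted for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def region_embedding(visual, position, w_visual, w_position):
    """Linear combination W_v * visual + W_p * position for one region."""
    v_part = matvec(w_visual, visual)
    p_part = matvec(w_position, position)
    return [a + b for a, b in zip(v_part, p_part)]

def average_pool(embeddings):
    """Mean over the sequence axis, giving one vector per sample."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

# Toy image with n = 2 object regions, 3-d visual and 2-d position features.
w_visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # maps 3-d -> 2-d
w_position = [[0.5, 0.0], [0.0, 0.5]]          # maps 2-d -> 2-d
regions = [([1.0, 2.0, 3.0], [0.0, 2.0]), ([3.0, 0.0, 1.0], [4.0, 0.0])]
image_embs = [region_embedding(v, p, w_visual, w_position) for v, p in regions]
image_feature = average_pool(image_embs)
```

The same `average_pool` call would apply to the per-token text embeddings to obtain the text feature representation.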

6. A method for realizing image and text retrieval using an OSCAR image and text retrieval model, the OSCAR image and text retrieval model being trained by the training method according to any one of claims 1 to 5, characterized in that the method includes:

Obtain the target text and target image to be retrieved;

Perform feature extraction on the target text based on the text encoder in the image-text retrieval model to obtain text feature representation;

Perform feature extraction on the target image based on the image encoder in the image-text retrieval model to obtain image feature representation;

Based on the text feature representation and the image feature representation, an image retrieval result of the target text in the target image is determined, and/or a text retrieval result of the target image in the target text is determined.

7. An image and text retrieval model training device based on OSCAR, characterized in that the device includes:

A data acquisition module, used to acquire a training set, where the training set includes multiple image-text sample pairs;

A feature extraction module, used to input multiple image-text sample pairs in the training set into the pre-training model OSCAR for visual language tasks, and perform feature extraction to obtain image feature representation and text feature representation;

A negative sample synthesis module, configured to use each sample in the training set as an anchor sample, and generate multiple negative samples of different difficulties corresponding to the anchor sample based on the image feature representation and the text feature representation; each generated negative sample and the anchor sample form a generated negative sample pair;

A similarity calculation module, used to calculate the positive similarity between the image and the text in the positive sample pairs, and the negative similarity between the image and the text in the negative sample pairs and the generated negative sample pairs;

A contrastive loss calculation module, used to calculate a loss function based on the positive similarity and the negative similarity, and fine-tune the pre-trained model OSCAR through the loss function to obtain a trained OSCAR image and text retrieval model.

8. A computer device, characterized in that the computer device includes a processor and a memory; the memory stores at least one instruction, and the at least one instruction is used to be executed by the processor to implement the method according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, and the at least one instruction is used to be executed by a processor to implement the method according to any one of claims 1 to 6.