CN110110145A

CN110110145A - Description text generation method and device

Info

Publication number: CN110110145A
Application number: CN201810082485.1A
Authority: CN
Inventors: 杨小汕; 徐常胜
Original assignee: Institute of Automation of Chinese Academy of Science; Tencent Cyber Tianjin Co Ltd
Current assignee: Institute of Automation of Chinese Academy of Science; Tencent Cyber Tianjin Co Ltd
Priority date: 2018-01-29
Filing date: 2018-01-29
Publication date: 2019-08-09
Anticipated expiration: 2038-01-29
Also published as: CN110110145B

Abstract

The application discloses a description text generation method and device, belonging to the field of information processing. The method includes: extracting at least one visual feature vector from a target object, where the target object is a video or a picture; acquiring a semantic feature vector corresponding to each of the visual feature vectors; and processing the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector to obtain the description text of the target object. The method provided by this application offers relatively high description accuracy and flexibility.

Description

Description text generation method and device

Technical field

The invention relates to the field of information processing, and in particular to a description text generation method and device.

Background

A description text generation method uses natural language to generate text that describes video content. Once the description text of a video has been generated, users can quickly retrieve the videos they need through text, and visually impaired people can learn what a video contains through text or speech.

In the related art, when generating the description text of a video, a pre-trained classifier is first used to identify the visual objects in the video (such as objects, scenes, and actions), and a predetermined language template is then used to organize the text corresponding to the identified visual objects, yielding the description text of the video. The language template may be obtained in advance by mining a large amount of text data.

However, because the methods in the related art use fixed language templates to organize the text when generating description texts for different videos, the flexibility and accuracy of their descriptions are relatively low.

Summary of the invention

Embodiments of the present invention provide a description text generation method and device, which can solve the problem that the description text generation methods in the related art have low flexibility and accuracy. The technical solution is as follows:

In one aspect, a description text generation method is provided, the method comprising:

extracting at least one visual feature vector from a target object, where the target object is a video or a picture;

obtaining a semantic feature vector corresponding to each of the visual feature vectors;

processing the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector to obtain the description text of the target object.

In another aspect, a description text generation apparatus is provided, the apparatus comprising:

an extraction module, configured to extract at least one visual feature vector from a target object, where the target object is a video or a picture;

an acquisition module, configured to acquire a semantic feature vector corresponding to each of the visual feature vectors;

a processing module, configured to process the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector to obtain the description text of the target object.

In yet another aspect, a terminal is provided. The terminal includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the description text generation method provided in the above aspect.

In still another aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the description text generation method provided in the above aspect.

The technical solution provided by the embodiments of the present invention has the following beneficial effects:

Embodiments of the present invention provide a description text generation method and device that can extract at least one visual feature vector from a target object, obtain a semantic feature vector corresponding to each visual feature vector, and then generate the description text of the target object based on the at least one visual feature vector and the semantic feature vector of each visual feature vector. Because the semantic feature vector corresponding to each visual feature vector reflects the semantic features of the target object, using it to assist the generation of the description text improves the accuracy and flexibility of the description.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a device to which a description text generation method provided by an embodiment of the present invention is applied;

FIG. 2 is a flowchart of a description text generation method provided by an embodiment of the present invention;

FIG. 3 is an algorithm block diagram of a description text generation method provided by an embodiment of the present invention;

FIG. 4 is a flowchart of another description text generation method provided by an embodiment of the present invention;

FIG. 5 is a flowchart of a method for obtaining a semantic feature vector corresponding to a first visual feature vector provided by an embodiment of the present invention;

FIG. 6 is a flowchart of a method for training a memory model provided by an embodiment of the present invention;

FIG. 7 is an algorithm block diagram for training a memory model provided by an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a description text generation apparatus provided by an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an acquisition module provided by an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of another description text generation apparatus provided by an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of an extraction module provided by an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

In the related art, besides methods based on language templates, methods based on machine translation are also commonly used to describe video content. Machine translation translates a text S written in a source language into a text T written in a target language; the translation process generally involves several subtasks (such as word translation, word alignment, and reordering). Word translation can be achieved by maximizing the conditional probability p(T|S). In recent years, with the rise of deep learning, machine translation methods have widely adopted encoder-decoder models based on recurrent neural networks. In such a model, one recurrent neural network used for encoding converts the input source-language text into a feature vector, and another recurrent neural network used for decoding generates the target-language text from that feature vector. The recurrent neural network used in the encoder-decoder model may be a Long Short-Term Memory (LSTM) network.

A description text generation method based on machine translation can adopt a similar encoder-decoder model comprising an encoder and a decoder. The encoder mainly includes a convolutional neural network and a recurrent neural network: the convolutional neural network processes each frame of the video separately and extracts a feature vector for each frame, after which the recurrent neural network encodes the feature vectors of all frames into one visual feature vector and inputs it to the decoder. The decoder, which is based on a recurrent neural network, decodes the input visual feature vector into a text composed of multiple words, thereby describing the video content.

However, because the methods in the related art can extract only a single visual feature vector from each frame, they cannot effectively mine the relationships between the visual objects in the video, and the resulting descriptions are poor. Moreover, such a method obtains the encoder-decoder model between videos and description texts mainly by training on a large number of labeled training samples (each comprising a video and its corresponding description text). The model therefore depends heavily on labeled training samples, and because of the inherent complexity of video, its performance still needs improvement.

FIG. 1 is a schematic diagram of a device to which a description text generation method provided by an embodiment of the present invention is applied. Referring to FIG. 1, the description text generation method can be applied in a description text generation device. The device may be any one of a smartphone, a computer, a tablet computer, a wearable device, a vehicle-mounted device, or a server; the embodiments of the present invention do not limit the type of the description text generation device.

When the method is applied in a terminal device such as a smartphone or a computer, the terminal device can use it to generate the description text of a video or image selected by the user. When the method is applied in a server, the server can use it to produce text descriptions for a large number of videos or images in a material library (such as a retrieval database), improving the efficiency and accuracy of text-based retrieval of videos or images.

Please refer to FIG. 2, which shows a flowchart of a description text generation method provided by an embodiment of the present invention. This embodiment is illustrated with the method applied to the description text generation device shown in FIG. 1. Referring to FIG. 2, the method may include:

Step 101. Extract at least one visual feature vector from the target object.

In this embodiment of the present invention, the target object may be a video or an image. When extracting features, if the target object is an image, the description text generation device can extract multiple basic visual feature vectors directly from the image; if the target object is a video, the device can extract multiple basic visual feature vectors from each frame of the video.

Further, the description text generation device may encode the extracted basic visual feature vectors into one visual feature vector. Alternatively, the device may obtain at least one predetermined set of attention coefficients and weight the basic visual feature vectors with each set in turn, obtaining at least one visual feature vector, one for each set of attention coefficients.

Step 102. Acquire a semantic feature vector corresponding to each visual feature vector.

Further, the description text generation device may acquire at least one set of sample data associated with each visual feature vector and generate the semantic feature vector based on that sample data. Each set of sample data may include a sample picture and the annotation text corresponding to the sample picture. The annotation text may be attribute annotation text, which indicates the attribute features of the visual objects in the sample picture, or relationship annotation text, which indicates the relationship features between the visual objects in the sample picture. In this embodiment of the present invention, the description text generation device may use a k-nearest neighbor (KNN) algorithm to acquire the at least one set of sample data associated with each visual feature vector.

Step 103. Process the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector to obtain the description text of the target object.

In this embodiment of the present invention, the description text generation device may use a language model to generate the description text of the target object. The language model may be a model based on a recurrent neural network (also called a decoder), and the recurrent neural network may be an LSTM network. The language model may include at least one iteration unit, each of which can be used to generate one word. After the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector are input to the language model, each iteration unit generates one word from the visual feature vector and semantic feature vector it receives, and the words generated by the at least one iteration unit together form the description text of the target object.

It should be noted that a word in the embodiments of the present invention refers to a text unit used to generate the description text, and the type of this unit may differ between languages. For example, if the description text is in Chinese, a word may be a multi-character word or a single Chinese character; if the description text is in an Indo-European language such as English, a word may be a word composed of several letters.

In summary, the embodiments of the present invention provide a description text generation method that can extract at least one visual feature vector from a target object, obtain a semantic feature vector corresponding to each visual feature vector, and then generate the description text of the target object based on the at least one visual feature vector and the semantic feature vector of each visual feature vector. Because the semantic feature vector corresponding to each visual feature vector reflects the semantic features of the target object, using it to assist the generation of the description text improves the accuracy and flexibility of the description.

FIG. 3 is an algorithm block diagram of a description text generation method provided by an embodiment of the present invention. As FIG. 3 shows, the method mainly uses a feature extraction model 01, a memory model 02, and a language model 03. The language model 03 may include at least one iteration unit 031, each of which is used to generate one word. The feature extraction model 01 extracts from the target object at least one visual feature vector in one-to-one correspondence with the at least one iteration unit 031; the memory model 02 obtains the semantic feature vector corresponding to each visual feature vector; and the language model 03 generates the description text of the target object based on the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector.
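As a non-limiting illustration, the interaction of the three models can be sketched in Python as follows. Every name in the sketch (`generate_description`, `extract_visual`, `lookup_semantic`, `iteration_unit`, `initial_state`) is hypothetical and does not appear in the embodiments; the three callables merely stand in for feature extraction model 01, memory model 02, and one iteration unit 031 of language model 03, and the `state` value stands in for the hidden-layer feature vector passed between iteration units.

```python
from typing import Any, Callable, List, Tuple

Vector = List[float]


def generate_description(
    target: Any,
    extract_visual: Callable[[Any, Any], Vector],
    lookup_semantic: Callable[[Vector], Vector],
    iteration_unit: Callable[[Vector, Vector, Any], Tuple[str, Any]],
    num_units: int,
    initial_state: Any,
) -> str:
    """Chain the three models for a fixed number of iteration units.

    All callables are hypothetical stand-ins for models 01, 02 and 03.
    """
    state = initial_state
    words: List[str] = []
    for _ in range(num_units):
        visual = extract_visual(target, state)   # one visual feature vector per unit
        semantic = lookup_semantic(visual)       # its semantic feature vector
        word, state = iteration_unit(visual, semantic, state)  # one word per unit
        words.append(word)
    # Joining with spaces suits English; other languages may join differently.
    return " ".join(words)


# Toy stand-ins so the sketch runs end to end; they contain no real model logic.
demo_text = generate_description(
    target="demo",
    extract_visual=lambda target, state: [float(state)],
    lookup_semantic=lambda visual: [visual[0] + 1.0],
    iteration_unit=lambda visual, semantic, state: (f"w{state}", state + 1),
    num_units=3,
    initial_state=0,
)
print(demo_text)
```

The sketch only fixes the order in which the models are consulted for each word; the internals of each model are left to the steps described in the embodiments.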

FIG. 4 is a flowchart of another description text generation method provided by an embodiment of the present invention. The method can be applied in the implementation environment shown in FIG. 1 and can be implemented based on the algorithm shown in FIG. 3. Referring to FIG. 4, the description text generation method may specifically include:

Step 201. Use a feature extraction model to extract at least one basic visual feature vector of the target object.

When the target object is an image, the at least one basic visual feature vector may be feature vectors extracted from different local positions in the image. When the target object is a video comprising multiple frames, the at least one basic visual feature vector may include the feature vectors extracted from each frame, and the feature vectors extracted from a frame may include visual feature vectors extracted from different local positions in that frame.

In visual tasks such as image classification or object detection, the high-level features of a convolutional neural network effectively capture object-related semantic information, so in this embodiment of the present invention a model based on a convolutional neural network can be used as the feature extraction model. For example, the feature extraction model may use a pre-trained residual network to extract the at least one basic visual feature vector of the target object; that is, the target object is input to the residual network, and the at least one feature vector output by the last convolutional layer of the residual network is taken as the at least one basic visual feature vector of the target object.

For example, as shown in FIG. 3, assuming the target object is a video comprising several frames, the M basic visual feature vectors extracted from the video (M is a positive integer, for example a positive integer greater than 1) can be expressed as {x₁, x₂, ..., x_M}, and these M basic visual feature vectors may include feature vectors extracted from different positions in each frame of the video.

Step 202. Determine, for each of the at least one basic visual feature vector, its attention coefficient with respect to each iteration unit.

Referring to FIG. 3, the language model in the video description algorithm may include at least one iteration unit. To improve the accuracy of the word generated by each iteration unit, the feature extraction model may first determine the attention coefficient of each basic visual feature vector with respect to each iteration unit, and then generate, from the at least one basic visual feature vector and the corresponding attention coefficients, a visual feature vector for each iteration unit. Accordingly, when the language model generates the description text, each visual feature vector can be input to its corresponding iteration unit so that the unit generates the corresponding word.

The attention coefficient of a basic visual feature vector with respect to an iteration unit indicates how important that basic visual feature vector is when the iteration unit generates its word; the larger the attention coefficient, the higher the importance.

In this embodiment of the present invention, the language model may be a model based on a recurrent neural network. Suppose the language model includes T iteration units and M basic visual feature vectors are extracted from the target object (T and M are both positive integers). To determine the attention coefficient of each basic visual feature vector with respect to the t-th iteration unit, the feature vector h_{t-1} of the hidden layer of the (t-1)-th iteration unit is first obtained, where t is a positive integer not greater than T. For the first iteration unit, the feature extraction model can directly use a preset initial feature vector h₀, which may be a zero vector. The attention coefficient of each basic visual feature vector with respect to the t-th iteration unit is then determined from h_{t-1}. The attention coefficient of the m-th basic visual feature vector x_m with respect to the t-th iteration unit, denoted here as $\alpha_m^t$, can be expressed as:

$$\alpha_m^t = S\big(f_{att}(x_m, h_{t-1})\big)$$

where f_att is a preset linear transformation function (for example, f_att may be a multilayer perceptron), S is a preset normalization function (for example, S may be the Softmax function), and m is a positive integer not greater than M. The larger this attention coefficient, the more important the m-th basic visual feature vector x_m is when the t-th iteration unit generates its word.

The feature extraction model thus determines T sets of attention coefficients, one for each of the T iteration units, where each set may include M attention coefficients in one-to-one correspondence with the M basic visual feature vectors.
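As a non-limiting illustration, the computation of one set of attention coefficients can be sketched as follows, assuming that S is the Softmax function applied across the M scores. The helper names are hypothetical, and `dot_score` is only a simple stand-in for f_att chosen so that the sketch runs; the embodiments suggest a multilayer perceptron instead.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_coefficients(
    basic_vectors: Sequence[Vector],
    h_prev: Vector,
    f_att: Callable[[Vector, Vector], float],
) -> List[float]:
    """Return the M coefficients for one iteration unit, given h_{t-1}."""
    scores = [f_att(x_m, h_prev) for x_m in basic_vectors]
    return softmax(scores)


def dot_score(x_m: Vector, h_prev: Vector) -> float:
    """Hypothetical stand-in for f_att (assumption: a plain dot product)."""
    return sum(a * b for a, b in zip(x_m, h_prev))


basic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# First iteration unit: h_0 is the zero vector, so every score is equal.
first_unit = attention_coefficients(basic, [0.0, 0.0], dot_score)
# A later unit with a non-zero hidden-layer vector favours matching features.
later_unit = attention_coefficients(basic, [1.0, 0.0], dot_score)
```

With this stand-in scoring function and a zero initial vector, the first iteration unit weights all basic visual feature vectors equally; a learned f_att would generally behave differently.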

Step 203. Obtain at least one visual feature vector based on the at least one basic visual feature vector and the attention coefficient of each basic visual feature vector with respect to each iteration unit.

In this embodiment of the present invention, for any iteration unit of the language model, the at least one basic visual feature vector can be weighted and summed using the attention coefficients of the basic visual feature vectors with respect to that iteration unit, yielding the visual feature vector corresponding to that iteration unit. The visual feature vector V_t corresponding to the t-th iteration unit may satisfy:

$$V_t = \sum_{m=1}^{M} \alpha_m^t \, x_m$$

where $\alpha_m^t$ denotes the attention coefficient of the m-th basic visual feature vector x_m with respect to the t-th iteration unit.

Correspondingly, the at least one visual feature vector extracted by the feature extraction model, in one-to-one correspondence with the at least one iteration unit, may be {V₁, V₂, ..., V_T}.
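As a non-limiting illustration, the weighted summation of step 203 can be sketched as follows. The function name `weighted_visual_vector` and the toy values are hypothetical; the coefficients are assumed to have been computed beforehand for the iteration unit in question.

```python
from typing import List, Sequence


def weighted_visual_vector(
    basic_vectors: Sequence[Sequence[float]],
    coefficients: Sequence[float],
) -> List[float]:
    """Weighted sum of the basic visual feature vectors for one iteration unit."""
    if len(basic_vectors) != len(coefficients):
        raise ValueError("one attention coefficient is needed per basic vector")
    dim = len(basic_vectors[0])
    return [
        sum(a * x[d] for a, x in zip(coefficients, basic_vectors))
        for d in range(dim)
    ]


# Two basic visual feature vectors and one set of attention coefficients.
basic = [[1.0, 0.0], [0.0, 1.0]]
v_t = weighted_visual_vector(basic, [0.25, 0.75])
print(v_t)
```

Repeating the call with each of the T sets of attention coefficients would give one visual feature vector per iteration unit.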

In this embodiment of the present invention, by adjusting the attention coefficient of each basic visual feature vector with respect to each iteration unit, the feature extraction model based on a convolutional neural network can adaptively extract from the target object the visual feature vectors that reflect its most important visual features.

Step 204. Acquire at least one set of sample data associated with each visual feature vector.

In this embodiment of the present invention, a k-nearest neighbor (KNN) algorithm may be used to acquire, from a preset sample database, at least one set of sample data associated with each visual feature vector. Each set of sample data may include a sample picture and the annotation text corresponding to the sample picture; the annotation text may be manually annotated attribute annotation text or relationship annotation text. Attribute annotation text describes the semantic attributes of the visual objects in a picture (such as objects or the behavior of objects), and relationship annotation text describes the relationships between the visual objects in a picture.

Optionally, to acquire the at least one set of sample data associated with a first visual feature vector, the memory model may first be used to extract a reference feature vector for each set of sample data in the sample database; for example, the picture processing model in the memory model may be used to obtain the reference feature vector of the sample picture in each set of sample data. The vector distance between the first visual feature vector and the reference feature vector of each set of sample data is then calculated, giving multiple vector distances. Finally, the sets of sample data whose vector distance is not greater than a preset distance threshold are taken as the sample data associated with the first visual feature vector.

The first visual feature vector may be any one of the at least one visual feature vector. The preset distance threshold may be the mean of the multiple vector distances. Alternatively, the vector distances may be sorted in ascending order and the K-th distance in the sorted list taken as the preset distance threshold; in that case, the sets of sample data whose distance is not greater than the threshold are the K sets of sample data in the sample database that are nearest to the first visual feature vector. The vector distance between two vectors may be the Euclidean distance between them.
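As a non-limiting illustration, the selection of associated sample data with either choice of threshold can be sketched as follows. The function names and the toy reference vectors are hypothetical; the sketch assumes the reference feature vectors have already been extracted and returns the indices of the selected sets of sample data.

```python
import math
from typing import List, Optional, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def associated_samples(
    query: Sequence[float],
    references: Sequence[Sequence[float]],
    k: Optional[int] = None,
) -> List[int]:
    """Indices of the samples whose distance does not exceed the threshold.

    With k=None the threshold is the mean distance; otherwise it is the
    k-th smallest distance, which selects the k nearest samples.
    """
    distances = [euclidean(query, ref) for ref in references]
    if k is None:
        threshold = sum(distances) / len(distances)
    else:
        if not 1 <= k <= len(distances):
            raise ValueError("k must be between 1 and the number of samples")
        threshold = sorted(distances)[k - 1]
    return [i for i, d in enumerate(distances) if d <= threshold]


# Toy reference feature vectors at distances 1, 3, 5 and 2 from the query.
refs = [[1.0, 0.0], [0.0, 3.0], [5.0, 0.0], [0.0, 2.0]]
by_mean = associated_samples([0.0, 0.0], refs)        # threshold is the mean, 2.75
by_rank = associated_samples([0.0, 0.0], refs, k=3)   # threshold is the 3rd distance
```

Because the comparison is "not greater than", samples tied with the K-th distance are all kept, so the rank-based variant can return more than K indices when distances coincide.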

It should be noted that the sample database used to acquire the sample data in the embodiment of the present invention may be Visual Genome, a manually annotated visual knowledge graph dataset. The dataset contains 100,000 pictures, 5 million picture region descriptions, 1 million visual question-answer pairs, 3 million objects, 2 million attribute annotations for the visual objects in the 100,000 pictures, and 2 million relationship annotations for those visual objects. The method provided by the embodiment of the present invention mainly uses the pictures in the dataset that have attribute annotations and the pictures that have relationship annotations.

In addition, although the method provided by the embodiment of the present invention also requires pre-annotated sample data, the annotations in that sample data are the attribute annotation text or relationship annotation text of sample pictures, whereas the methods in the related art require description text of videos as annotations. Compared with the methods in the related art, the sample data required by the method of the present application is therefore easier to obtain, and more databases are available.

Step 205: Use a memory model to process the at least one set of sample data associated with each visual feature vector, to obtain the semantic feature vector corresponding to each visual feature vector.

In the embodiment of the present invention, the memory model may be a pre-trained model based on a convolutional neural network, and may include a picture processing model F_v and a text processing model F_s. When the memory model is used to process the at least one set of sample data associated with any visual feature vector, the picture processing model F_v may be used to extract the visual feature vector of the sample picture in each set of sample data, and the text processing model F_s may be used to extract the semantic feature vector of the annotation text in each set of sample data. The semantic feature vector corresponding to that visual feature vector is then obtained from the visual feature vectors of the sample pictures and the semantic feature vectors of the annotation texts.

Because the annotation text in the at least one set of sample data can indicate the attribute features of the visual objects in the sample picture or the relationship features between those visual objects, and because the at least one set of sample data is associated with a visual feature vector of the target object, the sample data associated with each visual feature vector can help extract the attribute features or relationship features of the target object while the description text is being generated, which can effectively improve the accuracy of the generated description text. In other words, the memory model can adaptively select attribute features and relationship features related to the target object to assist word generation.

FIG. 5 is a flowchart of a method for obtaining the semantic feature vector corresponding to a first visual feature vector according to an embodiment of the present invention. Referring to FIG. 5, the process of using the memory model to process the at least one set of sample data associated with the first visual feature vector, to obtain the semantic feature vector corresponding to the first visual feature vector, may specifically include:

Step 2051: Use the picture processing model in the memory model to process the sample picture in each set of sample data, to obtain the visual feature vector of the sample picture in each set of sample data.

The picture processing model F_v may include a multi-layer convolutional network and a fully connected network. Assuming the sample data associated with the first visual feature vector comprises K sets, processing the sample pictures of the K sets in turn with F_v yields K sample-picture visual feature vectors. If the visual feature vector obtained by processing the sample picture of the i-th set with F_v is denoted p_i, the K vectors can be written as the set {p_i}, i = 1, ..., K.

Step 2052: Use the text processing model in the memory model to process the annotation text in each set of sample data, to obtain the semantic feature vector of the annotation text in each set of sample data.

The text processing model F_s may include a word vector model (for example, a word2vec model) and a pooling layer. During processing, the word vector model may first be used to obtain the semantic vector of each word in the annotation text of each set of sample data, yielding at least one semantic vector; the pooling layer then performs a pooling operation on the at least one semantic vector to obtain the semantic feature vector of the annotation text in that set of sample data.

Processing the annotation texts of the K sets of sample data in turn with F_s yields K annotation-text semantic feature vectors. If the semantic feature vector obtained by processing the annotation text of the i-th set with F_s is denoted q_i, the K vectors can be written as the set {q_i}, i = 1, ..., K.
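The word-vector-plus-pooling structure of F_s can be sketched as below. The text does not say which pooling operation is used, so mean pooling is assumed as the default and max pooling is offered as an alternative. The `TOY_EMBEDDINGS` table and the function name `text_feature` are hypothetical stand-ins for a trained word vector model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a real system would load a
# trained word vector model instead of this toy table.
TOY_EMBEDDINGS = {
    "red": np.array([1.0, 0.0, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
    "street": np.array([0.0, 0.0, 1.0]),
}

def text_feature(words, embeddings=TOY_EMBEDDINGS, pooling="mean"):
    """Map the words of one annotation text to a single feature vector q_i."""
    vectors = np.stack([embeddings[w] for w in words])  # one row per word
    if pooling == "mean":
        return vectors.mean(axis=0)   # assumed default pooling
    if pooling == "max":
        return vectors.max(axis=0)    # element-wise maximum over words
    raise ValueError(f"unsupported pooling: {pooling}")
```

Either pooling choice produces a fixed-length vector regardless of how many words the annotation text contains, which is what allows the K value vectors to be summed later.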

In addition, in the embodiment of the present invention, as shown in FIG. 3, the visual feature vectors {p_i}, i = 1, ..., K, of the K sample pictures corresponding to each visual feature vector may also be called key vectors; correspondingly, the semantic feature vectors {q_i}, i = 1, ..., K, of the K annotation texts may be the K value vectors corresponding to those K key vectors.

Step 2053: Determine the weight of the annotation text in each set of sample data according to the visual feature vector of the sample picture in each set of sample data.

The weight of the annotation text in each set of sample data is positively correlated with the magnitude of the visual feature vector of the sample picture; that is, the larger the visual feature vector of a sample picture, the greater the weight of the annotation text corresponding to that sample picture. For the K sets of sample data associated with the first visual feature vector V_t, the weight c_i of the annotation text in the i-th set may satisfy:

where V_t^T denotes the transpose of V_t, p_i is the visual feature vector of the sample picture in the i-th set of sample data, p_j is the visual feature vector of the sample picture in the j-th set of sample data, and i and j are both positive integers not greater than K.
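The formula for c_i is not reproduced in this text. One form consistent with the symbols defined here (the transpose of V_t, the vectors p_i and p_j, and indices running up to K) is a softmax over inner products. This is an assumption for illustration, not the formula as filed:

```latex
c_i = \frac{\exp\left(V_t^{\top} p_i\right)}{\sum_{j=1}^{K} \exp\left(V_t^{\top} p_j\right)}, \qquad i = 1, \dots, K
```

Under this assumed form the weights are positive and sum to 1, and a sample picture whose feature vector has a larger inner product with V_t receives a larger weight.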

Step 2054: Based on the weight of the annotation text in each set of sample data, compute the weighted sum of the semantic feature vectors of the annotation texts in the at least one set of sample data, to obtain the semantic feature vector corresponding to the first visual feature vector.

Weighting and summing the semantic feature vectors {q_i}, i = 1, ..., K, of the K annotation texts with the K weights determined in step 2053 gives the semantic feature vector R_t corresponding to the first visual feature vector V_t, which can be expressed as $R_t = \sum_{i=1}^{K} c_i q_i$.
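Steps 2053 and 2054 together amount to reading from a key-value memory, which can be sketched as follows. The weighted sum follows the text directly. The weight computation is an assumption: the text does not give the formula for c_i, so a softmax over the inner products of V_t with each key vector is used here. The function name `memory_read` is hypothetical.

```python
import numpy as np

def memory_read(v_t, keys, values):
    """Return (R_t, weights) for one visual feature vector.

    v_t:    shape (D,), the first visual feature vector V_t.
    keys:   shape (K, D), sample-picture feature vectors p_i.
    values: shape (K, E), annotation-text feature vectors q_i.
    """
    scores = keys @ v_t                    # inner products V_t^T p_i
    scores = scores - scores.max()         # shift for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()  # assumed softmax weights c_i
    r_t = weights @ values                 # weighted sum of the q_i
    return r_t, weights
```

Subtracting the maximum score before exponentiation does not change the softmax result; it only avoids overflow for large inner products.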

Step 206: Use each iteration unit of at least one iteration unit in the language model, in turn, to process the corresponding visual feature vector and semantic feature vector, to obtain at least one word.

The language model may be a model based on a recurrent neural network, for example a model based on an LSTM network. As can be seen from FIG. 3, the language model may include at least one iteration unit, and each iteration unit may be used to generate one word. When generating the description text, each visual feature vector and its corresponding semantic feature vector may be input to the corresponding iteration unit, which generates a word from the input feature vectors.

As shown in FIG. 3, each iteration unit 031 may include a first linear processing unit L1 and a second linear processing unit L2. Each visual feature vector may be input to the first linear processing unit L1 of the corresponding iteration unit, and the semantic feature vector corresponding to each visual feature vector may be input to the second linear processing unit L2 of the corresponding iteration unit.

The first linear processing unit L1 in the first iteration unit 031 may generate an output vector based on the input visual feature vector V_1, a preset initial feature vector h_0, and a preset initialization word (for example, "start"), and may pass that output vector to the second linear processing unit L2 and to the first linear processing unit L1 of the next iteration unit. The first linear processing unit L1 in each remaining iteration unit 031 may generate an output vector based on the input visual feature vector, the output vector of the first linear processing unit L1 in the previous iteration unit (that is, the hidden-layer feature vector of the previous iteration unit), and the word generated by the previous iteration unit, and may likewise pass that output vector to the second linear processing unit L2 and to the first linear processing unit L1 of the next iteration unit.

The second linear processing unit L2 in each iteration unit may generate a word based on the input semantic feature vector and the output vector of the first linear processing unit L1, and may pass that word to the first linear processing unit L1 of the next iteration unit. After each second linear processing unit L2 linearly processes its input vectors, a preset normalization function S is applied to the result before the word is generated. The preset normalization function S may be the Softmax function. In addition, the word generated by the last iteration unit from its input feature vectors may be a preset terminator word (for example, "end"); when the description text generation device detects the terminator word, it can determine that all the words of the description text have been generated.

For example, the first linear processing unit L1 in the second iteration unit may generate an output vector based on the input visual feature vector V_2, the output vector of the first linear processing unit L1 in the first iteration unit, and the word "a" generated by the first iteration unit, and may pass that output vector to the second linear processing unit L2 and to the first linear processing unit L1 in the third iteration unit. The second linear processing unit L2 in the second iteration unit may generate the word "person" based on the input semantic feature vector R_2 and the output vector of the first linear processing unit L1, and may pass that word to the first linear processing unit L1 in the third iteration unit.
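The data flow between L1, L2 and the normalization function S can be sketched as a greedy decoding loop. This is a simplified illustration under several assumptions: a single tanh layer stands in for the LSTM-based first unit, the most probable word is chosen at each step, and the names `decode`, `params`, `embed`, `W1` and `W2` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(visual, semantic, params, vocab, start="<start>", end="<end>"):
    """Generate words with one iteration unit per (V_t, R_t) pair.

    params["embed"]: (len(vocab), E) word embeddings
    params["W1"]:    (H, Dv + H + E) weights of the first unit L1
    params["W2"]:    (len(vocab), Dr + H) weights of the second unit L2
    """
    embed, W1, W2 = params["embed"], params["W1"], params["W2"]
    h = np.zeros(W1.shape[0])      # preset initial feature vector h_0
    prev = vocab.index(start)      # preset initialization word
    words = []
    for v_t, r_t in zip(visual, semantic):
        # L1: combines V_t, the previous hidden vector and the previous word.
        h = np.tanh(W1 @ np.concatenate([v_t, h, embed[prev]]))
        # L2 followed by the normalization function S (softmax).
        probs = softmax(W2 @ np.concatenate([r_t, h]))
        prev = int(np.argmax(probs))   # greedy choice, an assumption
        if vocab[prev] == end:         # terminator word stops generation
            break
        words.append(vocab[prev])
    return words
```

The description text is then the concatenation of the returned words. A trained model would supply `W1`, `W2` and `embed`; the sketch only shows which inputs reach which unit.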

Step 207: Use the text composed of the at least one word as the description text of the target object.

Finally, the language model may combine the words generated by the iteration units into text, which serves as the description text of the target object. For example, assuming the target object is a video and the words generated in turn by the iteration units of the language model are "a", "person", "is" and "running", the description text of the video generated by the language model may be "a person is running".

It should be noted that each model used in the description text generation method provided by the embodiment of the present invention may be implemented with a deep learning framework such as PyTorch, Caffe or TensorFlow.

In summary, the embodiment of the present invention provides a description text generation method. The method may acquire, from a sample database, at least one set of sample data associated with each visual feature vector of a target object, and may generate the description text of the target object based on the at least one visual feature vector of the target object and the semantic feature vectors obtained from the at least one set of sample data. Because each such set of sample data is pre-annotated data associated with a visual feature vector of the target object, the sample data can help extract the features of the target object, which effectively improves the accuracy and flexibility of the description. Moreover, because the annotation text in each set of sample data associated with a visual feature vector may be attribute annotation text or relationship annotation text, the method can make effective use of the attribute features and relationship features of the visual objects in the sample pictures to guide generation of the description text. The method takes into account the semantic features and attribute features of the visual objects in the target object as well as the relationship features between those visual objects, and can therefore effectively improve the accuracy of the description.

As described above, the memory model used in the description text generation method provided by the embodiment of the present invention is a pre-trained model. FIG. 6 is a flowchart of a method for training the memory model according to an embodiment of the present invention. Referring to FIG. 6, the training method may include:

Step 301: Acquire at least one set of training data.

In the embodiment of the present invention, the at least one set of training data may be acquired from a preset sample database. The sample database used to acquire the training data and the sample database used to acquire the sample data may be the same database or different databases, which is not limited in the embodiment of the present invention. Each acquired set of training data may likewise include a training picture and the training annotation text corresponding to that training picture, and each set of training data may also be called a training sample pair. The training annotation text may include attribute annotation text or relationship annotation text.

It should be noted that, to ensure the quality of the trained memory model, the at least one set of training data should be selected so that its training annotation texts include, as far as possible, both attribute annotation text and relationship annotation text. For example, if N sets of training data are to be acquired (N being a positive integer), N/2 of the selected sets may have attribute annotation text as their training annotation text, and the remaining N/2 sets may have relationship annotation text as their training annotation text.
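The balanced selection in the example above can be sketched as follows. The function name `select_balanced` and the representation of each training sample pair as a (picture, annotation text) tuple are assumptions for illustration; when N is odd, the extra set is taken from the relationship-annotated pool, which is also an arbitrary choice.

```python
import random

def select_balanced(attribute_pairs, relation_pairs, n, seed=0):
    """Pick n training sample pairs, half with attribute annotation text
    and the rest with relationship annotation text."""
    rng = random.Random(seed)                  # fixed seed for repeatability
    half = n // 2
    chosen = rng.sample(attribute_pairs, half)
    chosen += rng.sample(relation_pairs, n - half)
    rng.shuffle(chosen)                        # mix the two kinds together
    return chosen
```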

Step 302: Use the picture processing model to process the training picture in each set of training data, to obtain the visual feature vector of the training picture in each set of training data.

FIG. 7 is an algorithm framework diagram for training the memory model according to an embodiment of the present invention. Referring to FIG. 7, the picture processing model may be a model based on a convolutional neural network; for example, it may include a multi-layer convolutional network and a fully connected network. If N sets of training data were acquired in step 301, the feature vector obtained by processing the training picture v_n of the n-th set with the picture processing model F_v can be denoted F_v(v_n), where n is a positive integer not greater than N.

Step 303: Use the text processing model to process the training annotation text in each set of training data, to obtain the semantic feature vector of the training annotation text in each set of training data.

The text processing model may include a pre-trained word vector model (for example, a word2vec model) and a pooling layer. When the text processing model processes a training annotation text, the word vector model may first be used to obtain the semantic vector of each word in the training annotation text, yielding at least one semantic vector; the pooling layer then performs a pooling operation on the at least one semantic vector to obtain the semantic feature vector of the training annotation text in that set of training data. The feature vector obtained by processing the training annotation text s_n of the n-th set with the text processing model F_s can be denoted F_s(s_n).

For example, as shown in FIG. 7, assume the training annotation text of a training picture includes six words such as "cabin", "window" and "trees". When the text processing model processes this training annotation text, the word vector model first obtains the semantic vector of each word, giving six semantic vectors; the pooling layer then pools the six semantic vectors to obtain the semantic feature vector of the training annotation text.

Step 304: Construct a loss function based on the visual feature vectors of the training pictures and the semantic feature vectors of the training annotation texts.

When constructing the loss function, the distance between the visual feature vector of the training picture and the semantic feature vector of the training annotation text may first be calculated for each set of training data, giving N distances. The distance d_n between the feature vector F_v(v_n) of the training picture v_n in the n-th set of training data and the feature vector F_s(s_n) of the training annotation text s_n can be expressed as d_n = ||F_v(v_n) - F_s(s_n)||.

Further, the loss function may be constructed from the N distances, and the loss function L may satisfy:

In Formula (1) above, τ is a preset hyperparameter; max(τ - d_n, 0) denotes the larger of (τ - d_n) and 0; w denotes the parameters of the memory model; Ω(w) denotes the two-norm of w (also called L2 regularization); λ is the weight decay factor; and l_n is the pairing label of the n-th set of training data, taking the value 0 or 1. A value of l_n = 1 indicates that the training picture in the n-th set is related to its training annotation text, that is, the training annotation text was written for that training picture. A value of l_n = 0 indicates that the training picture in the n-th set is unrelated to its training annotation text, that is, the training picture and the training annotation text form a randomly generated training sample pair.
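Formula (1) itself is not reproduced in this text. A commonly used contrastive loss that matches every symbol defined here (l_n, d_n, τ, λ and Ω(w)) is shown below. The squared terms and the 1/(2N) scaling are assumptions for illustration, not the formula as filed:

```latex
L = \frac{1}{2N} \sum_{n=1}^{N} \Big[ l_n \, d_n^{2} + (1 - l_n) \, \max(\tau - d_n,\, 0)^{2} \Big] + \lambda \, \Omega(w)
```

Under this form, matched pairs (l_n = 1) are penalized for being far apart, while random pairs (l_n = 0) are penalized only when their distance falls below the margin τ.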

It should be noted that, in the embodiment of the present invention, the loss function may also be called a contrastive loss function, a contrastive constraint function, or the like, which is not limited in the embodiment of the present invention.
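A contrastive loss over the N distances can be sketched numerically as follows. The exact formula is not given in this text, so the sketch assumes one common form: squared distance for matched pairs, squared hinge on the margin τ for random pairs, averaged with a factor of 1/(2N), plus λ times the two-norm of w. The function name `contrastive_loss` and its parameters are hypothetical.

```python
import numpy as np

def contrastive_loss(img_feats, txt_feats, labels, tau=1.0, lam=0.0, w=None):
    """Assumed contrastive loss over N training sample pairs.

    img_feats: (N, D) vectors F_v(v_n); txt_feats: (N, D) vectors F_s(s_n).
    labels:    (N,) pairing labels l_n, each 0 or 1.
    w:         optional flat parameter vector for the regularization term.
    """
    img_feats = np.asarray(img_feats, dtype=float)
    txt_feats = np.asarray(txt_feats, dtype=float)
    labels = np.asarray(labels, dtype=float)
    d = np.linalg.norm(img_feats - txt_feats, axis=1)       # distances d_n
    matched = labels * d ** 2                               # pull matched pairs together
    mismatched = (1.0 - labels) * np.maximum(tau - d, 0.0) ** 2  # push random pairs apart
    data_term = (matched + mismatched).sum() / (2 * len(d))
    reg = lam * np.linalg.norm(w) if w is not None else 0.0  # lambda * Omega(w)
    return data_term + reg
```

In a deep learning framework the same expression would be written with differentiable tensor operations so that the parameters of both models can be updated from its gradient.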

Step 305: Train the memory model with the loss function, to obtain the picture processing model and the text processing model.

Training the memory model with the loss function of Formula (1) regularizes the parameters w of the memory model and yields the picture processing model and the text processing model. The training process can be expressed as $w^{*} = \arg\min_{w} L$, that is, finding the value of the variable w that minimizes the loss function. The trained memory model can effectively fit the at least one set of training data and learns the vector transformations that map pictures and text into a semantic feature space shared by both. During training, the parameters w of the memory model may be updated by backpropagation until the loss function L converges.

In summary, the embodiment of the present invention provides a method for training a memory model. During generation of the description text, the memory model trained by this method can adaptively select attribute features and relationship features related to the target object to assist word generation.

It should be noted that the order of the steps of the description text generation method and of the memory model training method provided by the embodiment of the present invention may be adjusted as appropriate, and steps may be added or removed as circumstances require. For example, step 2052 may be performed at the same time as step 2051, and step 303 may be performed at the same time as step 302. Any variation that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention, and is therefore not described further.
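Running step 2051 and step 2052 at the same time can be sketched with two worker threads. This is only one possible arrangement: the function name `extract_keys_and_values` is hypothetical, and `picture_model` and `text_model` are assumed to be independent callables that stand in for F_v and F_s.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_keys_and_values(samples, picture_model, text_model):
    """Compute key vectors and value vectors concurrently.

    samples: list of (sample picture, annotation text) pairs.
    Returns (keys, values), each in the same order as `samples`.
    """
    pictures = [picture for picture, _ in samples]
    texts = [text for _, text in samples]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Step 2051 and step 2052 do not depend on each other's output.
        keys_future = pool.submit(lambda: [picture_model(p) for p in pictures])
        values_future = pool.submit(lambda: [text_model(t) for t in texts])
        return keys_future.result(), values_future.result()
```

The two steps can overlap because neither reads the other's result; step 2053 is the first step that needs the key vectors.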

FIG. 8 is a schematic structural diagram of a description text generation apparatus according to an embodiment of the present invention. The apparatus may be configured in the description text generation device shown in FIG. 1. Referring to FIG. 8, the apparatus may include:

An extraction module 401, configured to extract at least one visual feature vector from a target object, where the target object is a video or a picture.

An acquisition module 402, configured to acquire the semantic feature vector corresponding to each visual feature vector.

A processing module 403, configured to process the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector, to obtain the description text of the target object.
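The way the three modules hand data to one another can be sketched as a small composition. The class name `DescriptionTextGenerator` and the callable-based interface are assumptions for illustration; the actual modules may be implemented in any form.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class DescriptionTextGenerator:
    extract: Callable[[Any], Sequence[Any]]                  # extraction module 401
    acquire: Callable[[Any], Any]                            # acquisition module 402
    process: Callable[[Sequence[Any], Sequence[Any]], str]   # processing module 403

    def __call__(self, target: Any) -> str:
        visual = self.extract(target)                 # visual feature vectors
        semantic = [self.acquire(v) for v in visual]  # one semantic vector each
        return self.process(visual, semantic)         # description text
```

Each visual feature vector is paired with exactly one semantic feature vector before the processing module sees them, matching the one-to-one correspondence described for modules 402 and 403.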

Optionally, referring to FIG. 9, the acquisition module 402 may include:

An acquisition submodule 4021, configured to acquire at least one set of sample data associated with each visual feature vector, where each set of sample data includes a sample picture and the annotation text corresponding to that sample picture, and the annotation text includes attribute annotation text or relationship annotation text.

A processing submodule 4022, configured to use a memory model to process the at least one set of sample data associated with each visual feature vector, to obtain the semantic feature vector corresponding to each visual feature vector.

Optionally, the memory model may include a picture processing model and a text processing model. Referring to FIG. 10, the apparatus may further include:

A data acquisition module 404, configured to acquire at least one set of training data, where each set of training data includes a training picture and the training annotation text corresponding to that training picture.

A picture processing module 405, configured to use the picture processing model to process the training picture in each set of training data, to obtain the visual feature vector of the training picture in each set of training data.

A text processing module 406, configured to use the text processing model to process the training annotation text in each set of training data, to obtain the semantic feature vector of the training annotation text in each set of training data.

A construction module 407, configured to construct a loss function based on the visual feature vectors of the training pictures and the semantic feature vectors of the training annotation texts.

A training module 408, configured to train the memory model with the loss function, to obtain the picture processing model and the text processing model.

Optionally, the number of sets of training data acquired by the data acquisition module 404 is N, where N is a positive integer, and the construction module 407 may be configured to:

Calculate, for each set of training data, the distance between the visual feature vector of the training picture and the semantic feature vector of the training annotation text, to obtain N distances, where the distance d_n between the visual feature vector F_v(v_n) of the training picture v_n in the n-th set of training data and the semantic feature vector F_s(s_n) of the training annotation text s_n satisfies d_n = ||F_v(v_n) - F_s(s_n)||, and n is a positive integer not greater than N;

Construct a loss function from the N distances, where the loss function L satisfies:

where l_n is the pairing label of the n-th set of training data and takes the value 0 or 1, τ is a preset hyperparameter, max(τ - d_n, 0) denotes the larger of (τ - d_n) and 0, w denotes the parameters of the memory model, Ω(w) denotes the two-norm of w, and λ is the weight decay factor.

Optionally, the memory model includes a picture processing model and a text processing model, and the process by which the processing submodule 4022 uses the memory model to process the at least one set of sample data associated with a first visual feature vector, to obtain the semantic feature vector corresponding to the first visual feature vector, may include:

Using the picture processing model to process the sample picture in each set of sample data, to obtain the visual feature vector of the sample picture in each set of sample data.

Using the text processing model to process the annotation text in each set of sample data, to obtain the semantic feature vector of the annotation text in each set of sample data.

Determining the weight of the annotation text in each set of sample data according to the visual feature vector of the sample picture in each set of sample data, where the weight of the annotation text in each set of sample data is positively correlated with the magnitude of the visual feature vector of the sample picture.

Computing, based on the weight of the annotation text in each set of sample data, the weighted sum of the semantic feature vectors of the annotation texts in the at least one set of sample data, to obtain the semantic feature vector corresponding to the first visual feature vector.

其中，该处理子模块4022根据每组样本数据中的样本图片的视觉特征向量，确定每组样本数据中的标注文本的权重的过程可以包括：Wherein, the processing sub-module 4022 determines the weight of the labeled text in each set of sample data according to the visual feature vector of the sample picture in each set of sample data may include:

根据该第一视觉特征向量V_t，以及每组样本数据中的样本图片的视觉特征向量，确定每组样本数据中的标注文本的权重，第i组样本数据中的标注文本的权重c_i满足：According to the first visual feature vector V _t and the visual feature vector of the sample picture in each set of sample data, determine the weight of the marked text in each set of sample data, and the weight c _i of the marked text in the i-th set of sample data satisfies :

其中，K为与每个视觉特征向量关联的样本数据的组数，表示V_t的转置，p_i为第i组样本数据中的样本图片的视觉特征向量，p_j为第j组样本数据中的样本图片的视觉特征向量，i和j均为不大于K的正整数。Among them, K is the group number of sample data associated with each visual feature vector, Indicates the transposition of V _t , p _i is the visual feature vector of the sample picture in the i-th group of sample data, p _j is the visual feature vector of the sample picture in the j-th group of sample data, and both i and j are not greater than K positive integer.
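The formula for c_i is not reproduced in this text. The symbols that survive (the inner product V_t^T p_i, an index j running over all K groups) suggest a normalized weighting over the K associated samples. The sketch below assumes a softmax, c_i = exp(V_t^T p_i) / Σ_j exp(V_t^T p_j), and then performs the weighted summation of the semantic feature vectors described above. The softmax form and the name `memory_read` are assumptions made for illustration.

```python
import math


def memory_read(v_t, sample_visual, sample_semantic):
    """Return (weights c_1..c_K, semantic feature vector for v_t).

    v_t                -- first visual feature vector V_t
    sample_visual[i]   -- p_i, visual feature vector of sample picture i
    sample_semantic[i] -- semantic feature vector of annotation text i
    """
    # Inner products V_t^T p_i for each of the K associated samples.
    scores = [sum(a * b for a, b in zip(v_t, p_i)) for p_i in sample_visual]
    # Softmax (assumed); subtracting the max keeps exp() numerically stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Weighted sum of the K semantic feature vectors.
    dim = len(sample_semantic[0])
    semantic = [
        sum(c_i * q_i[k] for c_i, q_i in zip(weights, sample_semantic))
        for k in range(dim)
    ]
    return weights, semantic
```

Under this assumption, a sample picture whose feature vector has a larger inner product with V_t gives its annotation text a larger share of the resulting semantic feature vector.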

Optionally, the acquisition submodule 4021 may be configured to:

Extract, using the memory model, a reference feature vector for each group of sample data in the sample database;

Calculate the vector distance between the first visual feature vector and the reference feature vector of each group of sample data;

Acquire at least one group of sample data whose vector distance is not greater than a preset distance threshold as the sample data associated with the first visual feature vector.
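The three steps above amount to a threshold-based nearest-neighbour lookup. A minimal sketch follows, assuming the reference feature vectors have already been extracted and that the vector distance is Euclidean; the text does not name the distance measure, and `retrieve_associated` is a hypothetical name.

```python
import math


def retrieve_associated(v_t, reference_vecs, threshold):
    """Return the indices of sample groups associated with v_t.

    v_t            -- first visual feature vector
    reference_vecs -- one reference feature vector per group in the database
    threshold      -- preset distance threshold
    """
    return [
        i
        for i, ref in enumerate(reference_vecs)
        if math.dist(v_t, ref) <= threshold  # "not greater than" the threshold
    ]
```

For example, `retrieve_associated([0.0, 0.0], [[0.0, 1.0], [3.0, 4.0], [1.0, 0.0]], 1.0)` keeps groups 0 and 2, whose reference vectors lie at distance 1, and drops group 1 at distance 5.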

Optionally, the memory model includes a picture processing model and a text processing model; the process by which the acquisition submodule 4021 uses the memory model to extract the reference feature vector of each group of sample data in the sample database may include:

Extracting, using the picture processing model, the reference feature vector of the sample picture in each group of sample data in the sample database.

Optionally, the description text of the target object is generated by a language model, and the language model may include at least one iteration unit, each iteration unit being used to generate one word;

The extraction module 401 may be configured to:

Extract from the target object at least one visual feature vector in one-to-one correspondence with the at least one iteration unit.

Correspondingly, the processing module 403 may be configured to:

Process, using each iteration unit of the at least one iteration unit in turn, the corresponding visual feature vector and semantic feature vector, obtaining at least one word;

Use the text composed of the at least one word as the description text of the target object.

Optionally, referring to FIG. 11, the extraction module 401 may include:

An extraction submodule 4011, configured to extract at least one basic visual feature vector of the target object, where the target object includes multiple frames of images and the at least one basic visual feature vector includes a visual feature vector extracted from each frame.

A determination submodule 4012, configured to determine, for each basic visual feature vector of the at least one basic visual feature vector, its attention coefficient with respect to each iteration unit.

A weighting submodule 4013, configured to, for any iteration unit, perform a weighted summation of the at least one basic visual feature vector based on the attention coefficient of each basic visual feature vector with respect to that iteration unit, obtaining the visual feature vector corresponding to that iteration unit.

Optionally, the language model is a model based on a recurrent neural network, and the determination submodule 4012 may be configured to:

Obtain the hidden-layer feature vector h_{t-1} of the (t-1)-th iteration unit, where t is a positive integer not greater than T and T is the number of iteration units included in the language model;

Determine, based on the hidden-layer feature vector h_{t-1}, the attention coefficient of each basic visual feature vector with respect to the t-th iteration unit, where the attention coefficient of the m-th basic visual feature vector x_m with respect to the t-th iteration unit satisfies:

where f_att is a preset linear transformation function, S is a preset normalization function, m is a positive integer not greater than M, and M is the number of basic visual feature vectors extracted from the target object.
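The attention formula is not reproduced in this text. From the definitions, the coefficient is obtained by applying the normalization S to f_att evaluated on x_m and h_{t-1}. The sketch below assumes f_att(x_m, h_{t-1}) = w_x·x_m + w_h·h_{t-1} and takes S to be a softmax over the M frames, then forms the weighted sum performed by the weighting submodule 4013. The parameter vectors `w_x` and `w_h`, the softmax choice and the function names are assumptions.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(frames, h_prev, w_x, w_h):
    """Return (attention coefficients, pooled visual feature vector).

    frames -- basic visual feature vectors x_1..x_M, one per frame
    h_prev -- hidden-layer feature vector h_{t-1} of the previous unit
    w_x    -- assumed linear weights applied to each x_m
    w_h    -- assumed linear weights applied to h_{t-1}
    """
    # f_att: a linear score for every frame (assumed form).
    scores = [_dot(w_x, x_m) + _dot(w_h, h_prev) for x_m in frames]
    # S: softmax normalization over the M frames (assumed form).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    coeffs = [e / norm for e in exps]
    # Weighted sum of the frame features for the t-th iteration unit.
    dim = len(frames[0])
    pooled = [
        sum(a_m * x_m[k] for a_m, x_m in zip(coeffs, frames))
        for k in range(dim)
    ]
    return coeffs, pooled
```

One consequence of this particular assumption: with a strictly additive linear f_att and a softmax, the h_{t-1} term shifts every score equally and cancels out, so an implementation that wants the coefficients to vary with t would need some interaction between x_m and h_{t-1} that the text does not specify.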

Optionally, the language model is a model based on a recurrent neural network;

The process by which the processing module 403 uses the first iteration unit of the at least one iteration unit to process the corresponding visual feature vector and semantic feature vector, obtaining a word, may include:

Processing, using the first iteration unit, the corresponding visual feature vector, the semantic feature vector, a preset initial feature vector, and a preset initialization word, obtaining a word.

The process by which the processing module 403 uses any iteration unit other than the first iteration unit to process the corresponding visual feature vector and semantic feature vector, obtaining a word, may include:

Processing, using that iteration unit, the corresponding visual feature vector, the semantic feature vector, the hidden-layer feature vector of the previous iteration unit, and the word generated by the previous iteration unit, obtaining a word.
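The per-unit processing described above can be sketched as a simple loop. The recurrent unit itself is abstracted as a caller-supplied `step` function, a hypothetical stand-in that returns the generated word and the new hidden-layer feature vector; the names `generate_description`, `h_init` and `init_word` are likewise illustrative.

```python
def generate_description(unit_inputs, step, h_init, init_word):
    """Run the iteration units in order and join their words.

    unit_inputs -- one (visual feature vector, semantic feature vector)
                   pair per iteration unit
    step        -- callable (visual, semantic, hidden, prev_word)
                   -> (word, new_hidden); stands in for one iteration unit
    h_init      -- preset initial feature vector for the first unit
    init_word   -- preset initialization word for the first unit
    """
    words = []
    hidden, prev_word = h_init, init_word
    for visual_vec, semantic_vec in unit_inputs:
        # The first unit sees the preset values; every later unit sees the
        # previous unit's hidden-layer feature vector and generated word.
        word, hidden = step(visual_vec, semantic_vec, hidden, prev_word)
        words.append(word)
        prev_word = word
    return " ".join(words)
```

Passing the unit as a callable keeps the sketch independent of any particular recurrent cell, which the text leaves open beyond stating that the model is based on a recurrent neural network.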

In summary, the embodiment of the present invention provides a description text generation device that can extract at least one visual feature vector from a target object, obtain the semantic feature vector corresponding to each visual feature vector, and then generate the description text of the target object based on the at least one visual feature vector and the semantic feature vector of each visual feature vector. Because the semantic feature vector corresponding to each visual feature vector reflects the semantic features of the target object, using it to assist the generation of the description text can improve the accuracy and flexibility of the description.

Regarding the device in the foregoing embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.

FIG. 12 shows a structural block diagram of a terminal 1200 provided by an exemplary embodiment of the present invention. The terminal 1200 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1200 may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, or by other names.

Generally, the terminal 1200 includes a processor 1201 and a memory 1202.

The processor 1201 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1201 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1201 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1201 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

The memory 1202 may include one or more computer-readable storage media, which may be non-transitory. The memory 1202 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1202 stores at least one instruction, which is executed by the processor 1201 to implement the description text generation method provided by the method embodiments of this application.

In some embodiments, the terminal 1200 may optionally further include a peripheral device interface 1203 and at least one peripheral device. The processor 1201, the memory 1202, and the peripheral device interface 1203 may be connected by a bus or signal lines. Each peripheral device may be connected to the peripheral device interface 1203 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1204, a touch display screen 1205, a camera 1206, an audio circuit 1207, a positioning component 1208, and a power supply 1209.

The peripheral device interface 1203 may be used to connect at least one I/O (Input/Output) peripheral device to the processor 1201 and the memory 1202. In some embodiments, the processor 1201, the memory 1202, and the peripheral device interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1201, the memory 1202, and the peripheral device interface 1203 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 1204 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1204 communicates with communication networks and other communication devices through electromagnetic signals. It converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1204 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 1204 may communicate with other terminals through at least one wireless communication protocol, including but not limited to metropolitan area networks, the successive generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1204 may also include circuits related to NFC (Near Field Communication), which is not limited in this application.

The display screen 1205 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1205 is a touch display screen, it is also able to collect touch signals on or above its surface. The touch signal may be input to the processor 1201 as a control signal for processing. In this case, the display screen 1205 may also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1205, arranged on the front panel of the terminal 1200; in other embodiments, there may be at least two display screens 1205, arranged on different surfaces of the terminal 1200 or in a folding design; in still other embodiments, the display screen 1205 may be a flexible display screen arranged on a curved or folding surface of the terminal 1200. The display screen 1205 may even be given a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 1205 may be made using LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or similar technologies.

The camera assembly 1206 is used to capture images or video. Optionally, the camera assembly 1206 includes a front camera and a rear camera. Usually, the front camera is arranged on the front panel of the terminal and the rear camera on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined for background blurring, and the main camera and the wide-angle camera can be combined for panoramic shooting, VR (Virtual Reality) shooting, or other combined shooting functions. In some embodiments, the camera assembly 1206 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.

The audio circuit 1207 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 1201 for processing or to the radio frequency circuit 1204 for voice communication. For stereo capture or noise reduction, there may be multiple microphones arranged at different parts of the terminal 1200. The microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1201 or the radio frequency circuit 1204 into sound waves. The speaker may be a conventional membrane speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into inaudible sound waves for purposes such as distance measurement. In some embodiments, the audio circuit 1207 may also include a headphone jack.

The positioning component 1208 is used to determine the current geographic location of the terminal 1200 for navigation or LBS (Location Based Service). The positioning component 1208 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 1209 is used to supply power to the components of the terminal 1200. The power supply 1209 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1209 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging, and may also support fast charging technology.

In some embodiments, the terminal 1200 further includes one or more sensors 1210. The one or more sensors 1210 include but are not limited to an acceleration sensor 1211, a gyroscope sensor 1212, a pressure sensor 1213, a fingerprint sensor 1214, an optical sensor 1215, and a proximity sensor 1216.

The acceleration sensor 1211 can detect the magnitude of acceleration along the three axes of a coordinate system established with respect to the terminal 1200. For example, the acceleration sensor 1211 can be used to detect the components of gravitational acceleration along the three axes. The processor 1201 may, according to the gravitational acceleration signal collected by the acceleration sensor 1211, control the touch display screen 1205 to display the user interface in landscape or portrait view. The acceleration sensor 1211 can also be used to collect motion data for games or for the user.

The gyroscope sensor 1212 can detect the body orientation and rotation angle of the terminal 1200, and can work with the acceleration sensor 1211 to capture the user's 3D actions on the terminal 1200. Based on the data collected by the gyroscope sensor 1212, the processor 1201 can implement motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

The pressure sensor 1213 may be arranged on the side frame of the terminal 1200 and/or beneath the touch display screen 1205. When arranged on the side frame, the pressure sensor 1213 can detect the user's grip signal on the terminal 1200, and the processor 1201 performs left-hand or right-hand recognition or shortcut operations according to the grip signal. When arranged beneath the touch display screen 1205, the processor 1201 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 1205. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 1214 is used to collect the user's fingerprint. Either the processor 1201 identifies the user from the fingerprint collected by the fingerprint sensor 1214, or the fingerprint sensor 1214 itself identifies the user from the collected fingerprint. When the user's identity is recognized as trusted, the processor 1201 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 1214 may be arranged on the front, back, or side of the terminal 1200. When the terminal 1200 has a physical button or a manufacturer logo, the fingerprint sensor 1214 may be integrated with the physical button or the manufacturer logo.

The optical sensor 1215 is used to collect the ambient light intensity. In one embodiment, the processor 1201 may control the display brightness of the touch display screen 1205 according to the ambient light intensity collected by the optical sensor 1215: when the ambient light intensity is high, the display brightness is increased; when it is low, the display brightness is decreased. In another embodiment, the processor 1201 may also dynamically adjust the shooting parameters of the camera assembly 1206 according to the ambient light intensity collected by the optical sensor 1215.

The proximity sensor 1216, also called a distance sensor, is usually arranged on the front panel of the terminal 1200 and is used to measure the distance between the user and the front of the terminal 1200. In one embodiment, when the proximity sensor 1216 detects that the distance between the user and the front of the terminal 1200 is gradually decreasing, the processor 1201 switches the touch display screen 1205 from the screen-on state to the screen-off state; when the proximity sensor 1216 detects that the distance is gradually increasing, the processor 1201 switches the touch display screen 1205 from the screen-off state to the screen-on state.

Those skilled in the art will understand that the structure shown in FIG. 12 does not limit the terminal 1200, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.

An embodiment of the present invention also provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the description text generation method provided by the foregoing embodiments.

Those of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A description text generation method, characterized in that the method comprises:

extracting at least one visual feature vector from a target object, the target object being a video or a picture;

obtaining a semantic feature vector corresponding to each visual feature vector;

processing the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector to obtain a description text of the target object.

2. The method according to claim 1, wherein the obtaining a semantic feature vector corresponding to each visual feature vector comprises:

obtaining at least one group of sample data associated with each visual feature vector, each group of sample data including a sample picture and annotation text corresponding to the sample picture, the annotation text including attribute annotation text or relationship annotation text;

processing, by using a memory model, the at least one group of sample data associated with each visual feature vector to obtain the semantic feature vector corresponding to each visual feature vector.

3. The method according to claim 2, wherein the memory model comprises a picture processing model and a text processing model, and the method further comprises:

obtaining at least one group of training data, each group of training data including a training picture and training annotation text corresponding to the training picture, the training annotation text including attribute annotation text or relationship annotation text;

processing the training picture in each group of training data by using the picture processing model to obtain the visual feature vector of the training picture in each group of training data;

processing the training annotation text in each group of training data by using the text processing model to obtain the semantic feature vector of the training annotation text in each group of training data;

constructing a loss function based on the visual feature vector of the training picture and the semantic feature vector of the training annotation text;

training the memory model by using the loss function to obtain the picture processing model and the text processing model.

4. The method according to claim 3, wherein the number of groups of training data obtained is N, and the constructing a loss function based on the visual feature vector of the training picture and the semantic feature vector of the training annotation text comprises:

calculating, for each group of training data, the distance between the visual feature vector of the training picture and the semantic feature vector of the training annotation text, obtaining N distances, where the distance d_n between the visual feature vector F_v(v_n) of the training picture v_n in the n-th group of training data and the semantic feature vector F_s(s_n) of the training annotation text s_n satisfies d_n = ||F_v(v_n) - F_s(s_n)||, n being a positive integer not greater than N;

constructing a loss function according to the N distances, the loss function L satisfying:

where l_n denotes the pairing label of the n-th group of sample data and takes the value 0 or 1, τ is a preset hyperparameter, max(τ - d_i, 0) denotes the larger of (τ - d_i) and 0, w denotes the parameters of the memory model, Ω(w) denotes the 2-norm of w, and λ is a weight decay factor.

5. The method according to any one of claims 2 to 4, wherein the memory model includes a picture processing model and a text processing model, and the processing, by using the memory model, of the at least one set of sample data associated with the first visual feature vector to obtain the semantic feature vector corresponding to the first visual feature vector includes:

Processing the sample pictures in each group of sample data by using the picture processing model to obtain the visual feature vectors of the sample pictures in each group of sample data;

Processing the labeled text in each set of sample data by using the text processing model to obtain a semantic feature vector of the labeled text in each set of sample data;

Determining the weight of the labeled text in each set of sample data according to the visual feature vector of the sample picture in that set of sample data, wherein the weight of the labeled text in each set of sample data is positively correlated with the size of the visual feature vector of the sample picture;

Performing, based on the weight of the labeled text in each set of sample data, a weighted summation of the semantic feature vectors of the labeled text in the at least one set of sample data to obtain the semantic feature vector corresponding to the first visual feature vector.

6. The method according to claim 5, wherein the determining the weight of the labeled text in each set of sample data according to the visual feature vector of the sample picture in each set of sample data includes:

Determining the weight of the labeled text in each set of sample data according to the first visual feature vector $V_t$ and the visual feature vector of the sample picture in each set of sample data, wherein the weight $c_i$ of the labeled text in the i-th set of sample data satisfies:

where $K$ is the number of sets of sample data associated with each visual feature vector, $V_t^T$ denotes the transpose of $V_t$, $p_i$ is the visual feature vector of the sample picture in the i-th set of sample data, $p_j$ is the visual feature vector of the sample picture in the j-th set of sample data, and i and j are both positive integers not greater than K.
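The symbols defined above (the transpose of $V_t$, the vectors $p_i$ and $p_j$, and an index j running over all K sets) are consistent with a softmax over inner products. A minimal sketch, assuming that form, is given below together with the weighted summation of claim 5. The softmax form and the names `dot`, `text_weights` and `semantic_vector` are assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product V_t^T p of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def text_weights(v_t, sample_visual):
    """Assumed form: c_i = exp(V_t^T p_i) / sum_j exp(V_t^T p_j), j = 1..K."""
    scores = [dot(v_t, p) for p in sample_visual]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def semantic_vector(v_t, sample_visual, sample_semantic):
    """Weighted sum of the K semantic feature vectors of the labeled text."""
    c = text_weights(v_t, sample_visual)
    dim = len(sample_semantic[0])
    return [sum(c[i] * sample_semantic[i][k] for i in range(len(c)))
            for k in range(dim)]
```

With this choice the K weights are positive and sum to 1, so the result is a convex combination of the semantic feature vectors of the associated sample data.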

7. The method according to any one of claims 2 to 4, wherein obtaining at least one set of sample data associated with the first visual feature vector comprises:

Extracting, by using the memory model, the reference feature vector of each set of sample data in the sample database;

Calculating the vector distance between the first visual feature vector and the reference feature vector of each set of sample data;

Acquiring at least one set of sample data whose vector distance is not greater than a preset distance threshold as the sample data associated with the first visual feature vector.
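The selection step of this claim can be sketched as a threshold filter over precomputed reference feature vectors. The sketch below assumes the vector distance is Euclidean, which the claim does not specify; the function name `associated_samples` is hypothetical.

```python
import math

def associated_samples(first_visual, reference_vectors, threshold):
    """Return the indices of sample sets whose reference vector is within the threshold."""
    selected = []
    for index, ref in enumerate(reference_vectors):
        # Euclidean distance is an assumption; any vector distance could be used.
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(first_visual, ref)))
        if distance <= threshold:  # "not greater than" the preset threshold
            selected.append(index)
    return selected
```

Because the reference feature vectors depend only on the sample database, they can be extracted once and reused for every first visual feature vector.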

8. The method according to claim 7, wherein the memory model comprises a picture processing model and a text processing model;

The extracting, by using the memory model, of the reference feature vector of each set of sample data in the sample database includes:

Extracting, by using the picture processing model, the reference feature vector of the sample picture in each set of sample data in the sample database.

9. The method according to any one of claims 1 to 4, wherein the description text of the target object is generated by a language model, the language model includes at least one iteration unit, and each iteration unit is used to generate one word;

The extracting at least one visual feature vector from the target object includes:

Extracting at least one visual feature vector corresponding to the at least one iteration unit one-to-one from the target object;

The processing of the at least one visual feature vector and the semantic feature vector corresponding to each of the visual feature vectors to obtain the description text of the target object includes:

Using each iteration unit in the at least one iteration unit in turn to process the corresponding visual feature vector and semantic feature vector to obtain at least one word;

The text composed of the at least one word is used as the description text of the target object.

10. The method according to claim 9, wherein said extracting at least one visual feature vector from the target object comprises:

Extracting at least one basic visual feature vector of the target object, where the target object includes multiple frame images, and the at least one basic visual feature vector includes a visual feature vector extracted from each frame of image;

Determining, for each basic visual feature vector in the at least one basic visual feature vector, the attention coefficient of that basic visual feature vector with respect to each iteration unit;

For any iteration unit, performing a weighted summation of the at least one basic visual feature vector based on the attention coefficient of each basic visual feature vector with respect to that iteration unit, to obtain the visual feature vector corresponding to that iteration unit.

11. The method according to claim 10, wherein the language model is a model based on a recurrent neural network, and the determining the attention coefficient of each basic visual feature vector in the at least one basic visual feature vector with respect to the t-th iteration unit includes:

Obtaining the feature vector $h_{t-1}$ of the hidden layer in the (t-1)-th iteration unit, where t is a positive integer not greater than T, and T is the number of iteration units included in the language model;

Determining, based on the feature vector $h_{t-1}$ of the hidden layer, the attention coefficient of each basic visual feature vector with respect to the t-th iteration unit, wherein the attention coefficient of the m-th basic visual feature vector $x_m$ with respect to the t-th iteration unit satisfies:

where $f_{att}$ is a preset linear transformation function, $S$ is a preset normalization function, m is a positive integer not greater than M, and M is the number of basic visual feature vectors extracted from the target object.
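As a minimal sketch of how the attention coefficients and the weighted summation of claim 10 could fit together, the code below takes $f_{att}$ to be a linear map applied to the concatenation of $h_{t-1}$ and $x_m$, and takes $S$ to be a softmax. Both choices, the parameter `w_att`, and the function names are assumptions made for illustration; the claim only requires a linear transformation and a normalization function.

```python
import math

def attention_coefficients(h_prev, basic_vectors, w_att):
    """Score each x_m with a linear map of [h_{t-1}; x_m], then normalize (softmax)."""
    scores = [sum(w * z for w, z in zip(w_att, h_prev + x)) for x in basic_vectors]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def attended_visual_vector(h_prev, basic_vectors, w_att):
    """Weighted sum of the M basic visual feature vectors for one iteration unit."""
    alpha = attention_coefficients(h_prev, basic_vectors, w_att)
    dim = len(basic_vectors[0])
    return [sum(a * x[k] for a, x in zip(alpha, basic_vectors)) for k in range(dim)]
```

Here `h_prev` and each entry of `basic_vectors` are plain Python lists, and `w_att` has one entry per element of their concatenation. Because the coefficients depend on $h_{t-1}$, each iteration unit can weight the frames of a video differently.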

12. The method according to claim 9, wherein the language model is a model based on a recurrent neural network;

The processing of the corresponding visual feature vector and semantic feature vector by using the first iteration unit in the at least one iteration unit to obtain a word includes:

Using the first iteration unit to process the corresponding visual feature vector, semantic feature vector, preset initial feature vector and preset initialization word to obtain a word;

The processing of the corresponding visual feature vector and semantic feature vector by using any iteration unit other than the first iteration unit to obtain a word includes:

Using that iteration unit to process the corresponding visual feature vector, the corresponding semantic feature vector, the feature vector of the hidden layer of the previous iteration unit, and the word generated by the previous iteration unit to obtain a word.
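The data flow between iteration units described in this claim can be sketched as a simple loop. The callable `step` is a hypothetical stand-in for one iteration unit of the recurrent language model, and joining the words with spaces is an assumption; neither is specified by the claim.

```python
def generate_description(step, visual, semantic, h0, start_word):
    """Run the iteration units in order and join the generated words.

    `step(v_t, s_t, h_prev, prev_word)` is a hypothetical iteration unit that
    returns `(h_t, word_t)`: its hidden-layer feature vector and its word.
    """
    # The first unit receives the preset initial feature vector and word.
    h_prev, prev_word = h0, start_word
    words = []
    for v_t, s_t in zip(visual, semantic):
        # Every later unit receives the previous unit's hidden state and word.
        h_prev, prev_word = step(v_t, s_t, h_prev, prev_word)
        words.append(prev_word)
    return " ".join(words)
```

The loop consumes one visual feature vector and one semantic feature vector per iteration unit, matching the one-to-one correspondence required by claim 9.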

13. A device for generating description text, characterized in that the device comprises:

An extraction module, configured to extract at least one visual feature vector from a target object, where the target object is a video or a picture;

An acquisition module, configured to acquire a semantic feature vector corresponding to each of the visual feature vectors;

A processing module, configured to process the at least one visual feature vector and the semantic feature vector corresponding to each visual feature vector to obtain the description text of the target object.

14. A terminal, characterized in that the terminal includes a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the description text generation method according to any one of claims 1 to 12.

15. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the description text generation method according to any one of claims 1 to 12.