
CN115114479A - Method and device for generating video label, storage medium and electronic device - Google Patents

Method and device for generating video label, storage medium and electronic device

Info

Publication number
CN115114479A
CN115114479A
Authority
CN
China
Prior art keywords
video
target
vector
character
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210404088.8A
Other languages
Chinese (zh)
Other versions
CN115114479B (en)
Inventor
徐鲁辉
熊鹏飞
陈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210404088.8A priority Critical patent/CN115114479B/en
Publication of CN115114479A publication Critical patent/CN115114479A/en
Application granted granted Critical
Publication of CN115114479B publication Critical patent/CN115114479B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for generating video tags, a storage medium, and an electronic device. The method includes: acquiring a target video to be identified, where the target video carries video description text describing the target video; constructing a target candidate character library for the target video from the video description text; after extracting video features of the target video and description features of the video description text, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that character; determining a target character from each candidate character sequence based on the confidences, yielding M target characters; and splicing the M target characters into a target video tag matching the target video. The invention solves the technical problem that video tags determined by existing methods have low accuracy.

Description

Method and device for generating a video tag, storage medium, and electronic device

Technical Field

The present invention relates to the field of computers, and in particular to a method and device for generating a video tag, a storage medium, and an electronic device.

Background Art

With the development of online video platforms, Internet users are increasingly accustomed to obtaining the latest information by browsing videos. When an online video is distributed, the video platform usually matches tags to it and then implements video recommendation and video search functions based on those tags. For an online video, fresh and well-matched tags allow users to discover new trending events immediately, which in turn increases the click-through rate of new trending videos.

Existing methods for matching tags to online videos usually perform multiple classification and clustering operations on extracted video features during the matching process, and then select a tag that matches the video from a predefined tag system.

It follows that existing tag determination methods can only select tags matching a video from a preset tag system. For a video whose content concerns a fresh event, the preset tag library contains no tag entry corresponding to that event, so no fresh tag can be matched to the fresh video. As a result, when the platform provides tag-based video recommendation and video search, it can neither recommend accurately nor return accurate search results. In other words, existing methods for determining video tags suffer from the technical problem that the determined tags have low accuracy.

No effective solution to the above problems has yet been proposed.

Summary of the Invention

Embodiments of the present invention provide a method and device for generating a video tag, a storage medium, and an electronic device, so as to at least solve the technical problem that video tags determined by existing methods have low accuracy.

According to one aspect of the embodiments of the present invention, a method for generating a video tag is provided, including: acquiring a target video to be identified, where the target video carries video description text describing the target video; constructing a target candidate character library for the target video from the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text; after extracting video features of the target video and description features of the video description text, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1; determining a target character from each candidate character sequence based on the confidences to obtain M target characters; and splicing the M target characters into a target video tag matching the target video.

According to another aspect of the embodiments of the present invention, a device for generating a video tag is also provided, including: an acquisition unit configured to acquire a target video to be identified, where the target video carries video description text describing the target video; a construction unit configured to construct a target candidate character library for the target video from the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text; a computation unit configured to, after video features of the target video and description features of the video description text have been extracted, perform N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1; a determination unit configured to determine a target character from each candidate character sequence based on the confidences to obtain M target characters; and a splicing unit configured to splice the M target characters into a target video tag matching the target video.

According to a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, where the computer program is configured to execute the above method for generating a video tag when run.

According to a further aspect of the embodiments of the present application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above method for generating a video tag.

According to a further aspect of the embodiments of the present invention, an electronic device is also provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to execute the above method for generating a video tag by means of the computer program.

In the embodiments of the present invention, a target video to be identified is acquired, and a target candidate character library for the target video is constructed from its video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text. After video features of the target video and description features of the video description text are extracted, N iterations of computation are performed on the video features and the description features to obtain M candidate character sequences; a target character is determined from each candidate character sequence based on the confidences, yielding M target characters; and the M target characters are spliced into a target video tag matching the target video. That is, in the embodiments of this application, the video features and description text features of the target video are computed iteratively, target characters are output character by character from the library formed by the reference character set and the description character set built from the video description text, and the output target characters are assembled into the target video tag. Because the target candidate character library used for outputting characters incorporates the description character set, which contains fresh text content related to the video, fresh characters can be output with its help and spliced into fresh tags. This improves the accuracy of the output video tags and solves the technical problem that video tags obtained by existing determination methods have low accuracy.

Brief Description of the Drawings

The accompanying drawings described here are provided for a further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

FIG. 1 is a schematic diagram of the hardware environment of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 2 is a flowchart of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of another optional method for generating a video tag according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another optional method for generating a video tag according to an embodiment of the present invention;

FIG. 7 is a flowchart of another optional method for generating a video tag according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an optional device for generating a video tag according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an optional electronic device according to an embodiment of the present invention.

Detailed Description of Embodiments

To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.

According to one aspect of the embodiments of the present invention, a method for generating a video tag is provided. As an optional implementation, the method may be applied, but is not limited, to the video tag generation system shown in FIG. 1, which consists of a server 102 and a terminal device 104. As shown in FIG. 1, the server 102 is connected to the terminal device 104 through a network 110. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, WiFi, and other networks that implement wireless communication. The terminal device may include, but is not limited to, at least one of the following: a mobile phone (such as an Android or iOS phone), a notebook computer, a tablet computer, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, an in-vehicle device, and so on. A client, such as a video sharing client, may be installed on the terminal device. The terminal device is also provided with a display, a processor, and a memory: the display can show the interface of the video upload application and the video content on the upload server; the processor can preprocess the video file to be uploaded before transmission, for example by compressing the acquired video file; and the memory stores the video to be uploaded. It can be understood that after the terminal device 104 acquires the target video to be uploaded, it can send the target video to the server 102 through the network 110. Upon receiving the target video, the server 102 generates a video tag matching the video uploaded by the terminal device 104, and the terminal device 104 can receive the returned video tag from the server 102 through the network 110. The server 102 may be a single server, a server cluster composed of multiple servers, or a cloud server. The server includes a database and a processing engine, where the database may contain a historical tag thesaurus for matching tags to videos and a pre-trained tag generation model, and the processing engine is used to generate target video tags from the acquired target video.

According to one aspect of the embodiments of the present invention, the above video tag generation system may further perform the following steps. The terminal device 104 performs step S102 to acquire the target video to be identified, and then performs step S104 to send the target video to the server 102 through the network 110. The server 102 performs steps S106 to S114: constructing a target candidate character library for the target video from the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text; after extracting video features of the target video and description features of the video description text, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1; determining a target character from each candidate character sequence based on the confidences to obtain M target characters; and splicing the M target characters into a target video tag matching the target video. Then, in step S116, the server 102 sends the target video tag to the terminal device 104 through the network 110. It can be understood that when the terminal device 104 has sufficient computing capability, steps S106 to S114 may also be performed on the terminal device 104.

In the embodiments of the present invention, a target video to be identified is acquired, and a target candidate character library for the target video is constructed from its video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text. After video features of the target video and description features of the video description text are extracted, N iterations of computation are performed on the video features and the description features to obtain M candidate character sequences; a target character is determined from each candidate character sequence based on the confidences, yielding M target characters; and the M target characters are spliced into a target video tag matching the target video. That is, in the embodiments of this application, the video features and description text features of the target video are computed iteratively, target characters are output character by character from the library formed by the reference character set and the description character set built from the video description text, and the output target characters are assembled into the target video tag. Because the target candidate character library used for outputting characters incorporates the description character set, which contains fresh text content related to the video, fresh characters can be output with its help and spliced into fresh tags. This improves the accuracy of the output video tags and solves the technical problem that video tags obtained by existing determination methods have low accuracy.

The above is only an example, and this embodiment does not impose any limitation in this respect.

As an optional implementation, as shown in FIG. 2, the above method for generating a video tag includes the following steps:

S202: acquire a target video to be identified, where the target video carries video description text describing the target video.

It should be noted that the video description text may include, but is not limited to, one or more kinds of text information describing the content of the target video, such as the title of the target video, its subtitles, its tag information, and text appearing in its frames. The specific text information is not limited here.

S204: construct a target candidate character library for the target video from the video description text.

The target candidate character library includes a reference character set and a description character set. The reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text.

The above method is explained below with a concrete example. Suppose that, in past tag generation, it has been determined that videos can be classified with the tags "游戏" (game), "爱情" (love), and "电视剧" (TV drama). From these historical tags the reference character set can be determined as the characters 游, 戏, 爱, 情, 电, 视, 剧.

Suppose the description text of the target video to be identified is a subtitle in the video, "城市英雄游戏真好玩，大家快来玩吧！" ("The City Hero game is great fun, come and play!"). From this description text the description character set can be determined as 城, 市, 英, 雄, 游, 戏, 真, 好, 玩, 大, 家, 快, 来, 玩, 吧.

Combining the reference character set and the description character set yields the target candidate character library 游, 戏, 爱, 情, 电, 视, 剧, 城, 市, 英, 雄, 游, 戏, 真, 好, 玩, 大, 家, 快, 来, 玩, 吧.

This library contains the characters 游 and 戏 from both the description character set and the reference character set. In an optional implementation, the repeated characters in the library can be deduplicated to obtain the updated target candidate character library 游, 戏, 爱, 情, 电, 视, 剧, 城, 市, 英, 雄, 真, 好, 玩, 大, 家, 快, 来, 吧.

It can be understood that the above way of constructing the target candidate character library is only an example; in specific implementations, the target candidate character library may also be determined in other, similar ways, and the specific construction method is not limited here.
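As a non-authoritative illustration, the following Python sketch shows one way the construction just described could be implemented; the function name is hypothetical, character-level segmentation is assumed, and the deduplication step corresponds to the optional update above.

def build_candidate_library(historical_tags, description_text):
    """Build the target candidate character library (reference set + description set)."""
    reference_chars = [ch for tag in historical_tags for ch in tag]       # reference character set
    description_chars = [ch for ch in description_text if ch.isalnum()]   # drop punctuation
    seen, library = set(), []
    for ch in reference_chars + description_chars:
        if ch not in seen:  # optional deduplication described above
            seen.add(ch)
            library.append(ch)
    return library

# Reproduces the example above:
tags = ["游戏", "爱情", "电视剧"]
caption = "城市英雄游戏真好玩，大家快来玩吧！"
print(build_candidate_library(tags, caption))
# ['游', '戏', '爱', '情', '电', '视', '剧', '城', '市', '英', '雄', '真', '好', '玩', '大', '家', '快', '来', '吧']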

S206: after video features of the target video and description features of the video description text have been extracted, perform N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that character.

It can be understood that the confidence indicates the degree of matching between a candidate character and the target video, and that N and M are natural numbers greater than or equal to 1.

It can be understood that the video features may be visual features characterizing the style of the video frames, or video content features characterizing the content of the frames; the specific content represented by the video features is not limited here. The description features are text features corresponding to the video description text, which may be title text features, subtitle text features, or features of text appearing in the video frames; the specific content of the video description text features is likewise not limited here.

It can be understood that, in the above iterative computation, the video features and the description text features can be converted into corresponding vectors to facilitate iteration. Moreover, "iterative" here means that the input of the current iteration is related to the output of the previous iteration; that is, in the above embodiments of this application, later results of the iterative computation are correlated with its earlier results.

S208: determine a target character from each candidate character sequence based on the confidences, obtaining M target characters.

The candidate character sequences are explained below. It can be understood that a candidate character sequence consists of the matching confidence of every character in the target candidate character library. The explanation continues with the deduplicated candidate library 游, 戏, 爱, 情, 电, 视, 剧, 城, 市, 英, 雄, 真, 好, 玩, 大, 家, 快, 来, 吧 from the example above.

Suppose that after the first iteration of computation on the video features and description features of the above video, the first candidate character sequence obtained is: 游 (98%), 戏 (90%), 爱 (10%), 情 (10%), 电 (5%), 视 (8%), 剧 (8%), 城 (58%), 市 (60%), 英 (88%), 雄 (70%), 真 (60%), 好 (70%), 玩 (60%), 大 (28%), 家 (5%), 快 (18%), 来 (8%), 吧 (1%).

Suppose the output rule is to emit the character with the highest confidence; from the above candidate sequence, the first character output for the current sequence is 游.

Suppose that a second iteration is then computed on the video features and description features together with the result of the previous iteration, and that the second candidate character sequence obtained is: 游 (70%), 戏 (98%), 爱 (10%), 情 (10%), 电 (5%), 视 (8%), 剧 (8%), 城 (58%), 市 (60%), 英 (88%), 雄 (70%), 真 (60%), 好 (70%), 玩 (60%), 大 (28%), 家 (5%), 快 (18%), 来 (8%), 吧 (1%).

Again emitting the character with the highest confidence, the second character output for the current sequence is 戏.

Iterating in this way several times, suppose the M characters finally output are 游, 戏, 城, 市, 英, 雄, 好, 游, 戏.
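A minimal sketch of the greedy selection rule used in this walkthrough: each iteration emits the candidate character with the highest confidence. The confidence values below are the illustrative numbers from the example, not real model outputs, and the function name is an assumption.

def pick_target_char(candidate_sequence):
    """candidate_sequence: list of (character, confidence) pairs for one iteration."""
    return max(candidate_sequence, key=lambda pair: pair[1])[0]

iteration_1 = [("游", 0.98), ("戏", 0.90), ("城", 0.58), ("市", 0.60), ("英", 0.88)]
iteration_2 = [("游", 0.70), ("戏", 0.98), ("城", 0.58), ("市", 0.60), ("英", 0.88)]
print([pick_target_char(seq) for seq in (iteration_1, iteration_2)])  # ['游', '戏']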

S210: splice the M target characters into a target video tag matching the target video.

Continuing the example above, when the M target characters obtained are 游, 戏, 城, 市, 英, 雄, 好, 游, 戏, the characters can be spliced in the order they were output to obtain three tags: "游戏" (game), "城市英雄" (City Hero), and "好游戏" (good game).

In an optional approach, identifiers can also be output during character generation to indicate how the target video tags should be spliced. For example, when the M target characters obtained are 游, 戏, CLS, 城, 市, 英, 雄, CLS, 好, 游, 戏, the separator identifier "CLS" between 游, 戏 and 城, 市, 英, 雄 indicates that 游 and 戏 are spliced into "游戏" as the first target tag, that 城, 市, 英, 雄 are spliced into "城市英雄" as the second target tag, and that 好, 游, 戏 after the second separator are spliced into "好游戏" as the third target tag.

In another optional approach, when the M target characters obtained are the unordered characters 游, 游, 戏, 城, 英, 市, 雄, 好, 戏, semantic analysis can determine that 游 and 戏 are spliced into the first tag "游戏", that 游, 好, 戏 are spliced into the second tag "好游戏", and that 城, 英, 市, 雄 are spliced into the third tag "城市英雄". It follows that when a promotional video for a new game, City Hero, goes online, the above implementation can produce the tag "城市英雄" (City Hero) through iterative computation based on the fresh characters supplied by the description character library, even though the original historical tags do not contain "City Hero". This achieves the technical effect of automatically generating fresh tags and improves the efficiency and accuracy of tag generation. Moreover, because tags are produced character by character in the above implementation, the tag "好游戏" (good game) can be generated even though neither the description text nor the historical tag library originally contains it. A user searching for "good game" can then still retrieve the video, increasing its click-through rate and viewing popularity.
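Under the assumption that the separator token is the literal string "CLS", as in the example above, a sketch of the separator-based splicing might look as follows; the function name is illustrative, not the patent's API.

def splice_tags(emitted_tokens, separator="CLS"):
    """Group emitted characters into tags, splitting at each separator token."""
    tags, current = [], []
    for token in emitted_tokens:
        if token == separator:
            if current:  # a separator closes the tag built so far
                tags.append("".join(current))
            current = []
        else:
            current.append(token)
    if current:
        tags.append("".join(current))
    return tags

emitted = ["游", "戏", "CLS", "城", "市", "英", "雄", "CLS", "好", "游", "戏"]
print(splice_tags(emitted))  # ['游戏', '城市英雄', '好游戏']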

Another specific implementation of the above method is described below with reference to FIG. 3.

As shown in FIG. 3, a cover image of a video is displayed. The image contains the text "今天你馋了吗" ("Are you craving something today?"); the title of the video is "追剧必备的小零食来一波，快@他做给你吃吧" ("Here comes a wave of must-have snacks for binge-watching, hurry and @ him to make them for you"); and the video includes the subtitle "这个教程非常简单" ("This tutorial is very simple").

For the video image shown in FIG. 3, the above related text can serve as the video description text information, which thus includes the text in the video image ("Are you craving something today?"), the text of the video title ("Here comes a wave of must-have snacks for binge-watching, hurry and @ him to make them for you"), and the subtitle ("This tutorial is very simple").

Next, video features of the video are extracted. A video feature may be an image feature of the video cover image shown in FIG. 3; the video category feature "food" can also be extracted as another video feature. Description features of the video description text information are extracted as well, that is, the feature information corresponding to the above description text. In a concrete form, all of these features can be represented as vectors or matrices.

Further, based on the extracted visual features, category features, and description features, a confidence is determined for each character in the candidate character library formed from the reference character library and the description character library. Suppose the historical tag texts preset for producing the reference characters do not include the tag "美食教程" (food tutorial). A traditional tag extraction approach can only select tags from the preset historical tag library and therefore cannot produce "food tutorial", a tag highly relevant to the video; that is, it suffers from the technical problem that the extracted tags have low accuracy.

In this embodiment, however, the current candidate character library includes both the reference characters obtained by segmenting historical tags and the characters of the current video description text, i.e., the text in the video image ("Are you craving something today?"), the title ("Here comes a wave of must-have snacks for binge-watching, hurry and @ him to make them for you"), the subtitle ("This tutorial is very simple"), and so on. By the above method, a confidence is obtained for each character in the candidate library formed from the historical tag library and this text content. Because the description text contains the key text "教程" (tutorial) and the video's category feature also reflects the key feature "美食" (food), the tag "美食教程" (food tutorial) can be output.

Suppose the relevant features of the above video are computed iteratively and the output characters are 油, 炸, 食, 品, 脆, 皮, 花, 生, 美, 食, 教, 程, 零, 食. The final output target tags, as shown in FIG. 3, are then "油炸食品" (fried food), "脆皮花生" (crispy peanuts), "美食教程" (food tutorial), and "零食" (snacks).

It can be seen that, by the above method of this application, tags not included in the historical tag library can be output based on the content of the video's description text, achieving the technical effect of improving the accuracy of the output video tags.

In the embodiments of the present invention, a target video to be identified is acquired, and a target candidate character library for the target video is constructed from its video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text. After video features of the target video and description features of the video description text are extracted, N iterations of computation are performed on the video features and the description features to obtain M candidate character sequences; a target character is determined from each candidate character sequence based on the confidences, yielding M target characters; and the M target characters are spliced into a target video tag matching the target video. That is, in the embodiments of this application, the video features and description text features of the target video are computed iteratively, target characters are output character by character from the library formed by the reference character set and the description character set built from the video description text, and the output target characters are assembled into the target video tag. Because the target candidate character library used for outputting characters incorporates the description character set, which contains fresh text content related to the video, fresh characters can be output with its help and spliced into fresh tags. This improves the accuracy of the output video tags and solves the technical problem that video tags obtained by existing determination methods have low accuracy.

As an optional implementation, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences includes:

S1: during the i-th iteration of computation on the video features and the description features, compute, in a vector conversion network, the i-th intermediate latent vector set based on the visual features among the video features, the description features, and the iteration reference vector obtained in the (i-1)-th iteration, where the i-th intermediate latent vector set includes an i-th visual latent vector matching the visual features, an i-th description latent vector matching the description features, and the iteration reference vector obtained in the i-th iteration, and i is a natural number greater than or equal to 1 and less than or equal to N;

S2: based on the iteration reference vector obtained in the i-th iteration, determine the matching confidence of each character in the target candidate character library and determine the j-th candidate character sequence from those confidences, where j is a natural number greater than or equal to 0 and less than or equal to i;

S3: if the iteration reference vector obtained in the i-th iteration does not satisfy the end condition, perform the (i+1)-th iteration;

S4: if the iteration reference vector obtained in the i-th iteration satisfies the end condition, determine the iteration reference vectors obtained in the first i iterations as the N iteration reference vectors.
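Steps S1 to S4 amount to a decoding loop that stops when the iteration reference vector signals the end condition. The following schematic sketch assumes the vector conversion network is wrapped in a hypothetical callable step(visual, description, previous_ref) that returns the new reference vector, the per-character confidences, and an end flag; all names are assumptions, not the patent's API.

def iterative_decode(step, visual_features, description_features, max_steps=64):
    """Run S1-S4: iterate until the reference vector reaches the end condition."""
    reference_vector = None          # no reference vector exists before the first iteration
    reference_vectors, candidate_sequences = [], []
    for i in range(1, max_steps + 1):
        reference_vector, confidences, reached_end = step(
            visual_features, description_features, reference_vector)
        reference_vectors.append(reference_vector)   # S1: i-th iteration reference vector
        if reached_end:                              # S4: end condition met, stop iterating
            break
        # S2: j-th candidate sequence; a separator iteration could be skipped here, so j <= i
        candidate_sequences.append(confidences)
    return reference_vectors, candidate_sequences    # S3 is the loop continuing otherwise

# Toy usage with a dummy step that stops after three iterations:
state = {"i": 0}
def dummy_step(visual, description, previous_ref):
    state["i"] += 1
    return f"ref_{state['i']}", {"游": 0.9}, state["i"] == 3
print(iterative_decode(dummy_step, visual_features=None, description_features=None))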

It should be noted that the vector conversion network may be a Transformer model. A typical Transformer architecture consists of an encoder and a decoder, collectively referred to as Transformer modules; the difference between the two is that during computation the decoder can only see information at the current position and earlier positions. Models that can be used in this embodiment include the BERT model (Bidirectional Encoder Representations from Transformers), the GPT model (Generative Pre-Training), the UniLM model (Unified Language Model), and the VPUniLM model. As a preferred option, the VPUniLM model can be used as the vector conversion network in this embodiment.
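To make the decoder's visibility constraint concrete, the sketch below builds a UniLM-style sequence-to-sequence attention mask in which the feature prefix (visual and description vectors) is fully visible while each generated position can only attend to the prefix and to earlier generated positions. This is a generic illustration of the masking idea, not the VPUniLM model itself.

import numpy as np

def seq2seq_attention_mask(prefix_len, generated_len):
    """Return a 0/1 mask; entry [q, k] is 1 if position q may attend to position k."""
    total = prefix_len + generated_len
    mask = np.zeros((total, total), dtype=int)
    mask[:, :prefix_len] = 1                # every position sees the whole prefix
    for t in range(generated_len):          # generated positions attend causally
        mask[prefix_len + t, prefix_len:prefix_len + t + 1] = 1
    return mask

print(seq2seq_attention_mask(prefix_len=3, generated_len=4))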

It can be understood that the visual features of the target video may be image features extracted from each frame of the target video, or image features obtained by extracting the style of the target video. The form of the specific visual features used to characterize the frame content of the target video is not limited here.

It should be noted that in the above model, besides obtaining the visual features and the description features among the video features, the iteration reference vector from the previous iteration must also be obtained, so that the visual features, the description features, and the previous iteration reference vector serve as inputs to the current iteration, from which the visual latent vector matching the visual features, the description latent vector matching the description features, and the current iteration reference vector are computed. It can be understood that, because each iteration takes the iteration reference latent vector of the previous iteration as an input, the result of each iteration is correlated with the results of earlier iterations; in other words, each iteration result reflects characteristics of the preceding results, so the relationship between the output results can be determined from the order in which they are output.

The counters i and j are explained below. It can be understood that the i-th iteration may determine the j-th candidate character sequence used to generate a target character, with j less than or equal to i; that is, the number of iteration reference vectors obtained may be greater than or equal to the number of candidate character sequences. In a specific implementation, the iteration reference vector obtained in the i-th iteration may be a vector indicating a separator. For example, the iteration reference vector from the first iteration determines the first candidate character sequence; that from the second iteration determines the second candidate character sequence; that from the third iteration determines the first separator; that from the fourth iteration determines the third candidate character sequence; that from the fifth iteration determines the fourth candidate character sequence; and that from the sixth iteration indicates the end separator. Corresponding to these six iterations, four candidate character sequences are determined; four target characters can then be determined from them, and two target tags are obtained by combining the four target characters.

In another implementation, the counting indices $i$ and $j$ may be equal; that is, the number of iterative reference vectors obtained by iterative calculation may equal the number of candidate character sequences. For example, five candidate character sequences are determined corresponding to five iterative reference vectors, five target characters are then determined, and one or more target labels are obtained by splicing these five target characters.

Through the above method of this application, in the process of performing the $i$-th iterative calculation on the video features and the description features, the $i$-th intermediate latent vector set is computed in the vector conversion network based on the visual features among the video features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation; based on the iterative reference vector obtained by the $i$-th iterative calculation, the matching confidence of each character in the target candidate character library is determined, and the $j$-th candidate character sequence is determined according to the confidences, where $j$ is a natural number greater than or equal to 0 and less than or equal to $i$; if the iterative reference vector obtained by the $i$-th iterative calculation has not reached the end condition, the $(i+1)$-th iterative calculation is executed; if it has reached the end condition, the iterative reference vectors obtained by the first $i$ iterative calculations are determined as the N iterative reference vectors, and the corresponding candidate character sequences are determined from them. In this implementation, intermediate latent vectors are produced iteratively and the candidate character sequences are determined from the iterative reference vectors among them, which guarantees the sequential association between the candidate character sequences. Determining the output target characters and the spliced target labels from sequentially associated candidate character sequences achieves the technical effect of outputting highly accurate video labels and solves the technical problem that labels obtained by existing label generation methods have low accuracy.

As an optional implementation, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the matching confidence of each character in the target candidate character library includes: in a label generation network connected to the vector conversion network, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the first confidence with which each reference character of the reference character set matches, and the second confidence with which each description character of the description character set matches.

It can be understood that, in this embodiment, in the above process of computing confidences from the iterative reference vector, the confidence calculation for the reference characters of the reference character set differs from that for the description characters of the description character set. For example, different calculation methods or different calculation parameters can be used, thereby distinguishing the method for computing the first confidence of the reference character set from the method for computing the second confidence of the description character set.

In one optional manner, a higher weight parameter can be configured for the computed initial first confidence corresponding to the reference character set to obtain the first confidence, and a lower weight parameter can be configured for the computed initial second confidence corresponding to the description character set to obtain the second confidence. This raises the confidence of the characters of the reference character set, making it more probable that the output character is a character of the reference character set, thereby achieving the technical effect of constraining the output character types and rules through the reference character set.
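Purely for illustration, a minimal sketch of this re-weighting; the weight values 1.2 and 0.8 are invented for the example and are not taken from the patent.

```python
# Sketch: biasing output toward the reference character set by
# re-weighting the initial confidences. The weights are assumed values.

REF_WEIGHT = 1.2   # higher weight for reference-set characters
DESC_WEIGHT = 0.8  # lower weight for description-set characters

def reweight(initial_ref_conf, initial_desc_conf):
    first_conf = {c: s * REF_WEIGHT for c, s in initial_ref_conf.items()}
    second_conf = {c: s * DESC_WEIGHT for c, s in initial_desc_conf.items()}
    return first_conf, second_conf

first, second = reweight({"亲": 0.5, "情": 0.6}, {"我": 0.5, "爱": 0.6, "你": 0.5})
# Reference characters now score relatively higher, so the output is
# more likely to be drawn from the reference character set.
```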

Through the above embodiment of this application, by determining, in the label generation network connected to the vector conversion network and based on the iterative reference vector obtained by the $i$-th iterative calculation, the first confidence of each reference character of the reference character set and the second confidence of each description character of the description character set, the confidence calculation methods for the reference character set and the description character set are distinguished. The probabilities of outputting characters from the reference character set and the description character set can then be adjusted as needed, achieving the technical effect of constraining the output character types and rules through the reference character set.

As an optional implementation, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the first confidence of each reference character of the reference character set and the second confidence of each description character of the description character set includes:

S1, determining the first weight coefficient and first bias parameter in the label generation network matched to each reference character, the second weight coefficient and second bias parameter matched to each description character, and the reference weight and reference bias parameter;

S2, determining the first confidence of each reference character based on the iterative reference vector obtained by the $i$-th iterative calculation, the first weight coefficient, and the first bias parameter;

S3, obtaining the first intermediate result computed from the $i$-th description latent vector, the second weight coefficient and the second bias parameter, and the second intermediate result computed from the iterative reference vector obtained by the $i$-th iterative calculation, the reference weight and the reference bias parameter;

S4, determining the second confidence of each description character based on the first intermediate result and the second intermediate result.

The above implementation is described concretely below. It should be noted that, in this embodiment, the video description text is the video title.

First, denote the description latent vector by $h_n^{title}$: the $n$-th latent vector, corresponding to the $n$-th character, that the Transformer model outputs for the description text feature (the title, in this embodiment) in one iterative calculation. Denote the reference iterative latent vector by $h_t^{dec}$. It should be noted that, in this embodiment, the iterative calculation involves repeated rounds of input and output; $h_t^{dec}$ denotes the latent vector at the last output position of the $t$-th iterative calculation.

Next, according to the training result of the model, the first weight coefficients and first bias parameters in the label generation network matched to the reference characters, the second weight coefficients and second bias parameters matched to the description characters, and the reference weight and reference bias parameters are determined; that is, the following six parameters are determined:

The first weight coefficient: $W_i^{tag}$, the weight parameter of the $i$-th character in the reference character set;

The first bias parameter: $b_i^{tag}$;

The second weight coefficients: $W_{title}$, $W_{dec}$;

The second bias parameters: $b_{title}$ and $b_{dec}$ (of these, $W_{dec}$ and $b_{dec}$ play the role of the reference weight and reference bias parameter).

Based on the parameters obtained from training: let $h_n^{title}$ denote the model's latent vector output at the title positions (i.e., the description latent vector); at the $t$-th decoding step, let $h_t^{dec}$ denote the latent vector output at the last position of that step (i.e., the reference iterative latent vector); and let $s_{t,i}^{tag}$ denote the score of the $i$-th character of the reference character set at step $t$. Then $s_{t,i}^{tag}$ is computed by the following formula:

$$s_{t,i}^{tag} = W_i^{tag} \cdot h_t^{dec} + b_i^{tag}$$

where $W_i^{tag}$ is the weight parameter and $b_i^{tag}$ the bias parameter of the $i$-th character in the reference character set.

To select characters from the title text, a pointer network module is added to predict the score of selecting a word from the title. Let $h_n^{title}$ be the latent vector corresponding to the $n$-th character of the title and $s_{t,n}^{title}$ the score of the $n$-th character at step $t$; then $s_{t,n}^{title}$ is computed by the following formula:

$$s_{t,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_t^{dec} + b_{dec}\right)$$

where $W_{title}$ and $W_{dec}$ are weight parameters, and $b_{title}$ and $b_{dec}$ are bias parameters.

It should be noted that the first intermediate result above is the term $W_{title} \cdot h_n^{title} + b_{title}$ in the formula, and the second intermediate result is the term $W_{dec} \cdot h_t^{dec} + b_{dec}$. The first matching degree of each reference character is $s_{t,i}^{tag}$, and the second matching degree of each description character is $s_{t,n}^{title}$.
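To make the two score computations concrete, here is a minimal NumPy sketch following the formulas reconstructed above (an affine tag score and a dot-product pointer score). The dimensions, variable names and random parameters are assumptions for illustration, not verified patent code.

```python
import numpy as np

# Sketch of the step-t scores. Shapes are assumptions:
# d = hidden size, V_tag = reference charset size, n_title = title length.
d, V_tag, n_title = 64, 1000, 7
rng = np.random.default_rng(0)

W_tag = rng.normal(size=(V_tag, d))      # first weight coefficients W_i^tag
b_tag = np.zeros(V_tag)                  # first bias parameters b_i^tag
W_title = rng.normal(size=(d, d))        # second weight coefficient
b_title = np.zeros(d)                    # second bias parameter
W_dec = rng.normal(size=(d, d))          # reference weight
b_dec = np.zeros(d)                      # reference bias parameter

h_title = rng.normal(size=(n_title, d))  # description latent vectors h_n^title
h_dec = rng.normal(size=d)               # reference iterative latent vector h_t^dec

# First confidence: s_{t,i}^{tag} = W_i^{tag} h_t^{dec} + b_i^{tag}
s_tag = W_tag @ h_dec + b_tag                        # shape (V_tag,)

# Second confidence via the pointer module:
#   first intermediate result:  W_title h_n^title + b_title
#   second intermediate result: W_dec   h_t^dec   + b_dec
inter1 = h_title @ W_title.T + b_title               # shape (n_title, d)
inter2 = W_dec @ h_dec + b_dec                       # shape (d,)
s_title = inter1 @ inter2                            # shape (n_title,)
```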

Through the above implementation of this application, the first weight coefficient and first bias parameter in the label generation network matched to each reference character, the second weight coefficient and second bias parameter matched to each description character, and the reference weight and reference bias parameter are determined; the first confidence of each reference character is determined based on the iterative reference vector obtained by the $i$-th iterative calculation, the first weight coefficient and the first bias parameter; the first intermediate result computed from the $i$-th description latent vector, the second weight coefficient and the second bias parameter, and the second intermediate result computed from the iterative reference vector obtained by the $i$-th iterative calculation, the reference weight and the reference bias parameter, are obtained; and the second confidence of each description character is determined based on the first intermediate result and the second intermediate result. In this way the score of every candidate character in the reference character set and of every description character in the description character set is determined precisely, and thus the score of every candidate character in the current iterative calculation step, which improves the accuracy of label generation.

As an optional implementation, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the matching confidence of each character in the target candidate character library and determining the $j$-th candidate character sequence according to the confidences includes: performing a weighted summation over the obtained multiple first confidences and multiple second confidences through the fully connected layer in the label generation network to obtain the matching confidence of each character in the target candidate character library, so as to generate the $j$-th candidate character sequence.

It can be understood that the $t$-th iterative calculation yields the first matching degree $s_{t,i}^{tag}$ of each reference character and the second matching degree $s_{t,n}^{title}$ of each description character. In one implementation, different weights can be configured for the first matching degree and the second matching degree, and the confidence of each candidate character is determined solely from the result of the weighted summation.

In another implementation, in the multiple label extraction steps, the first matching degree $s_{t,i}^{tag}$ of each reference character and the second matching degree $s_{t,n}^{title}$ of each description character obtained at the $t$-th iterative calculation are configured with the same weight, and the resulting scores together with their corresponding characters are combined to obtain the candidate character sequence directly, as sketched below.
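A sketch of the merge step under the equal-weight variant; the softmax normalization used to turn summed scores into confidences is an assumption, since the patent only states that a weighted summation is performed over the fully connected layer's outputs.

```python
import numpy as np

def merge_and_pick(s_tag, s_title, ref_chars, desc_chars,
                   w_tag=1.0, w_title=1.0):
    """Weighted concatenation of both score sets, then argmax.

    Sketch only: the weights and the softmax normalization are
    assumptions for illustration.
    """
    scores = np.concatenate([w_tag * s_tag, w_title * s_title])
    conf = np.exp(scores - scores.max())
    conf /= conf.sum()                      # confidence of every candidate
    chars = list(ref_chars) + list(desc_chars)
    best = int(np.argmax(conf))
    return chars[best], conf[best]

char, p = merge_and_pick(np.array([0.4, 0.8]), np.array([0.7, 0.9, 0.76]),
                         ["亲", "情"], ["我", "爱", "你"])
# -> ('爱', ...) since the title score of '爱' is highest here
```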

Through the above implementation of this application, a weighted summation is performed over the obtained multiple first confidences and multiple second confidences through the fully connected layer in the label generation network, giving the matching confidence of each character in the target candidate character library, from which the $j$-th candidate character sequence is generated. Multiple first matching degrees and second matching degrees are thus determined through the multi-step iterative calculation, and multiple candidate sequences, together with the target characters corresponding to them, are determined based on these matching degrees, which improves the accuracy of the generated labels.

As an optional implementation, before acquiring the target video to be identified, the method further includes:

S1, acquiring a sample video and a sample label matching the sample video, where the sample video carries sample video description text;

S2, constructing a sample candidate character library of the sample video using the sample video description text, where the sample candidate character library includes a reference character set and a sample description character set, the reference character set includes reference characters obtained by segmenting multiple historical labels, and the sample description character set includes description characters obtained by segmenting the sample video description text;

S3, training the initialized label generation network using the sample video, the sample label and the sample candidate character library until the training convergence condition is reached.

The implementation of the training phase is described below. In the training phase, the labels corresponding to each video are known; for example, a video carries the two labels '游戏' ('game') and '城市英雄' ('City Hero'), which are joined with the symbol ';' to form the ground truth. The visual features of the video, the category embedding vector, the word embedding vectors of the title, and the word embedding vectors of the ground truth are input together; in the prediction phase, the visual features, the category embedding vector, the word embedding vectors of the title and the already-predicted word embedding vectors are taken as input. The model uses the encoder operation to compute the latent vectors at the positions of the visual features and the original title word embeddings, and the decoder operation to compute the latent vectors at the ground-truth positions. The latent vectors pass through one fully connected layer to compute the probability distribution over the dictionary. The loss function is cross entropy, where $n$ denotes the length of the predicted text and $V$ denotes the dictionary:

$$\mathcal{L} = -\sum_{t=1}^{n} \log p\left(y_t \mid y_{<t}\right), \qquad y_t \in V$$
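For illustration, a minimal sketch of constructing the ground truth and evaluating the loss above; the tags '游戏' and '城市英雄' come from the training description, while the tiny dictionary and the uniform stand-in probabilities are assumptions.

```python
import numpy as np

# Sketch of ground-truth construction and the cross-entropy loss above.
tags = ["游戏", "城市英雄"]
target_text = ";".join(tags)              # ground truth: "游戏;城市英雄"

vocab = ["游", "戏", ";", "城", "市", "英", "雄", "<e>"]   # dictionary V
target_ids = [vocab.index(c) for c in target_text] + [vocab.index("<e>")]
n = len(target_ids)                       # length of the predicted text

def cross_entropy(probs, targets):
    """L = -sum_{t=1}^{n} log p(y_t | y_{<t}); probs[t] is the model's
    step-t distribution over V."""
    return -sum(np.log(probs[t, y]) for t, y in enumerate(targets))

# Stand-in for the fully connected layer's softmax output (uniform here).
probs = np.full((n, len(vocab)), 1.0 / len(vocab))
loss = cross_entropy(probs, target_ids)
```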

When the loss function shown above reaches the convergence condition, training of the following parameters is determined to be complete:

The first weight coefficient: $W_i^{tag}$, the weight parameter of the $i$-th character in the reference character set;

The first bias parameter: $b_i^{tag}$;

The second weight coefficients: $W_{title}$, $W_{dec}$;

The second bias parameters: $b_{title}$, $b_{dec}$.

The training process of another model, UniLM (Unified Language Model, a unified pre-trained language model), is described below.

Through the above method of this application, a sample video and a sample label matching the sample video are acquired, where the sample video carries sample video description text; a sample candidate character library of the sample video is constructed using the sample video description text, where the sample candidate character library includes a reference character set and a sample description character set, the reference character set includes reference characters obtained by segmenting multiple historical labels, and the sample description character set includes description characters obtained by segmenting the sample video description text; and the initialized label generation network is trained using the sample video, the sample label and the sample candidate character library until the training convergence condition is reached. The label generation network is thus obtained in a supervised manner, which improves the reliability of the target labels it produces and solves the technical problem that labels obtained by existing label generation methods have low accuracy.

As an optional manner, in the process of performing the $i$-th iterative calculation on the video features and the description features, computing the $i$-th intermediate latent vector set in the vector conversion network based on the visual features among the video features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation includes:

when $i$ equals 1, converting the visual features and the description features into the word embedding vector corresponding to the target video; extracting the category embedding vector, paragraph embedding vector and position embedding vector corresponding to the target video, where the category embedding vector indicates the category information of the target video, the paragraph embedding vector includes a first paragraph sub-vector derived from the visual features and a second paragraph sub-vector derived from the description features, and the position embedding vector indicates the frame order of the video frames in the target video and the order of the text tokens in the video description text; and taking the word embedding vector, category embedding vector, paragraph embedding vector and position embedding vector as the input vectors of the vector conversion network to compute the $i$-th intermediate latent vector set;

when $i$ is greater than 1, taking the word embedding vector, category embedding vector, paragraph embedding vector, position embedding vector, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation as the input vectors of the vector conversion network to compute the $i$-th intermediate latent vector set.
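A minimal sketch of assembling the four embeddings into the input of the vector conversion network. The hidden size, the embedding tables and the position-wise summation (in the style of BERT-like models) are assumptions; the patent only states that the four vectors are taken as the input.

```python
import numpy as np

# Sketch: assembling the i = 1 input from the four embedding vectors,
# using the FIG. 4 example sequence (F, F, F, CLS, 我, 爱, 你).
d = 64
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(7, d))           # word embedding vectors
seg_ids = np.array([0, 0, 0, 1, 1, 1, 1])    # paragraph embedding ids
pos_ids = np.array([0, 1, 2, 0, 1, 2, 3])    # position embedding ids
cat_ids = np.array([5, 5, 5, 5, 5, 5, 5])    # category embedding ids

seg_table = rng.normal(size=(2, d))
pos_table = rng.normal(size=(8, d))
cat_table = rng.normal(size=(16, d))

inputs = (word_emb + seg_table[seg_ids]
          + pos_table[pos_ids] + cat_table[cat_ids])   # shape (7, d)
# `inputs` is what the Transformer (vector conversion network) consumes;
# for i > 1, the embedding of the previously generated character is appended.
```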

As an optional implementation, computing the $i$-th intermediate latent vector set in the vector conversion network based on the visual features among the video features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation includes:

S1, determining the $i$-th visual latent vector and the $i$-th description latent vector based on the visual features and the description features, where the $i$-th visual latent vector and the $i$-th description latent vector indicate the contextual relationship of characters in the target video label;

S2, determining the iterative reference vector of the $i$-th iterative calculation based on the visual features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation.

The category embedding vector of the target video indicates the general category of the target video. For example, if the target video is about a sports meet, its category feature can characterize the video as 'sports', with a corresponding category embedding vector of (0, 0, 0, 0, 1); if the target video is a song MV, its category feature can characterize it as 'music', with a corresponding category embedding vector of (0, 0, 0, 0, 2); if the target video is the documentary 'Chinese Food', its category feature can characterize it as both 'food' and 'documentary', with a corresponding category embedding vector of (0, 0, 0, 0, 3). The above is only exemplary and does not limit the concrete form of the category embedding vector in this embodiment.

The above method is described below with reference to FIG. 4.

As shown in FIG. 4, before the iterative calculation, the vector combination input into the vector conversion network includes: the position embedding vector, the paragraph embedding vector, the word embedding vector, and the category embedding vector.

For the category feature of the video, in this embodiment each video category can be mapped to a vector group, so that the category feature is represented by that group. In the model diagram of FIG. 4, the converted category embedding vector is (5, 5, 5, 5, 5, 5, 5).

Next, the visual features and the description features are converted into the word embedding vector. Frames are first extracted from the target video at a rate of 1 fps, and the extracted frames are fed into a CLIP model to obtain the visual feature vectors, corresponding to the video vectors (F, F, F) in FIG. 4. The video description text is converted into vectors, corresponding to the title vectors (我, 爱, 你) ('I', 'love', 'you') in FIG. 4, and the connection vector CLS is inserted, forming the word embedding vector (F, F, F, CLS, 我, 爱, 你).
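A hedged sketch of this frame-feature step using the open-source CLIP package; the model variant "ViT-B/32" and the OpenCV-based sampling are assumptions, as the patent only states that frames extracted at 1 fps are fed into a CLIP model.

```python
import cv2
import torch
import clip                     # https://github.com/openai/CLIP
from PIL import Image

# Sketch: sample frames at 1 fps and encode them with CLIP.
model, preprocess = clip.load("ViT-B/32", device="cpu")

def frame_features(video_path: str) -> torch.Tensor:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps)))          # keep one frame per second
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            image = preprocess(Image.fromarray(rgb)).unsqueeze(0)
            with torch.no_grad():
                feats.append(model.encode_image(image))  # one "F" vector
        idx += 1
    cap.release()
    return torch.cat(feats) if feats else torch.empty(0)
```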

At the same time, as shown in FIG. 4, the paragraph embedding vector and the position embedding vector of the video are also obtained. The paragraph embedding vector is (0, 0, 0, 1, 1, 1, 1); it distinguishes the video-vector part from the title-vector part of the word embedding vector, where (0, 0, 0) corresponds to the video vectors (F, F, F) and the 1s correspond to CLS and the title vectors (我, 爱, 你). The position embedding vector (0, 1, 2, 0, 1, 2, 3) characterizes the order of the vectors input to the model: the (0, 1, 2) part identifies the frame order of the video vectors (F, F, F), and the (1, 2, 3) part identifies the order of the three text tokens of the title vector (我, 爱, 你).

As shown in FIG. 4, after the position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector are input into the Transformer module for the first time, the intermediate latent vector set is obtained: the visual latent vectors, the description latent vectors $h_n^{title}$, and the reference iterative latent vector $h_1^{dec}$ (the last not shown in the figure).

It can be understood that feeding this intermediate latent vector set into the pointer network module yields the score of every character in the reference character set and in the description character set.

Through the above method of this application, when $i$ equals 1, the visual features and the description features are converted into the word embedding vector corresponding to the target video; the category embedding vector, paragraph embedding vector and position embedding vector corresponding to the target video are extracted, where the category embedding vector indicates the category information of the target video, the paragraph embedding vector includes a first paragraph sub-vector derived from the visual features and a second paragraph sub-vector derived from the description features, and the position embedding vector indicates the frame order of the video frames in the target video and the order of the text tokens in the video description text; and the word embedding vector, category embedding vector, paragraph embedding vector and position embedding vector are taken as the input vectors of the vector conversion network to compute the $i$-th intermediate latent vector set. The confidence value of each candidate character is thereby obtained through the pointer network module and the Transformer module, which solves the technical problem that labels obtained by existing label generation methods have low accuracy.

As an optional implementation, determining the target character from each candidate character sequence based on the confidences includes: according to the contextual relationship of the characters in the target video label, taking the target character determined from the $j$-th candidate character sequence as the $j$-th target character in the target video label.

It can be understood that, in the model prediction stage, the scores $s_{t,i}^{tag}$ and $s_{t,n}^{title}$ obtained by the above method are concatenated, and the character with the highest score is selected; this is the $t$-th candidate character sequence obtained at the current step, from which the $t$-th target character is output.

The above method is further described with reference to FIG. 4.

As shown in FIG. 4, the first iterative calculation step proceeds as follows:

After the position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector shown in FIG. 4 are input into the Transformer module for the first time, the intermediate latent vector set output is obtained: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_1^{dec}$ (not shown in the figure).

Based on $h_n^{title}$, $h_1^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th character of the reference character set and the score of the $n$-th character of the description character set in the first iterative calculation step are computed by:

$$s_{1,i}^{tag} = W_i^{tag} \cdot h_1^{dec} + b_i^{tag}$$

$$s_{1,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_1^{dec} + b_{dec}\right)$$

where $s_{1,i}^{tag}$ denotes the score of the $i$-th character of the reference character set at step 1, and $s_{1,n}^{title}$ the score of the $n$-th character of the description character set at step 1. The scores $s_{1,i}^{tag}$ and $s_{1,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 1. As shown in FIG. 4, the output word of the first step is '爱' ('love');

The second parsing step is as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with $h_1^{dec}$ obtained in the first step, are input into the Transformer module, yielding the intermediate latent vector set: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_2^{dec}$.

Based on $h_n^{title}$, $h_2^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th character of the reference character set and the score of the $n$-th character of the description character set in the second iterative calculation step are computed by:

$$s_{2,i}^{tag} = W_i^{tag} \cdot h_2^{dec} + b_i^{tag}$$

$$s_{2,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_2^{dec} + b_{dec}\right)$$

where $s_{2,i}^{tag}$ denotes the score of the $i$-th character of the reference character set at step 2, and $s_{2,n}^{title}$ the score of the $n$-th character of the description character set at step 2. The scores $s_{2,i}^{tag}$ and $s_{2,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 2. As shown in FIG. 4, the output word of the second step is '情';

The ending step is as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with the reference latent vector $h_2^{dec}$ obtained in the preceding step, are input into the Transformer module, yielding the intermediate latent vector set, among which $h_3^{dec}$ matches the preset end identification vector; it is thus determined that the iteration steps are finished and no new characters are generated.

The output word '爱' of the first step and the output word '情' of the second step are then spliced, and the target label of the target video titled '我爱你' ('I love you') is obtained as '爱情' ('love').

Through the above implementation of this application, with iterative calculation and character-by-character output, since the target candidate character library used for outputting characters in the embodiments of this application incorporates the description character set, and the description character set includes fresh text content related to the video, fresh characters can be output in combination with the description character set and fresh labels can be spliced from the output fresh characters. This improves the accuracy of the output video labels and solves the technical problem that video labels obtained by existing video label determination methods have relatively low accuracy.

As an optional implementation, after splicing the M target characters into the target video label matching the target video, the method further includes: adjusting the order of the M target characters in the target video label based on contextual semantic relationships to obtain an updated target video label.

In one optional manner, suppose multiple iterations are performed according to the above method and the initially output M characters are, in order, '城', '市', '雄', '英', '游', '戏', '游', '戏', '好'. By semantic analysis of these characters, the order of '雄' and '英' can be adjusted to obtain the term '英雄' ('hero'), and the order of '游', '戏', '好' can be adjusted to obtain the term '好游戏' ('good game'). The final order of the updated M characters is thus determined as: '城', '市', '英', '雄', '游', '戏', '好', '游', '戏', and the target labels determined from this character order are: '城市英雄' ('City Hero'), '游戏' ('game'), '好游戏' ('good game').

The manner of ordering the output characters is described below. As shown in FIG. 5, the Transformer module can use an attention matrix for character input and output. Each row of the matrix corresponds to an output and each column to an input, so the attention matrix expresses the dependence of outputs on inputs. If white squares stand for 0 and black squares for 1, the first row indicates that the output character $x_1$ can attend only to the start marker <s>, the second row indicates that the output character $x_2$ can attend only to <s> and $x_1$, and so on. In other words, by introducing a lower-triangular mask into the Transformer's attention matrix and shifting the inputs and outputs by one position during training, a unidirectional language model is obtained, which outputs characters in a predetermined order.
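A minimal sketch of the lower-triangular mask described here, where row $t$ is an output position, column $j$ an input position, and 1 marks a visible entry:

```python
import numpy as np

# Lower-triangular attention mask for a unidirectional language model.
# Row t may attend only to the start marker and the first t characters.
def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=int))

print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```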

Optionally, an out-of-order language model is described.

Like an ordinary language model, the out-of-order language model performs a conditional probability decomposition, but the decomposition order of the out-of-order language model is random, as shown below:

$$\begin{aligned}
p(x_1, x_2, x_3, \ldots, x_n) &= p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\cdots p(x_n \mid x_1, x_2, \ldots, x_{n-1})\\
&= p(x_3)\,p(x_1 \mid x_3)\,p(x_2 \mid x_3, x_1)\cdots p(x_n \mid x_3, x_1, \ldots, x_{n-1})\\
&= \cdots\\
&= p(x_{n-1})\,p(x_1 \mid x_{n-1})\,p(x_n \mid x_{n-1}, x_1)\cdots p(x_2 \mid x_{n-1}, x_1, \ldots, x_3)
\end{aligned}$$

From the above, any 'appearance order' of the characters $x_1, x_2, \ldots, x_n$ is possible. In principle each order corresponds to one model, so in principle there are $n!$ language models; with a Transformer-based model, however, all of these orders can be realized within a single model.

The implementation is described below with reference to FIG. 6. Taking the generation of '北京欢迎你' ('Beijing welcomes you') as an example, suppose the required generation order is '<s>→迎→京→你→欢→北→<e>'. Character output in that order can then be realized through the masked attention matrix of FIG. 6. As shown in FIG. 6, the fourth row has only one black square, meaning '迎' can attend only to the start marker <s>, while the second row has two black squares, meaning '京' can attend only to <s> and '迎', and so on. Intuitively, this 'shuffles' the lower-triangular mask of the unidirectional language model.

That is, implementing a language model with a particular order amounts to shuffling the original lower-triangular mask in a particular way. Because attention provides such an $n \times n$ attention matrix, the Transformer module in this embodiment has enough degrees of freedom to mask the character matrix in different ways, achieving the effect of diverse character output orders.
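A sketch of 'shuffling' the mask to realize a chosen generation order; the token indexing and the omission of the <s> column are simplifications for illustration.

```python
import numpy as np

# Sketch: mask for generating tokens in an arbitrary order.
# order[k] = index of the token generated at step k; each output row
# may attend only to tokens generated at earlier or equal steps
# (the <s> column is omitted here for brevity).
def permuted_mask(order):
    n = len(order)
    step_of = {tok: k for k, tok in enumerate(order)}
    mask = np.zeros((n, n), dtype=int)
    for i in range(n):          # output token i
        for j in range(n):      # input token j
            if step_of[j] <= step_of[i]:
                mask[i, j] = 1
    return mask

# "北京欢迎你" with generation order 迎→京→你→欢→北
# token indices: 北=0, 京=1, 欢=2, 迎=3, 你=4
print(permuted_mask([3, 1, 4, 2, 0]))
```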

The complete flow of the video label generation method provided by this application is described below with reference to FIG. 7:

S702, constructing the target candidate character library;

This implementation is described taking the video description text to be the video title. Suppose the video title is '我爱你' ('I love you') and the reference character library includes the characters '亲' and '情'; combining the video title with the characters of the reference character library yields the target candidate character library: '我', '爱', '你', '亲', '情'.

S704, extracting the video vectors;

Specifically, as shown in FIG. 4, the vector combination input into the vector conversion network includes: the position embedding vector, the paragraph embedding vector, the word embedding vector, and the category embedding vector.

For the category feature of the video, in this embodiment each video category can be mapped to a vector group, so that the category feature is represented by that group. In the model diagram of FIG. 4, the converted category embedding vector is (5, 5, 5, 5, 5, 5, 5).

Next, the visual features and the description features are converted into the word embedding vector. Frames are first extracted from the target video at a rate of 1 fps, and the extracted frames are fed into the CLIP model to obtain the visual feature vectors, corresponding to the video vectors (F, F, F) in FIG. 4. The video description text is converted into vectors, corresponding to the title vectors (我, 爱, 你) in FIG. 4, and the connection vector CLS is inserted, forming the word embedding vector (F, F, F, CLS, 我, 爱, 你).

At the same time, as shown in FIG. 4, the paragraph embedding vector and the position embedding vector of the video are also obtained. The paragraph embedding vector is (0, 0, 0, 1, 1, 1, 1); it distinguishes the video-vector part from the title-vector part of the word embedding vector, where (0, 0, 0) corresponds to the video vectors (F, F, F) and the 1s correspond to CLS and the title vectors (我, 爱, 你). The position embedding vector (0, 1, 2, 0, 1, 2, 3) characterizes the order of the vectors input to the model: the (0, 1, 2) part identifies the frame order of the video vectors (F, F, F), and the (1, 2, 3) part identifies the order of the three text tokens of the title vector (我, 爱, 你).

S706, inputting the extracted video vectors into the Transformer model to obtain the output intermediate vectors;

Specifically, after the position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector shown in FIG. 4 are input into the Transformer module for the first time, the intermediate latent vector outputs are obtained: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_1^{dec}$ (not shown in the figure).

S708, judging whether the intermediate vectors include the end vector;

That is, it is judged whether $h_1^{dec}$ is the end vector; if $h_1^{dec}$ is not the end vector, the subsequent steps are executed.

S710, parsing and computing the intermediate vectors based on the trained parameter set to obtain the score of each candidate character at step $t$;

S712, judging whether the current candidate character is the highest-scoring candidate character;

S714, determining the highest-scoring candidate character as the output character of this step;

Specifically, based on $h_n^{title}$, $h_1^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th word of the historical tag vocabulary and the score of the $n$-th word of the title vocabulary in the first parsing step are computed by:

$$s_{1,i}^{tag} = W_i^{tag} \cdot h_1^{dec} + b_i^{tag}$$

$$s_{1,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_1^{dec} + b_{dec}\right)$$

where $s_{1,i}^{tag}$ denotes the score of the $i$-th word of the historical tag vocabulary at step 1 and $s_{1,n}^{title}$ the score of the $n$-th word of the video title at step 1. The scores $s_{1,i}^{tag}$ and $s_{1,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 1. Suppose the resulting first candidate character sequence is: '我' (70%), '爱' (90%), '你' (76%), '亲' (40%), '情' (80%); the character with the highest confidence in this first candidate character sequence is '爱', so, as shown in FIG. 4, the output character of the first step is '爱'.

Since $h_1^{dec}$ is not the end vector, the second iterative calculation step is then executed as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with $h_1^{dec}$ obtained in the first step, are input into the Transformer module, yielding the intermediate latent vector outputs: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_2^{dec}$.

Given the judgment that $h_2^{dec}$ is likewise not the end vector, and based on $h_n^{title}$, $h_2^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th word of the historical tag vocabulary and the score of the $n$-th word of the title vocabulary in the second parsing step are computed by:

$$s_{2,i}^{tag} = W_i^{tag} \cdot h_2^{dec} + b_i^{tag}$$

$$s_{2,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_2^{dec} + b_{dec}\right)$$

where $s_{2,i}^{tag}$ denotes the score of the $i$-th word of the historical tag vocabulary at step 2 and $s_{2,n}^{title}$ the score of the $n$-th word of the video title at step 2. The scores $s_{2,i}^{tag}$ and $s_{2,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 2. Suppose the resulting second candidate character sequence is: '我' (60%), '爱' (20%), '你' (60%), '亲' (40%), '情' (80%); the character with the highest confidence in this second candidate character sequence is '情', so, as shown in FIG. 4, the output word of the second step is '情';

Since $h_2^{dec}$ is not the end vector, the third parsing step is then executed as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with the reference latent vector $h_2^{dec}$ obtained in the preceding step, are input into the Transformer module, yielding the intermediate latent vector outputs, among which $h_3^{dec}$ matches the preset end identification vector; it is thus determined that the parsing steps are finished and no further labels are generated.

S716, outputting the output characters determined at each step and splicing them to generate the video label.

Finally, the output word '爱' of the first step and the output word '情' of the second step are spliced, and the label of the target video titled '我爱你' ('I love you') is obtained as '爱情' ('love').

In the above embodiments of the present invention, a target video to be identified is acquired; a target candidate character library of the target video is constructed using the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical label texts, and the description character set includes description characters obtained by segmenting the video description text; once the video features of the target video and the description features of the video description text have been extracted, N iterative calculations are performed on the video features and the description features to obtain M candidate character sequences; target characters are determined from each candidate character sequence based on the confidences, giving M target characters; and the M target characters are spliced into the target video label matching the target video. That is, in the embodiments of this application, by iteratively computing over the video features and the description text features, target characters are output character by character from the candidate library composed of the reference character set and the description character set built from the video description text, and the output target characters then form the target video label. Since the target candidate character library used for outputting characters incorporates the description character set, which includes fresh text content related to the video, fresh characters can be output in combination with the description character set and fresh labels spliced from the output fresh characters, improving the accuracy of the output video labels and solving the technical problem that video labels obtained by existing video label determination methods have relatively low accuracy.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

According to another aspect of the embodiments of the present invention, an apparatus for generating a video tag, which implements the above method for generating a video tag, is also provided. As shown in FIG. 8, the apparatus includes:

an acquisition unit 802, configured to acquire a target video to be identified, where the target video carries a video description text describing the target video;

a construction unit 804, configured to construct a target candidate word library of the target video using the video description text, where the target candidate word library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting a plurality of historical tag texts, and the description character set includes description characters obtained by segmenting the video description text;

a calculation unit 806, configured to perform N iterative computations on the video features and the description features after the video features of the target video and the description features of the video description text are extracted, to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;

a determination unit 808, configured to determine a target character from each candidate character sequence based on the confidence, to obtain M target characters; and

a splicing unit 810, configured to splice the M target characters into a target video tag matching the target video.
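How the five units of FIG. 8 might be wired together is sketched below; the class, the injected callables and their signatures are invented for illustration, with the unit internals elided:

    class VideoTagGenerator:
        # Schematic mirror of units 802-810 in FIG. 8 (not the claimed code).
        def __init__(self, acquire, construct, calculate, determine, splice):
            self.acquire = acquire      # unit 802: video + description text
            self.construct = construct  # unit 804: candidate word library
            self.calculate = calculate  # unit 806: N iterations -> M sequences
            self.determine = determine  # unit 808: target char per sequence
            self.splice = splice        # unit 810: join chars into the tag

        def run(self, video_id):
            video, description = self.acquire(video_id)
            library = self.construct(description)
            sequences = self.calculate(video, description, library)
            chars = self.determine(sequences)
            return self.splice(chars)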

Optionally, in this embodiment, for the implementation of each of the above unit modules, reference may be made to the corresponding method embodiments above; details are not repeated here.

According to yet another aspect of the embodiments of the present invention, an electronic device for implementing the above method for generating a video tag is also provided. The electronic device may be the terminal device or the server shown in FIG. 9; this embodiment takes a terminal device as an example. As shown in FIG. 9, the electronic device includes a memory 902 and a processor 904. A computer program is stored in the memory 902, and the processor 904 is configured to execute the steps in any of the above method embodiments through the computer program.

Optionally, in this embodiment, the above electronic device may be located in at least one of multiple network devices of a computer network.

Optionally, in this embodiment, the above processor may be configured to execute the following steps through the computer program:

S1: acquire a target video to be identified, where the target video carries a video description text describing the target video;

S2: construct a target candidate word library of the target video using the video description text, where the target candidate word library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting a plurality of historical tag texts, and the description character set includes description characters obtained by segmenting the video description text;

S3: after the video features of the target video and the description features of the video description text are extracted, perform N iterative computations on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;

S4: determine a target character from each candidate character sequence based on the confidence, to obtain M target characters; and

S5: splice the M target characters into a target video tag matching the target video.
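Steps S4 and S5 reduce to an argmax over each candidate character sequence followed by splicing. A minimal sketch, assuming each sequence is given as (character, confidence) pairs:

    def select_and_splice(candidate_sequences):
        # S4: highest-confidence character from each of the M sequences.
        target_chars = [max(seq, key=lambda pair: pair[1])[0]
                        for seq in candidate_sequences]
        # S5: splice the M target characters into the video tag.
        return "".join(target_chars)

    tag = select_and_splice([
        [("爱", 0.92), ("搞", 0.05), ("我", 0.03)],
        [("情", 0.88), ("笑", 0.07), ("你", 0.05)],
    ])  # -> "爱情"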

Optionally, those of ordinary skill in the art can understand that the structure shown in FIG. 9 is only illustrative. The electronic device may also be a terminal device such as a vehicle-mounted terminal, a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 9 does not limit the structure of the above electronic device. For example, the electronic device may further include more or fewer components than those shown in FIG. 9 (such as a network interface), or have a configuration different from that shown in FIG. 9.

The memory 902 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for generating a video tag in the embodiments of the present invention. The processor 904 runs the software programs and modules stored in the memory 902, thereby executing various functional applications and data processing, that is, implementing the above method for generating a video tag. The memory 902 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory. In some examples, the memory 902 may further include memories located remotely from the processor 904, and these remote memories may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. The memory 902 may specifically be, but is not limited to being, used to store information such as the elements in the viewing-angle picture and the generation information of the video tag. As an example, as shown in FIG. 9, the memory 902 may include, but is not limited to, the acquisition unit 802, the construction unit 804, the calculation unit 806, the determination unit 808, and the splicing unit 810 of the above apparatus for generating a video tag. In addition, it may also include, but is not limited to, other module units of the above apparatus for generating a video tag, which are not repeated in this example.

Optionally, the above transmission apparatus 906 is configured to receive or send data via a network. Specific examples of the above network may include wired networks and wireless networks. In one example, the transmission apparatus 906 includes a Network Interface Controller (NIC), which may be connected to other network devices and routers through a network cable so as to communicate with the Internet or a local area network. In another example, the transmission apparatus 906 is a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.

In addition, the above electronic device further includes a display 908 and a connection bus 910 for connecting the module components of the above electronic device.

In other embodiments, the above terminal device or server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the multiple nodes through network communication. The nodes may form a peer-to-peer (P2P) network, and any form of computing device, such as a server, a terminal, or another electronic device, may become a node in the blockchain system by joining the peer-to-peer network.

According to one aspect of the present application, a computer program product is provided. The computer program product includes a computer program/instructions, and the computer program/instructions contain program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication part, and/or installed from a removable medium. When the computer program is executed by a central processing unit, the various functions provided by the embodiments of the present application are executed.

The serial numbers of the above embodiments of the present invention are for description only, and do not represent the advantages or disadvantages of the embodiments.

According to one aspect of the present application, a computer-readable storage medium is provided. A processor of a computer device reads computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above method for generating a video tag.

Optionally, in this embodiment, the above computer-readable storage medium may be configured to store a computer program for executing the following steps:

S1: acquire a target video to be identified, where the target video carries a video description text describing the target video;

S2: construct a target candidate word library of the target video using the video description text, where the target candidate word library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting a plurality of historical tag texts, and the description character set includes description characters obtained by segmenting the video description text;

S3: after the video features of the target video and the description features of the video description text are extracted, perform N iterative computations on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;

S4: determine a target character from each candidate character sequence based on the confidence, to obtain M target characters; and

S5: splice the M target characters into a target video tag matching the target video.

Optionally, in this embodiment, those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments may be completed by instructing hardware related to the terminal device through a program. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

If the integrated units in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the above methods of the various embodiments of the present invention.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The apparatus embodiments described above are only illustrative. For example, the division of the above units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.

The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware, or in the form of software functional units.

The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A method for generating a video tag, comprising:
acquiring a target video to be identified, wherein the target video carries a video description text for describing the target video;
constructing a target candidate word library of the target video by using the video description text, wherein the target candidate word library comprises a reference character set and a description character set, the reference character set comprises reference characters obtained by segmenting a plurality of historical label texts, and the description character set comprises description characters obtained by segmenting the video description text;
under the condition that the video features of the target video and the description features of the video description text are extracted, performing N iterative computations on the video features and the description features to obtain M candidate character sequences, wherein the candidate character objects in each candidate character sequence comprise each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence is used for indicating the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;
determining target characters from each candidate character sequence based on the confidence degrees to obtain M target characters;
and splicing the M target characters into target video labels matched with the target videos.
2. The method of claim 1, wherein the performing N iterative computations on the video feature and the description feature to obtain M candidate character sequences comprises:
in the process of performing ith iterative computation on the video features and the description features, an ith intermediate hidden vector set is computed in a vector conversion network based on the visual features in the video features, the description features and iterative reference vectors obtained by the (i-1) th iterative computation, wherein the ith intermediate hidden vector set comprises an ith visual hidden vector matched with the visual features, an ith description hidden vector matched with the description features and iterative reference vectors obtained by the ith iterative computation, and i is a natural number which is greater than or equal to 1 and less than or equal to N;
determining, based on the iterative reference vector obtained by the ith iterative computation, the confidence with which each character in the target candidate word library matches, and determining a jth candidate character sequence according to the confidence, wherein j is a natural number greater than or equal to 0 and less than or equal to i;
executing the (i + 1) th iterative computation under the condition that the iterative reference vector obtained by the (i) th iterative computation does not reach the end condition;
and under the condition that the iteration reference vector obtained by the ith iteration calculation reaches an end condition, determining the iteration reference vector obtained by the previous i iteration calculations as N iteration reference vectors.
3. The method of claim 2, wherein the determining the confidence of each match of each character in the target candidate word library based on the iterative reference vector calculated by the ith iteration comprises:
and in a label generation network connected with the vector conversion network, determining a first confidence coefficient of each matching of each reference character in the reference character set and a second confidence coefficient of each matching of each description character in the description character set based on the iterative reference vector obtained by the ith iterative calculation.
4. The method of claim 3, wherein the determining a first confidence that each reference character in the reference character set matches and a second confidence that each description character in the description character set matches based on the iterative reference vector calculated in the ith iteration comprises:
determining a first weight coefficient and a first bias parameter which are respectively matched with each reference character in the label generation network, a second weight coefficient and a second bias parameter which are respectively matched with each description character, and a reference weight and a reference bias parameter;
determining the first confidence of each reference character based on the iterative reference vector, the first weight coefficient and the first bias parameter obtained by the ith iterative calculation;
obtaining a first intermediate result obtained by calculation based on the ith description hidden vector, the second weight coefficient and the second bias parameter, and a second intermediate result obtained by calculation based on the iteration reference vector obtained by the ith iteration calculation, the reference weight and the reference bias parameter;
determining the second confidence of each of the description characters based on the first intermediate result and the second intermediate result (one possible formulation of these confidences is sketched after the claims).
5. The method according to claim 3, wherein the determining a confidence level of each matching of each character in the target candidate word library based on the iterative reference vector calculated by the ith iteration, and determining the jth candidate character sequence according to the confidence level comprises:
and performing a weighted summation on the obtained plurality of first confidences and the plurality of second confidences through a fully connected layer in the label generation network, to obtain the confidence with which each character in the target candidate word library matches, so as to generate the jth candidate character sequence.
6. The method of claim 3, further comprising, prior to obtaining the target video to be identified:
acquiring a sample video and a sample label matched with the sample video, wherein the sample video carries a sample video description text;
constructing a sample candidate word library of the sample video by using the sample video description text, wherein the sample candidate word library comprises a reference character set and a sample description character set, the reference character set comprises reference characters obtained by segmenting a plurality of historical labels, and the sample description character set comprises description characters obtained by segmenting the sample video description text;
and training the initialized label generation network by using the sample video, the sample label and the sample candidate word stock until a training convergence condition is reached.
7. The method of claim 2, wherein in the process of performing the ith iterative computation on the video feature and the description feature, computing an ith intermediate hidden vector set in a vector conversion network based on the visual feature in the video feature, the description feature and an iterative reference vector computed in the (i-1) th iterative computation comprises:
converting the visual features and the description features into word embedding vectors corresponding to the target video under the condition that i is equal to 1; extracting a category embedding vector, a paragraph embedding vector and a position embedding vector corresponding to the target video, wherein the category embedding vector is used for indicating category information of the target video, the paragraph embedding vector comprises a first paragraph sub-vector obtained based on the visual features and a second paragraph sub-vector obtained based on the description features, and the position embedding vector is used for indicating the frame order of each video frame in the target video and the order of each text word in the video description text; and taking the word embedding vector, the category embedding vector, the paragraph embedding vector and the position embedding vector as input vectors of the vector conversion network to calculate the ith intermediate hidden vector set;
and under the condition that i is larger than 1, taking the word embedding vector, the category embedding vector, the paragraph embedding vector, the position embedding vector and an iteration reference vector obtained by the i-1 th iteration calculation as input vectors of the vector conversion network to calculate and obtain the i-th intermediate hidden vector set.
8. The method according to claim 2, wherein the calculating an ith intermediate hidden vector set in the vector conversion network based on the visual features in the video features, the description features, and the iterative reference vector obtained by the (i-1)th iterative computation comprises:
determining the ith visual hidden vector and the ith description hidden vector based on the visual features and the description features, wherein the ith visual hidden vector and the ith description hidden vector are used for indicating the context relationship of characters in the target video tag;
and determining the iteration reference vector obtained by the ith iteration calculation based on the visual feature, the description feature and the iteration reference vector obtained by the (i-1) th iteration calculation.
9. The method of claim 8, wherein the determining a target character from each of the candidate character sequences based on the confidence level comprises:
and according to the context relationship of the characters in the target video label, taking the target character determined in the jth candidate character sequence as the jth target character in the target video label.
10. The method of claim 9, further comprising, after said splicing the M target characters into a target video tag matching the target video:
and adjusting the sequence of the M target characters in the target video label based on the context semantic relationship to obtain the updated target video label.
11. An apparatus for generating a video tag, comprising:
an acquisition unit, configured to acquire a target video to be identified, wherein the target video carries a video description text for describing the target video;
a construction unit, configured to construct a target candidate word library of the target video by using the video description text, wherein the target candidate word library comprises a reference character set and a description character set, the reference character set comprises reference characters obtained by segmenting a plurality of historical tag texts, and the description character set comprises description characters obtained by segmenting the video description text;
a calculation unit, configured to perform N iterative computations on the video features and the description features to obtain M candidate character sequences under the condition that the video features of the target video and the description features of the video description text are extracted, wherein the candidate character objects in each candidate character sequence comprise each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence is used for indicating the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;
the determining unit is used for determining target characters from each candidate character sequence based on the confidence coefficient to obtain M target characters;
and the splicing unit is used for splicing the M target characters into target video labels matched with the target videos.
12. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 10.
13. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 10 by means of the computer program.
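Claims 4 and 5 read naturally as a pointer-generator style fusion of two confidence distributions. The LaTeX sketch below is only one plausible reading; the symbols W, b, h_t, H_desc and \lambda are introduced here for illustration and are not the patent's notation:

    % claim 4, first confidence: generation over the reference character set
    p_{\mathrm{ref}} = \operatorname{softmax}(W_1 h_t + b_1)
    % claim 4, second confidence: pointer over the description character set
    u = W_2 H_{\mathrm{desc}} + b_2, \qquad v = W_r h_t + b_r, \qquad
    p_{\mathrm{desc}} = \operatorname{softmax}(u^{\top} v)
    % claim 5: fused confidence over the whole candidate word library
    p = \lambda\, p_{\mathrm{ref}} + (1 - \lambda)\, p_{\mathrm{desc}}

Here h_t stands for the iterative reference vector of the ith iteration and H_desc for the ith description hidden vectors; \lambda plays the role of the fully connected layer's learned weighting.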
CN202210404088.8A 2022-04-18 2022-04-18 Video tag generation method and device, storage medium and electronic device Active CN115114479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404088.8A CN115114479B (en) 2022-04-18 2022-04-18 Video tag generation method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN115114479A true CN115114479A (en) 2022-09-27
CN115114479B CN115114479B (en) 2025-03-14

Family

ID=83324810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210404088.8A Active CN115114479B (en) 2022-04-18 2022-04-18 Video tag generation method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115114479B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970906A (en) * 2014-05-27 2014-08-06 百度在线网络技术(北京)有限公司 Method and device for establishing video tags and method and device for displaying video contents
US20200336802A1 (en) * 2019-04-16 2020-10-22 Adobe Inc. Generating tags for a digital video
US20210027018A1 (en) * 2019-07-22 2021-01-28 Advanced New Technologies Co., Ltd. Generating recommendation information
CN113590876A (en) * 2021-01-22 2021-11-02 腾讯科技(深圳)有限公司 Video label setting method and device, computer equipment and storage medium
US20210406553A1 (en) * 2019-08-29 2021-12-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for labelling information of video frame, device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE Chunhui: "Web Video Classification Algorithm Based on Description Text and Entity Tags", Journal of Hunan City University (Natural Science Edition), No. 03, 15 May 2018 (2018-05-15) *

Also Published As

Publication number Publication date
CN115114479B (en) 2025-03-14

Similar Documents

Publication Publication Date Title
CN108875074B (en) Answer selection method and device based on cross attention neural network and electronic equipment
CN106599226B (en) A content recommendation method and content recommendation system
CN117114063B (en) Methods for training large generative language models and for image processing tasks
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN111581510A (en) Shared content processing method and device, computer equipment and storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN113688951B (en) Video data processing method and device
CN115115914B (en) Information identification method, apparatus and computer readable storage medium
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
CN113705313A (en) Text recognition method, device, equipment and medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN110717038A (en) Object classification method and device
CN118692014B (en) Video tag identification method, device, equipment, medium and product
CN116958738A (en) Training method and device of picture recognition model, storage medium and electronic equipment
CN117390074A (en) A serialized recommendation method, device and storage medium based on long user behavior
CN118484517A (en) Text query method and device, electronic equipment and storage medium
CN117018632A (en) Game platform intelligent management method, system and storage medium
CN113704507A (en) Data processing method, computer device and readable storage medium
CN115129883B (en) Entity linking method and device, storage medium and electronic equipment
CN110866195B (en) Text description generation method and device, electronic equipment and storage medium
CN117874354A (en) Method and device for sorting search recommended words and electronic equipment
CN118075573A (en) Video title generation method and device, electronic equipment and storage medium
CN115114479B (en) Video tag generation method and device, storage medium and electronic device
CN112905884B (en) Method, apparatus, medium and program product for generating sequence annotation model
CN117033646A (en) Information query method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant