
CN115114479A - Method and device for generating video label, storage medium and electronic device - Google Patents

Method and device for generating video label, storage medium and electronic device

Info

Publication number
CN115114479A
CN115114479A
Authority
CN
China
Prior art keywords
video
target
vector
character
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210404088.8A
Other languages
Chinese (zh)
Other versions
CN115114479B (en)
Inventor
徐鲁辉
熊鹏飞
陈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210404088.8A priority Critical patent/CN115114479B/en
Publication of CN115114479A publication Critical patent/CN115114479A/en
Application granted granted Critical
Publication of CN115114479B publication Critical patent/CN115114479B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for generating video tags, a storage medium, and an electronic device. The method includes: acquiring a target video to be identified, where the target video carries video description text describing the target video; constructing a target candidate character library for the target video from the video description text; after extracting video features of the target video and description features of the video description text, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that character; determining a target character from each candidate character sequence based on the confidences, yielding M target characters; and splicing the M target characters into a target video tag matching the target video. The invention solves the technical problem that video tags determined by existing methods have low accuracy.

Description

Method and device for generating a video tag, storage medium, and electronic device

Technical Field

The present invention relates to the field of computers, and in particular to a method and device for generating a video tag, a storage medium, and an electronic device.

Background Art

With the development of online video platforms, Internet users are increasingly accustomed to obtaining the latest information by browsing videos. When an online video is distributed, the video platform usually matches tags to it and then implements video recommendation and video search functions based on those tags. For an online video, fresh and well-matched tags allow users to discover new trending events immediately, which in turn increases the click-through rate of new trending videos.

Existing methods for matching tags to online videos usually perform multiple classification and clustering operations on extracted video features during the matching process, and then select a tag that matches the video from a predefined tag system.

It follows that existing tag determination methods can only select tags matching a video from a preset tag system. For a video whose content concerns a fresh event, the preset tag library contains no tag entry corresponding to that event, so no fresh tag can be matched to the fresh video. As a result, when the platform provides tag-based video recommendation and video search, it can neither recommend accurately nor return accurate search results. In other words, existing methods for determining video tags suffer from the technical problem that the determined tags have low accuracy.

No effective solution to the above problems has yet been proposed.

Summary of the Invention

Embodiments of the present invention provide a method and device for generating a video tag, a storage medium, and an electronic device, so as to at least solve the technical problem that video tags determined by existing methods have low accuracy.

According to one aspect of the embodiments of the present invention, a method for generating a video tag is provided, including: acquiring a target video to be identified, where the target video carries video description text describing the target video; constructing a target candidate character library for the target video from the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text; after extracting video features of the target video and description features of the video description text, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1; determining a target character from each candidate character sequence based on the confidences to obtain M target characters; and splicing the M target characters into a target video tag matching the target video.

According to another aspect of the embodiments of the present invention, a device for generating a video tag is also provided, including: an acquisition unit configured to acquire a target video to be identified, where the target video carries video description text describing the target video; a construction unit configured to construct a target candidate character library for the target video from the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text; a computation unit configured to, after video features of the target video and description features of the video description text have been extracted, perform N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1; a determination unit configured to determine a target character from each candidate character sequence based on the confidences to obtain M target characters; and a splicing unit configured to splice the M target characters into a target video tag matching the target video.

According to a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, where the computer program is configured to execute the above method for generating a video tag when run.

According to a further aspect of the embodiments of the present application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above method for generating a video tag.

According to a further aspect of the embodiments of the present invention, an electronic device is also provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to execute the above method for generating a video tag by means of the computer program.

In the embodiments of the present invention, a target video to be identified is acquired, and a target candidate character library for the target video is constructed from its video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text. After video features of the target video and description features of the video description text are extracted, N iterations of computation are performed on the video features and the description features to obtain M candidate character sequences; a target character is determined from each candidate character sequence based on the confidences, yielding M target characters; and the M target characters are spliced into a target video tag matching the target video. That is, in the embodiments of this application, the video features and description text features of the target video are computed iteratively, target characters are output character by character from the library formed by the reference character set and the description character set built from the video description text, and the output target characters are assembled into the target video tag. Because the target candidate character library used for outputting characters incorporates the description character set, which contains fresh text content related to the video, fresh characters can be output with its help and spliced into fresh tags. This improves the accuracy of the output video tags and solves the technical problem that video tags obtained by existing determination methods have low accuracy.

Brief Description of the Drawings

The accompanying drawings described here are provided for a further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

FIG. 1 is a schematic diagram of the hardware environment of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 2 is a flowchart of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of another optional method for generating a video tag according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an optional method for generating a video tag according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another optional method for generating a video tag according to an embodiment of the present invention;

FIG. 7 is a flowchart of another optional method for generating a video tag according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an optional device for generating a video tag according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an optional electronic device according to an embodiment of the present invention.

Detailed Description of Embodiments

To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.

According to one aspect of the embodiments of the present invention, a method for generating a video tag is provided. As an optional implementation, the method may be applied, but is not limited, to the video tag generation system shown in FIG. 1, which consists of a server 102 and a terminal device 104. As shown in FIG. 1, the server 102 is connected to the terminal device 104 through a network 110. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, WiFi, and other networks that implement wireless communication. The terminal device may include, but is not limited to, at least one of the following: a mobile phone (such as an Android or iOS phone), a notebook computer, a tablet computer, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, an in-vehicle device, and so on. A client, such as a video sharing client, may be installed on the terminal device. The terminal device is also provided with a display, a processor, and a memory: the display can show the interface of the video upload application and the video content on the upload server; the processor can preprocess the video file to be uploaded before transmission, for example by compressing the acquired video file; and the memory stores the video to be uploaded. It can be understood that after the terminal device 104 acquires the target video to be uploaded, it can send the target video to the server 102 through the network 110. Upon receiving the target video, the server 102 generates a video tag matching the video uploaded by the terminal device 104, and the terminal device 104 can receive the returned video tag from the server 102 through the network 110. The server 102 may be a single server, a server cluster composed of multiple servers, or a cloud server. The server includes a database and a processing engine, where the database may contain a historical tag thesaurus for matching tags to videos and a pre-trained tag generation model, and the processing engine is used to generate target video tags from the acquired target video.

According to one aspect of the embodiments of the present invention, the above video tag generation system may further perform the following steps. The terminal device 104 performs step S102 to acquire the target video to be identified, and then performs step S104 to send the target video to the server 102 through the network 110. The server 102 performs steps S106 to S114: constructing a target candidate character library for the target video from the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text; after extracting video features of the target video and description features of the video description text, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1; determining a target character from each candidate character sequence based on the confidences to obtain M target characters; and splicing the M target characters into a target video tag matching the target video. Then, in step S116, the server 102 sends the target video tag to the terminal device 104 through the network 110. It can be understood that when the terminal device 104 has sufficient computing capability, steps S106 to S114 may also be performed on the terminal device 104.

In the embodiments of the present invention, a target video to be identified is acquired, and a target candidate character library for the target video is constructed from its video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text. After video features of the target video and description features of the video description text are extracted, N iterations of computation are performed on the video features and the description features to obtain M candidate character sequences; a target character is determined from each candidate character sequence based on the confidences, yielding M target characters; and the M target characters are spliced into a target video tag matching the target video. That is, in the embodiments of this application, the video features and description text features of the target video are computed iteratively, target characters are output character by character from the library formed by the reference character set and the description character set built from the video description text, and the output target characters are assembled into the target video tag. Because the target candidate character library used for outputting characters incorporates the description character set, which contains fresh text content related to the video, fresh characters can be output with its help and spliced into fresh tags. This improves the accuracy of the output video tags and solves the technical problem that video tags obtained by existing determination methods have low accuracy.

The above is only an example, and this embodiment does not impose any limitation in this respect.

As an optional implementation, as shown in FIG. 2, the above method for generating a video tag includes the following steps:

S202: acquire a target video to be identified, where the target video carries video description text describing the target video.

It should be noted that the video description text may include, but is not limited to, one or more kinds of text information describing the content of the target video, such as the title of the target video, its subtitles, its tag information, and text appearing in its frames. The specific text information is not limited here.

S204: construct a target candidate character library for the target video from the video description text.

The target candidate character library includes a reference character set and a description character set. The reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text.

The above method is explained below with a concrete example. Suppose that, in past tag generation, it has been determined that videos can be classified with the tags "游戏" (game), "爱情" (love), and "电视剧" (TV drama). From these historical tags the reference character set can be determined as the characters 游, 戏, 爱, 情, 电, 视, 剧.

Suppose the description text of the target video to be identified is a subtitle in the video, "城市英雄游戏真好玩，大家快来玩吧！" ("The City Hero game is great fun, come and play!"). From this description text the description character set can be determined as 城, 市, 英, 雄, 游, 戏, 真, 好, 玩, 大, 家, 快, 来, 玩, 吧.

Combining the reference character set and the description character set yields the target candidate character library 游, 戏, 爱, 情, 电, 视, 剧, 城, 市, 英, 雄, 游, 戏, 真, 好, 玩, 大, 家, 快, 来, 玩, 吧.

This library contains the characters 游 and 戏 from both the description character set and the reference character set. In an optional implementation, the repeated characters in the library can be deduplicated to obtain the updated target candidate character library 游, 戏, 爱, 情, 电, 视, 剧, 城, 市, 英, 雄, 真, 好, 玩, 大, 家, 快, 来, 吧.

It can be understood that the above way of constructing the target candidate character library is only an example; in specific implementations, the target candidate character library may also be determined in other, similar ways, and the specific construction method is not limited here.
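As a non-authoritative illustration, the following Python sketch shows one way the construction just described could be implemented; the function name is hypothetical, character-level segmentation is assumed, and the deduplication step corresponds to the optional update above.

def build_candidate_library(historical_tags, description_text):
    """Build the target candidate character library (reference set + description set)."""
    reference_chars = [ch for tag in historical_tags for ch in tag]       # reference character set
    description_chars = [ch for ch in description_text if ch.isalnum()]   # drop punctuation
    seen, library = set(), []
    for ch in reference_chars + description_chars:
        if ch not in seen:  # optional deduplication described above
            seen.add(ch)
            library.append(ch)
    return library

# Reproduces the example above:
tags = ["游戏", "爱情", "电视剧"]
caption = "城市英雄游戏真好玩，大家快来玩吧！"
print(build_candidate_library(tags, caption))
# ['游', '戏', '爱', '情', '电', '视', '剧', '城', '市', '英', '雄', '真', '好', '玩', '大', '家', '快', '来', '吧']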

S206: after video features of the target video and description features of the video description text have been extracted, perform N iterations of computation on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate character library together with a confidence matched to that character.

It can be understood that the confidence indicates the degree of matching between a candidate character and the target video, and that N and M are natural numbers greater than or equal to 1.

It can be understood that the video features may be visual features characterizing the style of the video frames, or video content features characterizing the content of the frames; the specific content represented by the video features is not limited here. The description features are text features corresponding to the video description text, which may be title text features, subtitle text features, or features of text appearing in the video frames; the specific content of the video description text features is likewise not limited here.

It can be understood that, in the above iterative computation, the video features and the description text features can be converted into corresponding vectors to facilitate iteration. Moreover, "iterative" here means that the input of the current iteration is related to the output of the previous iteration; that is, in the above embodiments of this application, later results of the iterative computation are correlated with its earlier results.

S208: determine a target character from each candidate character sequence based on the confidences, obtaining M target characters.

The candidate character sequences are explained below. It can be understood that a candidate character sequence consists of the matching confidence of every character in the target candidate character library. The explanation continues with the deduplicated candidate library 游, 戏, 爱, 情, 电, 视, 剧, 城, 市, 英, 雄, 真, 好, 玩, 大, 家, 快, 来, 吧 from the example above.

Suppose that after the first iteration of computation on the video features and description features of the above video, the first candidate character sequence obtained is: 游 (98%), 戏 (90%), 爱 (10%), 情 (10%), 电 (5%), 视 (8%), 剧 (8%), 城 (58%), 市 (60%), 英 (88%), 雄 (70%), 真 (60%), 好 (70%), 玩 (60%), 大 (28%), 家 (5%), 快 (18%), 来 (8%), 吧 (1%).

Suppose the output rule is to emit the character with the highest confidence; from the above candidate sequence, the first character output for the current sequence is 游.

Suppose that a second iteration is then computed on the video features and description features together with the result of the previous iteration, and that the second candidate character sequence obtained is: 游 (70%), 戏 (98%), 爱 (10%), 情 (10%), 电 (5%), 视 (8%), 剧 (8%), 城 (58%), 市 (60%), 英 (88%), 雄 (70%), 真 (60%), 好 (70%), 玩 (60%), 大 (28%), 家 (5%), 快 (18%), 来 (8%), 吧 (1%).

Again emitting the character with the highest confidence, the second character output for the current sequence is 戏.

Iterating in this way several times, suppose the M characters finally output are 游, 戏, 城, 市, 英, 雄, 好, 游, 戏.
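A minimal sketch of the greedy selection rule used in this walkthrough: each iteration emits the candidate character with the highest confidence. The confidence values below are the illustrative numbers from the example, not real model outputs, and the function name is an assumption.

def pick_target_char(candidate_sequence):
    """candidate_sequence: list of (character, confidence) pairs for one iteration."""
    return max(candidate_sequence, key=lambda pair: pair[1])[0]

iteration_1 = [("游", 0.98), ("戏", 0.90), ("城", 0.58), ("市", 0.60), ("英", 0.88)]
iteration_2 = [("游", 0.70), ("戏", 0.98), ("城", 0.58), ("市", 0.60), ("英", 0.88)]
print([pick_target_char(seq) for seq in (iteration_1, iteration_2)])  # ['游', '戏']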

S210: splice the M target characters into a target video tag matching the target video.

Continuing the example above, when the M target characters obtained are 游, 戏, 城, 市, 英, 雄, 好, 游, 戏, the characters can be spliced in the order they were output to obtain three tags: "游戏" (game), "城市英雄" (City Hero), and "好游戏" (good game).

In an optional approach, identifiers can also be output during character generation to indicate how the target video tags should be spliced. For example, when the M target characters obtained are 游, 戏, CLS, 城, 市, 英, 雄, CLS, 好, 游, 戏, the separator identifier "CLS" between 游, 戏 and 城, 市, 英, 雄 indicates that 游 and 戏 are spliced into "游戏" as the first target tag, that 城, 市, 英, 雄 are spliced into "城市英雄" as the second target tag, and that 好, 游, 戏 after the second separator are spliced into "好游戏" as the third target tag.

In another optional approach, when the M target characters obtained are the unordered characters 游, 游, 戏, 城, 英, 市, 雄, 好, 戏, semantic analysis can determine that 游 and 戏 are spliced into the first tag "游戏", that 游, 好, 戏 are spliced into the second tag "好游戏", and that 城, 英, 市, 雄 are spliced into the third tag "城市英雄". It follows that when a promotional video for a new game, City Hero, goes online, the above implementation can produce the tag "城市英雄" (City Hero) through iterative computation based on the fresh characters supplied by the description character library, even though the original historical tags do not contain "City Hero". This achieves the technical effect of automatically generating fresh tags and improves the efficiency and accuracy of tag generation. Moreover, because tags are produced character by character in the above implementation, the tag "好游戏" (good game) can be generated even though neither the description text nor the historical tag library originally contains it. A user searching for "good game" can then still retrieve the video, increasing its click-through rate and viewing popularity.
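Under the assumption that the separator token is the literal string "CLS", as in the example above, a sketch of the separator-based splicing might look as follows; the function name is illustrative, not the patent's API.

def splice_tags(emitted_tokens, separator="CLS"):
    """Group emitted characters into tags, splitting at each separator token."""
    tags, current = [], []
    for token in emitted_tokens:
        if token == separator:
            if current:  # a separator closes the tag built so far
                tags.append("".join(current))
            current = []
        else:
            current.append(token)
    if current:
        tags.append("".join(current))
    return tags

emitted = ["游", "戏", "CLS", "城", "市", "英", "雄", "CLS", "好", "游", "戏"]
print(splice_tags(emitted))  # ['游戏', '城市英雄', '好游戏']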

Another specific implementation of the above method is described below with reference to FIG. 3.

As shown in FIG. 3, a cover image of a video is displayed. The image contains the text "今天你馋了吗" ("Are you craving something today?"); the title of the video is "追剧必备的小零食来一波，快@他做给你吃吧" ("Here comes a wave of must-have snacks for binge-watching, hurry and @ him to make them for you"); and the video includes the subtitle "这个教程非常简单" ("This tutorial is very simple").

For the video image shown in FIG. 3, the above related text can serve as the video description text information, which thus includes the text in the video image ("Are you craving something today?"), the text of the video title ("Here comes a wave of must-have snacks for binge-watching, hurry and @ him to make them for you"), and the subtitle ("This tutorial is very simple").

Next, video features of the video are extracted. A video feature may be an image feature of the video cover image shown in FIG. 3; the video category feature "food" can also be extracted as another video feature. Description features of the video description text information are extracted as well, that is, the feature information corresponding to the above description text. In a concrete form, all of these features can be represented as vectors or matrices.

Further, based on the extracted visual features, category features, and description features, a confidence is determined for each character in the candidate character library formed from the reference character library and the description character library. Suppose the historical tag texts preset for producing the reference characters do not include the tag "美食教程" (food tutorial). A traditional tag extraction approach can only select tags from the preset historical tag library and therefore cannot produce "food tutorial", a tag highly relevant to the video; that is, it suffers from the technical problem that the extracted tags have low accuracy.

In this embodiment, however, the current candidate character library includes both the reference characters obtained by segmenting historical tags and the characters of the current video description text, i.e., the text in the video image ("Are you craving something today?"), the title ("Here comes a wave of must-have snacks for binge-watching, hurry and @ him to make them for you"), the subtitle ("This tutorial is very simple"), and so on. By the above method, a confidence is obtained for each character in the candidate library formed from the historical tag library and this text content. Because the description text contains the key text "教程" (tutorial) and the video's category feature also reflects the key feature "美食" (food), the tag "美食教程" (food tutorial) can be output.

Suppose the relevant features of the above video are computed iteratively and the output characters are 油, 炸, 食, 品, 脆, 皮, 花, 生, 美, 食, 教, 程, 零, 食. The final output target tags, as shown in FIG. 3, are then "油炸食品" (fried food), "脆皮花生" (crispy peanuts), "美食教程" (food tutorial), and "零食" (snacks).

It can be seen that, by the above method of this application, tags not included in the historical tag library can be output based on the content of the video's description text, achieving the technical effect of improving the accuracy of the output video tags.

In the embodiments of the present invention, a target video to be identified is acquired, and a target candidate character library for the target video is constructed from its video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical tag texts, and the description character set includes description characters obtained by segmenting the video description text. After video features of the target video and description features of the video description text are extracted, N iterations of computation are performed on the video features and the description features to obtain M candidate character sequences; a target character is determined from each candidate character sequence based on the confidences, yielding M target characters; and the M target characters are spliced into a target video tag matching the target video. That is, in the embodiments of this application, the video features and description text features of the target video are computed iteratively, target characters are output character by character from the library formed by the reference character set and the description character set built from the video description text, and the output target characters are assembled into the target video tag. Because the target candidate character library used for outputting characters incorporates the description character set, which contains fresh text content related to the video, fresh characters can be output with its help and spliced into fresh tags. This improves the accuracy of the output video tags and solves the technical problem that video tags obtained by existing determination methods have low accuracy.

As an optional implementation, performing N iterations of computation on the video features and the description features to obtain M candidate character sequences includes:

S1: during the i-th iteration of computation on the video features and the description features, compute, in a vector conversion network, the i-th intermediate latent vector set based on the visual features among the video features, the description features, and the iteration reference vector obtained in the (i-1)-th iteration, where the i-th intermediate latent vector set includes an i-th visual latent vector matching the visual features, an i-th description latent vector matching the description features, and the iteration reference vector obtained in the i-th iteration, and i is a natural number greater than or equal to 1 and less than or equal to N;

S2: based on the iteration reference vector obtained in the i-th iteration, determine the matching confidence of each character in the target candidate character library and determine the j-th candidate character sequence from those confidences, where j is a natural number greater than or equal to 0 and less than or equal to i;

S3: if the iteration reference vector obtained in the i-th iteration does not satisfy the end condition, perform the (i+1)-th iteration;

S4: if the iteration reference vector obtained in the i-th iteration satisfies the end condition, determine the iteration reference vectors obtained in the first i iterations as the N iteration reference vectors.
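Steps S1 to S4 amount to a decoding loop that stops when the iteration reference vector signals the end condition. The following schematic sketch assumes the vector conversion network is wrapped in a hypothetical callable step(visual, description, previous_ref) that returns the new reference vector, the per-character confidences, and an end flag; all names are assumptions, not the patent's API.

def iterative_decode(step, visual_features, description_features, max_steps=64):
    """Run S1-S4: iterate until the reference vector reaches the end condition."""
    reference_vector = None          # no reference vector exists before the first iteration
    reference_vectors, candidate_sequences = [], []
    for i in range(1, max_steps + 1):
        reference_vector, confidences, reached_end = step(
            visual_features, description_features, reference_vector)
        reference_vectors.append(reference_vector)   # S1: i-th iteration reference vector
        if reached_end:                              # S4: end condition met, stop iterating
            break
        # S2: j-th candidate sequence; a separator iteration could be skipped here, so j <= i
        candidate_sequences.append(confidences)
    return reference_vectors, candidate_sequences    # S3 is the loop continuing otherwise

# Toy usage with a dummy step that stops after three iterations:
state = {"i": 0}
def dummy_step(visual, description, previous_ref):
    state["i"] += 1
    return f"ref_{state['i']}", {"游": 0.9}, state["i"] == 3
print(iterative_decode(dummy_step, visual_features=None, description_features=None))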

It should be noted that the vector conversion network may be a Transformer model. A typical Transformer architecture consists of an encoder and a decoder, collectively referred to as Transformer modules; the difference between the two is that during computation the decoder can only see information at the current position and earlier positions. Models that can be used in this embodiment include the BERT model (Bidirectional Encoder Representations from Transformers), the GPT model (Generative Pre-Training), the UniLM model (Unified Language Model), and the VPUniLM model. As a preferred option, the VPUniLM model can be used as the vector conversion network in this embodiment.
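To make the decoder's visibility constraint concrete, the sketch below builds a UniLM-style sequence-to-sequence attention mask in which the feature prefix (visual and description vectors) is fully visible while each generated position can only attend to the prefix and to earlier generated positions. This is a generic illustration of the masking idea, not the VPUniLM model itself.

import numpy as np

def seq2seq_attention_mask(prefix_len, generated_len):
    """Return a 0/1 mask; entry [q, k] is 1 if position q may attend to position k."""
    total = prefix_len + generated_len
    mask = np.zeros((total, total), dtype=int)
    mask[:, :prefix_len] = 1                # every position sees the whole prefix
    for t in range(generated_len):          # generated positions attend causally
        mask[prefix_len + t, prefix_len:prefix_len + t + 1] = 1
    return mask

print(seq2seq_attention_mask(prefix_len=3, generated_len=4))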

It can be understood that the visual features of the target video may be image features extracted from each frame of the target video, or image features obtained by extracting the style of the target video. The form of the specific visual features used to characterize the frame content of the target video is not limited here.

It should be noted that in the above model, besides obtaining the visual features and the description features among the video features, the iteration reference vector from the previous iteration must also be obtained, so that the visual features, the description features, and the previous iteration reference vector serve as inputs to the current iteration, from which the visual latent vector matching the visual features, the description latent vector matching the description features, and the current iteration reference vector are computed. It can be understood that, because each iteration takes the iteration reference latent vector of the previous iteration as an input, the result of each iteration is correlated with the results of earlier iterations; in other words, each iteration result reflects characteristics of the preceding results, so the relationship between the output results can be determined from the order in which they are output.

The counters i and j are explained below. It can be understood that the i-th iteration may determine the j-th candidate character sequence used to generate a target character, with j less than or equal to i; that is, the number of iteration reference vectors obtained may be greater than or equal to the number of candidate character sequences. In a specific implementation, the iteration reference vector obtained in the i-th iteration may be a vector indicating a separator. For example, the iteration reference vector from the first iteration determines the first candidate character sequence; that from the second iteration determines the second candidate character sequence; that from the third iteration determines the first separator; that from the fourth iteration determines the third candidate character sequence; that from the fifth iteration determines the fourth candidate character sequence; and that from the sixth iteration indicates the end separator. Corresponding to these six iterations, four candidate character sequences are determined; four target characters can then be determined from them, and two target tags are obtained by combining the four target characters.

In another implementation, the counting indices $i$ and $j$ may be equal; that is, the number of iterative reference vectors obtained by iterative calculation may equal the number of candidate character sequences. For example, five candidate character sequences are determined corresponding to five iterative reference vectors, five target characters are then determined, and one or more target labels are obtained by splicing these five target characters.

Through the above method of this application, in the process of performing the $i$-th iterative calculation on the video features and the description features, the $i$-th intermediate latent vector set is computed in the vector conversion network based on the visual features among the video features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation; based on the iterative reference vector obtained by the $i$-th iterative calculation, the matching confidence of each character in the target candidate character library is determined, and the $j$-th candidate character sequence is determined according to the confidences, where $j$ is a natural number greater than or equal to 0 and less than or equal to $i$; if the iterative reference vector obtained by the $i$-th iterative calculation has not reached the end condition, the $(i+1)$-th iterative calculation is executed; if it has reached the end condition, the iterative reference vectors obtained by the first $i$ iterative calculations are determined as the N iterative reference vectors, and the corresponding candidate character sequences are determined from them. In this implementation, intermediate latent vectors are produced iteratively and the candidate character sequences are determined from the iterative reference vectors among them, which guarantees the sequential association between the candidate character sequences. Determining the output target characters and the spliced target labels from sequentially associated candidate character sequences achieves the technical effect of outputting highly accurate video labels and solves the technical problem that labels obtained by existing label generation methods have low accuracy.

As an optional implementation, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the matching confidence of each character in the target candidate character library includes: in a label generation network connected to the vector conversion network, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the first confidence with which each reference character of the reference character set matches, and the second confidence with which each description character of the description character set matches.

It can be understood that, in this embodiment, in the above process of computing confidences from the iterative reference vector, the confidence calculation for the reference characters of the reference character set differs from that for the description characters of the description character set. For example, different calculation methods or different calculation parameters can be used, thereby distinguishing the method for computing the first confidence of the reference character set from the method for computing the second confidence of the description character set.

In one optional manner, a higher weight parameter can be configured for the computed initial first confidence corresponding to the reference character set to obtain the first confidence, and a lower weight parameter can be configured for the computed initial second confidence corresponding to the description character set to obtain the second confidence. This raises the confidence of the characters of the reference character set, making it more probable that the output character is a character of the reference character set, thereby achieving the technical effect of constraining the output character types and rules through the reference character set.
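Purely for illustration, a minimal sketch of this re-weighting; the weight values 1.2 and 0.8 are invented for the example and are not taken from the patent.

```python
# Sketch: biasing output toward the reference character set by
# re-weighting the initial confidences. The weights are assumed values.

REF_WEIGHT = 1.2   # higher weight for reference-set characters
DESC_WEIGHT = 0.8  # lower weight for description-set characters

def reweight(initial_ref_conf, initial_desc_conf):
    first_conf = {c: s * REF_WEIGHT for c, s in initial_ref_conf.items()}
    second_conf = {c: s * DESC_WEIGHT for c, s in initial_desc_conf.items()}
    return first_conf, second_conf

first, second = reweight({"亲": 0.5, "情": 0.6}, {"我": 0.5, "爱": 0.6, "你": 0.5})
# Reference characters now score relatively higher, so the output is
# more likely to be drawn from the reference character set.
```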

Through the above embodiment of this application, by determining, in the label generation network connected to the vector conversion network and based on the iterative reference vector obtained by the $i$-th iterative calculation, the first confidence of each reference character of the reference character set and the second confidence of each description character of the description character set, the confidence calculation methods for the reference character set and the description character set are distinguished. The probabilities of outputting characters from the reference character set and the description character set can then be adjusted as needed, achieving the technical effect of constraining the output character types and rules through the reference character set.

As an optional implementation, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the first confidence of each reference character of the reference character set and the second confidence of each description character of the description character set includes:

S1, determining the first weight coefficient and first bias parameter in the label generation network matched to each reference character, the second weight coefficient and second bias parameter matched to each description character, and the reference weight and reference bias parameter;

S2, determining the first confidence of each reference character based on the iterative reference vector obtained by the $i$-th iterative calculation, the first weight coefficient, and the first bias parameter;

S3, obtaining the first intermediate result computed from the $i$-th description latent vector, the second weight coefficient and the second bias parameter, and the second intermediate result computed from the iterative reference vector obtained by the $i$-th iterative calculation, the reference weight and the reference bias parameter;

S4, determining the second confidence of each description character based on the first intermediate result and the second intermediate result.

The above implementation is described concretely below. It should be noted that, in this embodiment, the video description text is the video title.

First, denote the description latent vector by $h_n^{title}$: the $n$-th latent vector, corresponding to the $n$-th character, that the Transformer model outputs for the description text feature (the title, in this embodiment) in one iterative calculation. Denote the reference iterative latent vector by $h_t^{dec}$. It should be noted that, in this embodiment, the iterative calculation involves repeated rounds of input and output; $h_t^{dec}$ denotes the latent vector at the last output position of the $t$-th iterative calculation.

Next, according to the training result of the model, the first weight coefficients and first bias parameters in the label generation network matched to the reference characters, the second weight coefficients and second bias parameters matched to the description characters, and the reference weight and reference bias parameters are determined; that is, the following six parameters are determined:

The first weight coefficient: $W_i^{tag}$, the weight parameter of the $i$-th character in the reference character set;

The first bias parameter: $b_i^{tag}$;

The second weight coefficients: $W_{title}$, $W_{dec}$;

The second bias parameters: $b_{title}$ and $b_{dec}$ (of these, $W_{dec}$ and $b_{dec}$ play the role of the reference weight and reference bias parameter).

Based on the parameters obtained from training: let $h_n^{title}$ denote the model's latent vector output at the title positions (i.e., the description latent vector); at the $t$-th decoding step, let $h_t^{dec}$ denote the latent vector output at the last position of that step (i.e., the reference iterative latent vector); and let $s_{t,i}^{tag}$ denote the score of the $i$-th character of the reference character set at step $t$. Then $s_{t,i}^{tag}$ is computed by the following formula:

$$s_{t,i}^{tag} = W_i^{tag} \cdot h_t^{dec} + b_i^{tag}$$

where $W_i^{tag}$ is the weight parameter and $b_i^{tag}$ the bias parameter of the $i$-th character in the reference character set.

To select characters from the title text, a pointer network module is added to predict the score of selecting a word from the title. Let $h_n^{title}$ be the latent vector corresponding to the $n$-th character of the title and $s_{t,n}^{title}$ the score of the $n$-th character at step $t$; then $s_{t,n}^{title}$ is computed by the following formula:

$$s_{t,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_t^{dec} + b_{dec}\right)$$

where $W_{title}$ and $W_{dec}$ are weight parameters, and $b_{title}$ and $b_{dec}$ are bias parameters.

It should be noted that the first intermediate result above is the term $W_{title} \cdot h_n^{title} + b_{title}$ in the formula, and the second intermediate result is the term $W_{dec} \cdot h_t^{dec} + b_{dec}$. The first matching degree of each reference character is $s_{t,i}^{tag}$, and the second matching degree of each description character is $s_{t,n}^{title}$.
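To make the two score computations concrete, here is a minimal NumPy sketch following the formulas reconstructed above (an affine tag score and a dot-product pointer score). The dimensions, variable names and random parameters are assumptions for illustration, not verified patent code.

```python
import numpy as np

# Sketch of the step-t scores. Shapes are assumptions:
# d = hidden size, V_tag = reference charset size, n_title = title length.
d, V_tag, n_title = 64, 1000, 7
rng = np.random.default_rng(0)

W_tag = rng.normal(size=(V_tag, d))      # first weight coefficients W_i^tag
b_tag = np.zeros(V_tag)                  # first bias parameters b_i^tag
W_title = rng.normal(size=(d, d))        # second weight coefficient
b_title = np.zeros(d)                    # second bias parameter
W_dec = rng.normal(size=(d, d))          # reference weight
b_dec = np.zeros(d)                      # reference bias parameter

h_title = rng.normal(size=(n_title, d))  # description latent vectors h_n^title
h_dec = rng.normal(size=d)               # reference iterative latent vector h_t^dec

# First confidence: s_{t,i}^{tag} = W_i^{tag} h_t^{dec} + b_i^{tag}
s_tag = W_tag @ h_dec + b_tag                        # shape (V_tag,)

# Second confidence via the pointer module:
#   first intermediate result:  W_title h_n^title + b_title
#   second intermediate result: W_dec   h_t^dec   + b_dec
inter1 = h_title @ W_title.T + b_title               # shape (n_title, d)
inter2 = W_dec @ h_dec + b_dec                       # shape (d,)
s_title = inter1 @ inter2                            # shape (n_title,)
```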

Through the above implementation of this application, the first weight coefficient and first bias parameter in the label generation network matched to each reference character, the second weight coefficient and second bias parameter matched to each description character, and the reference weight and reference bias parameter are determined; the first confidence of each reference character is determined based on the iterative reference vector obtained by the $i$-th iterative calculation, the first weight coefficient and the first bias parameter; the first intermediate result computed from the $i$-th description latent vector, the second weight coefficient and the second bias parameter, and the second intermediate result computed from the iterative reference vector obtained by the $i$-th iterative calculation, the reference weight and the reference bias parameter, are obtained; and the second confidence of each description character is determined based on the first intermediate result and the second intermediate result. In this way the score of every candidate character in the reference character set and of every description character in the description character set is determined precisely, and thus the score of every candidate character in the current iterative calculation step, which improves the accuracy of label generation.

As an optional implementation, determining, based on the iterative reference vector obtained by the $i$-th iterative calculation, the matching confidence of each character in the target candidate character library and determining the $j$-th candidate character sequence according to the confidences includes: performing a weighted summation over the obtained multiple first confidences and multiple second confidences through the fully connected layer in the label generation network to obtain the matching confidence of each character in the target candidate character library, so as to generate the $j$-th candidate character sequence.

It can be understood that the $t$-th iterative calculation yields the first matching degree $s_{t,i}^{tag}$ of each reference character and the second matching degree $s_{t,n}^{title}$ of each description character. In one implementation, different weights can be configured for the first matching degree and the second matching degree, and the confidence of each candidate character is determined solely from the result of the weighted summation.

In another implementation, in the multiple label extraction steps, the first matching degree $s_{t,i}^{tag}$ of each reference character and the second matching degree $s_{t,n}^{title}$ of each description character obtained at the $t$-th iterative calculation are configured with the same weight, and the resulting scores together with their corresponding characters are combined to obtain the candidate character sequence directly, as sketched below.
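A sketch of the merge step under the equal-weight variant; the softmax normalization used to turn summed scores into confidences is an assumption, since the patent only states that a weighted summation is performed over the fully connected layer's outputs.

```python
import numpy as np

def merge_and_pick(s_tag, s_title, ref_chars, desc_chars,
                   w_tag=1.0, w_title=1.0):
    """Weighted concatenation of both score sets, then argmax.

    Sketch only: the weights and the softmax normalization are
    assumptions for illustration.
    """
    scores = np.concatenate([w_tag * s_tag, w_title * s_title])
    conf = np.exp(scores - scores.max())
    conf /= conf.sum()                      # confidence of every candidate
    chars = list(ref_chars) + list(desc_chars)
    best = int(np.argmax(conf))
    return chars[best], conf[best]

char, p = merge_and_pick(np.array([0.4, 0.8]), np.array([0.7, 0.9, 0.76]),
                         ["亲", "情"], ["我", "爱", "你"])
# -> ('爱', ...) since the title score of '爱' is highest here
```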

Through the above implementation of this application, a weighted summation is performed over the obtained multiple first confidences and multiple second confidences through the fully connected layer in the label generation network, giving the matching confidence of each character in the target candidate character library, from which the $j$-th candidate character sequence is generated. Multiple first matching degrees and second matching degrees are thus determined through the multi-step iterative calculation, and multiple candidate sequences, together with the target characters corresponding to them, are determined based on these matching degrees, which improves the accuracy of the generated labels.

As an optional implementation, before acquiring the target video to be identified, the method further includes:

S1, acquiring a sample video and a sample label matching the sample video, where the sample video carries sample video description text;

S2, constructing a sample candidate character library of the sample video using the sample video description text, where the sample candidate character library includes a reference character set and a sample description character set, the reference character set includes reference characters obtained by segmenting multiple historical labels, and the sample description character set includes description characters obtained by segmenting the sample video description text;

S3, training the initialized label generation network using the sample video, the sample label and the sample candidate character library until the training convergence condition is reached.

The implementation of the training phase is described below. In the training phase, the labels corresponding to each video are known; for example, a video carries the two labels '游戏' ('game') and '城市英雄' ('City Hero'), which are joined with the symbol ';' to form the ground truth. The visual features of the video, the category embedding vector, the word embedding vectors of the title, and the word embedding vectors of the ground truth are input together; in the prediction phase, the visual features, the category embedding vector, the word embedding vectors of the title and the already-predicted word embedding vectors are taken as input. The model uses the encoder operation to compute the latent vectors at the positions of the visual features and the original title word embeddings, and the decoder operation to compute the latent vectors at the ground-truth positions. The latent vectors pass through one fully connected layer to compute the probability distribution over the dictionary. The loss function is cross entropy, where $n$ denotes the length of the predicted text and $V$ denotes the dictionary:

$$\mathcal{L} = -\sum_{t=1}^{n} \log p\left(y_t \mid y_{<t}\right), \qquad y_t \in V$$
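For illustration, a minimal sketch of constructing the ground truth and evaluating the loss above; the tags '游戏' and '城市英雄' come from the training description, while the tiny dictionary and the uniform stand-in probabilities are assumptions.

```python
import numpy as np

# Sketch of ground-truth construction and the cross-entropy loss above.
tags = ["游戏", "城市英雄"]
target_text = ";".join(tags)              # ground truth: "游戏;城市英雄"

vocab = ["游", "戏", ";", "城", "市", "英", "雄", "<e>"]   # dictionary V
target_ids = [vocab.index(c) for c in target_text] + [vocab.index("<e>")]
n = len(target_ids)                       # length of the predicted text

def cross_entropy(probs, targets):
    """L = -sum_{t=1}^{n} log p(y_t | y_{<t}); probs[t] is the model's
    step-t distribution over V."""
    return -sum(np.log(probs[t, y]) for t, y in enumerate(targets))

# Stand-in for the fully connected layer's softmax output (uniform here).
probs = np.full((n, len(vocab)), 1.0 / len(vocab))
loss = cross_entropy(probs, target_ids)
```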

When the loss function shown above reaches the convergence condition, training of the following parameters is determined to be complete:

The first weight coefficient: $W_i^{tag}$, the weight parameter of the $i$-th character in the reference character set;

The first bias parameter: $b_i^{tag}$;

The second weight coefficients: $W_{title}$, $W_{dec}$;

The second bias parameters: $b_{title}$, $b_{dec}$.

The training process of another model, UniLM (Unified Language Model, a unified pre-trained language model), is described below.

Through the above method of this application, a sample video and a sample label matching the sample video are acquired, where the sample video carries sample video description text; a sample candidate character library of the sample video is constructed using the sample video description text, where the sample candidate character library includes a reference character set and a sample description character set, the reference character set includes reference characters obtained by segmenting multiple historical labels, and the sample description character set includes description characters obtained by segmenting the sample video description text; and the initialized label generation network is trained using the sample video, the sample label and the sample candidate character library until the training convergence condition is reached. The label generation network is thus obtained in a supervised manner, which improves the reliability of the target labels it produces and solves the technical problem that labels obtained by existing label generation methods have low accuracy.

As an optional manner, in the process of performing the $i$-th iterative calculation on the video features and the description features, computing the $i$-th intermediate latent vector set in the vector conversion network based on the visual features among the video features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation includes:

when $i$ equals 1, converting the visual features and the description features into the word embedding vector corresponding to the target video; extracting the category embedding vector, paragraph embedding vector and position embedding vector corresponding to the target video, where the category embedding vector indicates the category information of the target video, the paragraph embedding vector includes a first paragraph sub-vector derived from the visual features and a second paragraph sub-vector derived from the description features, and the position embedding vector indicates the frame order of the video frames in the target video and the order of the text tokens in the video description text; and taking the word embedding vector, category embedding vector, paragraph embedding vector and position embedding vector as the input vectors of the vector conversion network to compute the $i$-th intermediate latent vector set;

when $i$ is greater than 1, taking the word embedding vector, category embedding vector, paragraph embedding vector, position embedding vector, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation as the input vectors of the vector conversion network to compute the $i$-th intermediate latent vector set.
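A minimal sketch of assembling the four embeddings into the input of the vector conversion network. The hidden size, the embedding tables and the position-wise summation (in the style of BERT-like models) are assumptions; the patent only states that the four vectors are taken as the input.

```python
import numpy as np

# Sketch: assembling the i = 1 input from the four embedding vectors,
# using the FIG. 4 example sequence (F, F, F, CLS, 我, 爱, 你).
d = 64
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(7, d))           # word embedding vectors
seg_ids = np.array([0, 0, 0, 1, 1, 1, 1])    # paragraph embedding ids
pos_ids = np.array([0, 1, 2, 0, 1, 2, 3])    # position embedding ids
cat_ids = np.array([5, 5, 5, 5, 5, 5, 5])    # category embedding ids

seg_table = rng.normal(size=(2, d))
pos_table = rng.normal(size=(8, d))
cat_table = rng.normal(size=(16, d))

inputs = (word_emb + seg_table[seg_ids]
          + pos_table[pos_ids] + cat_table[cat_ids])   # shape (7, d)
# `inputs` is what the Transformer (vector conversion network) consumes;
# for i > 1, the embedding of the previously generated character is appended.
```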

As an optional implementation, computing the $i$-th intermediate latent vector set in the vector conversion network based on the visual features among the video features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation includes:

S1, determining the $i$-th visual latent vector and the $i$-th description latent vector based on the visual features and the description features, where the $i$-th visual latent vector and the $i$-th description latent vector indicate the contextual relationship of characters in the target video label;

S2, determining the iterative reference vector of the $i$-th iterative calculation based on the visual features, the description features, and the iterative reference vector obtained by the $(i-1)$-th iterative calculation.

The category embedding vector of the target video indicates the general category of the target video. For example, if the target video is about a sports meet, its category feature can characterize the video as 'sports', with a corresponding category embedding vector of (0, 0, 0, 0, 1); if the target video is a song MV, its category feature can characterize it as 'music', with a corresponding category embedding vector of (0, 0, 0, 0, 2); if the target video is the documentary 'Chinese Food', its category feature can characterize it as both 'food' and 'documentary', with a corresponding category embedding vector of (0, 0, 0, 0, 3). The above is only exemplary and does not limit the concrete form of the category embedding vector in this embodiment.

The above method is described below with reference to FIG. 4.

As shown in FIG. 4, before the iterative calculation, the vector combination input into the vector conversion network includes: the position embedding vector, the paragraph embedding vector, the word embedding vector, and the category embedding vector.

For the category feature of the video, in this embodiment each video category can be mapped to a vector group, so that the category feature is represented by that group. In the model diagram of FIG. 4, the converted category embedding vector is (5, 5, 5, 5, 5, 5, 5).

Next, the visual features and the description features are converted into the word embedding vector. Frames are first extracted from the target video at a rate of 1 fps, and the extracted frames are fed into a CLIP model to obtain the visual feature vectors, corresponding to the video vectors (F, F, F) in FIG. 4. The video description text is converted into vectors, corresponding to the title vectors (我, 爱, 你) ('I', 'love', 'you') in FIG. 4, and the connection vector CLS is inserted, forming the word embedding vector (F, F, F, CLS, 我, 爱, 你).
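A hedged sketch of this frame-feature step using the open-source CLIP package; the model variant "ViT-B/32" and the OpenCV-based sampling are assumptions, as the patent only states that frames extracted at 1 fps are fed into a CLIP model.

```python
import cv2
import torch
import clip                     # https://github.com/openai/CLIP
from PIL import Image

# Sketch: sample frames at 1 fps and encode them with CLIP.
model, preprocess = clip.load("ViT-B/32", device="cpu")

def frame_features(video_path: str) -> torch.Tensor:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps)))          # keep one frame per second
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            image = preprocess(Image.fromarray(rgb)).unsqueeze(0)
            with torch.no_grad():
                feats.append(model.encode_image(image))  # one "F" vector
        idx += 1
    cap.release()
    return torch.cat(feats) if feats else torch.empty(0)
```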

At the same time, as shown in FIG. 4, the paragraph embedding vector and the position embedding vector of the video are also obtained. The paragraph embedding vector is (0, 0, 0, 1, 1, 1, 1); it distinguishes the video-vector part from the title-vector part of the word embedding vector, where (0, 0, 0) corresponds to the video vectors (F, F, F) and the 1s correspond to CLS and the title vectors (我, 爱, 你). The position embedding vector (0, 1, 2, 0, 1, 2, 3) characterizes the order of the vectors input to the model: the (0, 1, 2) part identifies the frame order of the video vectors (F, F, F), and the (1, 2, 3) part identifies the order of the three text tokens of the title vector (我, 爱, 你).

As shown in FIG. 4, after the position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector are input into the Transformer module for the first time, the intermediate latent vector set is obtained: the visual latent vectors, the description latent vectors $h_n^{title}$, and the reference iterative latent vector $h_1^{dec}$ (the last not shown in the figure).

It can be understood that feeding this intermediate latent vector set into the pointer network module yields the score of every character in the reference character set and in the description character set.

Through the above method of this application, when $i$ equals 1, the visual features and the description features are converted into the word embedding vector corresponding to the target video; the category embedding vector, paragraph embedding vector and position embedding vector corresponding to the target video are extracted, where the category embedding vector indicates the category information of the target video, the paragraph embedding vector includes a first paragraph sub-vector derived from the visual features and a second paragraph sub-vector derived from the description features, and the position embedding vector indicates the frame order of the video frames in the target video and the order of the text tokens in the video description text; and the word embedding vector, category embedding vector, paragraph embedding vector and position embedding vector are taken as the input vectors of the vector conversion network to compute the $i$-th intermediate latent vector set. The confidence value of each candidate character is thereby obtained through the pointer network module and the Transformer module, which solves the technical problem that labels obtained by existing label generation methods have low accuracy.

As an optional implementation, determining the target character from each candidate character sequence based on the confidences includes: according to the contextual relationship of the characters in the target video label, taking the target character determined from the $j$-th candidate character sequence as the $j$-th target character in the target video label.

It can be understood that, in the model prediction stage, the scores $s_{t,i}^{tag}$ and $s_{t,n}^{title}$ obtained by the above method are concatenated, and the character with the highest score is selected; this is the $t$-th candidate character sequence obtained at the current step, from which the $t$-th target character is output.

The above method is further described with reference to FIG. 4.

As shown in FIG. 4, the first iterative calculation step proceeds as follows:

After the position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector shown in FIG. 4 are input into the Transformer module for the first time, the intermediate latent vector set output is obtained: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_1^{dec}$ (not shown in the figure).

Based on $h_n^{title}$, $h_1^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th character of the reference character set and the score of the $n$-th character of the description character set in the first iterative calculation step are computed by:

$$s_{1,i}^{tag} = W_i^{tag} \cdot h_1^{dec} + b_i^{tag}$$

$$s_{1,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_1^{dec} + b_{dec}\right)$$

where $s_{1,i}^{tag}$ denotes the score of the $i$-th character of the reference character set at step 1, and $s_{1,n}^{title}$ the score of the $n$-th character of the description character set at step 1. The scores $s_{1,i}^{tag}$ and $s_{1,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 1. As shown in FIG. 4, the output word of the first step is '爱' ('love');

The second parsing step is as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with $h_1^{dec}$ obtained in the first step, are input into the Transformer module, yielding the intermediate latent vector set: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_2^{dec}$.

Based on $h_n^{title}$, $h_2^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th character of the reference character set and the score of the $n$-th character of the description character set in the second iterative calculation step are computed by:

$$s_{2,i}^{tag} = W_i^{tag} \cdot h_2^{dec} + b_i^{tag}$$

$$s_{2,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_2^{dec} + b_{dec}\right)$$

where $s_{2,i}^{tag}$ denotes the score of the $i$-th character of the reference character set at step 2, and $s_{2,n}^{title}$ the score of the $n$-th character of the description character set at step 2. The scores $s_{2,i}^{tag}$ and $s_{2,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 2. As shown in FIG. 4, the output word of the second step is '情';

The ending step is as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with the reference latent vector $h_2^{dec}$ obtained in the preceding step, are input into the Transformer module, yielding the intermediate latent vector set, among which $h_3^{dec}$ matches the preset end identification vector; it is thus determined that the iteration steps are finished and no new characters are generated.

The output word '爱' of the first step and the output word '情' of the second step are then spliced, and the target label of the target video titled '我爱你' ('I love you') is obtained as '爱情' ('love').

Through the above implementation of this application, with iterative calculation and character-by-character output, since the target candidate character library used for outputting characters in the embodiments of this application incorporates the description character set, and the description character set includes fresh text content related to the video, fresh characters can be output in combination with the description character set and fresh labels can be spliced from the output fresh characters. This improves the accuracy of the output video labels and solves the technical problem that video labels obtained by existing video label determination methods have relatively low accuracy.

As an optional implementation, after splicing the M target characters into the target video label matching the target video, the method further includes: adjusting the order of the M target characters in the target video label based on contextual semantic relationships to obtain an updated target video label.

In one optional manner, suppose multiple iterations are performed according to the above method and the initially output M characters are, in order, '城', '市', '雄', '英', '游', '戏', '游', '戏', '好'. By semantic analysis of these characters, the order of '雄' and '英' can be adjusted to obtain the term '英雄' ('hero'), and the order of '游', '戏', '好' can be adjusted to obtain the term '好游戏' ('good game'). The final order of the updated M characters is thus determined as: '城', '市', '英', '雄', '游', '戏', '好', '游', '戏', and the target labels determined from this character order are: '城市英雄' ('City Hero'), '游戏' ('game'), '好游戏' ('good game').

The manner of ordering the output characters is described below. As shown in FIG. 5, the Transformer module can use an attention matrix for character input and output. Each row of the matrix corresponds to an output and each column to an input, so the attention matrix expresses the dependence of outputs on inputs. If white squares stand for 0 and black squares for 1, the first row indicates that the output character $x_1$ can attend only to the start marker <s>, the second row indicates that the output character $x_2$ can attend only to <s> and $x_1$, and so on. In other words, by introducing a lower-triangular mask into the Transformer's attention matrix and shifting the inputs and outputs by one position during training, a unidirectional language model is obtained, which outputs characters in a predetermined order.
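A minimal sketch of the lower-triangular mask described here, where row $t$ is an output position, column $j$ an input position, and 1 marks a visible entry:

```python
import numpy as np

# Lower-triangular attention mask for a unidirectional language model.
# Row t may attend only to the start marker and the first t characters.
def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=int))

print(causal_mask(4))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```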

Optionally, an out-of-order language model is described.

Like an ordinary language model, the out-of-order language model performs a conditional probability decomposition, but the decomposition order of the out-of-order language model is random, as shown below:

$$\begin{aligned}
p(x_1, x_2, x_3, \ldots, x_n) &= p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\cdots p(x_n \mid x_1, x_2, \ldots, x_{n-1})\\
&= p(x_3)\,p(x_1 \mid x_3)\,p(x_2 \mid x_3, x_1)\cdots p(x_n \mid x_3, x_1, \ldots, x_{n-1})\\
&= \cdots\\
&= p(x_{n-1})\,p(x_1 \mid x_{n-1})\,p(x_n \mid x_{n-1}, x_1)\cdots p(x_2 \mid x_{n-1}, x_1, \ldots, x_3)
\end{aligned}$$

From the above, any 'appearance order' of the characters $x_1, x_2, \ldots, x_n$ is possible. In principle each order corresponds to one model, so in principle there are $n!$ language models; with a Transformer-based model, however, all of these orders can be realized within a single model.

The implementation is described below with reference to FIG. 6. Taking the generation of '北京欢迎你' ('Beijing welcomes you') as an example, suppose the required generation order is '<s>→迎→京→你→欢→北→<e>'. Character output in that order can then be realized through the masked attention matrix of FIG. 6. As shown in FIG. 6, the fourth row has only one black square, meaning '迎' can attend only to the start marker <s>, while the second row has two black squares, meaning '京' can attend only to <s> and '迎', and so on. Intuitively, this 'shuffles' the lower-triangular mask of the unidirectional language model.

That is, implementing a language model with a particular order amounts to shuffling the original lower-triangular mask in a particular way. Because attention provides such an $n \times n$ attention matrix, the Transformer module in this embodiment has enough degrees of freedom to mask the character matrix in different ways, achieving the effect of diverse character output orders.
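A sketch of 'shuffling' the mask to realize a chosen generation order; the token indexing and the omission of the <s> column are simplifications for illustration.

```python
import numpy as np

# Sketch: mask for generating tokens in an arbitrary order.
# order[k] = index of the token generated at step k; each output row
# may attend only to tokens generated at earlier or equal steps
# (the <s> column is omitted here for brevity).
def permuted_mask(order):
    n = len(order)
    step_of = {tok: k for k, tok in enumerate(order)}
    mask = np.zeros((n, n), dtype=int)
    for i in range(n):          # output token i
        for j in range(n):      # input token j
            if step_of[j] <= step_of[i]:
                mask[i, j] = 1
    return mask

# "北京欢迎你" with generation order 迎→京→你→欢→北
# token indices: 北=0, 京=1, 欢=2, 迎=3, 你=4
print(permuted_mask([3, 1, 4, 2, 0]))
```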

The complete flow of the video label generation method provided by this application is described below with reference to FIG. 7:

S702, constructing the target candidate character library;

This implementation is described taking the video description text to be the video title. Suppose the video title is '我爱你' ('I love you') and the reference character library includes the characters '亲' and '情'; combining the video title with the characters of the reference character library yields the target candidate character library: '我', '爱', '你', '亲', '情'.

S704, extracting the video vectors;

Specifically, as shown in FIG. 4, the vector combination input into the vector conversion network includes: the position embedding vector, the paragraph embedding vector, the word embedding vector, and the category embedding vector.

For the category feature of the video, in this embodiment each video category can be mapped to a vector group, so that the category feature is represented by that group. In the model diagram of FIG. 4, the converted category embedding vector is (5, 5, 5, 5, 5, 5, 5).

Next, the visual features and the description features are converted into the word embedding vector. Frames are first extracted from the target video at a rate of 1 fps, and the extracted frames are fed into the CLIP model to obtain the visual feature vectors, corresponding to the video vectors (F, F, F) in FIG. 4. The video description text is converted into vectors, corresponding to the title vectors (我, 爱, 你) in FIG. 4, and the connection vector CLS is inserted, forming the word embedding vector (F, F, F, CLS, 我, 爱, 你).

At the same time, as shown in FIG. 4, the paragraph embedding vector and the position embedding vector of the video are also obtained. The paragraph embedding vector is (0, 0, 0, 1, 1, 1, 1); it distinguishes the video-vector part from the title-vector part of the word embedding vector, where (0, 0, 0) corresponds to the video vectors (F, F, F) and the 1s correspond to CLS and the title vectors (我, 爱, 你). The position embedding vector (0, 1, 2, 0, 1, 2, 3) characterizes the order of the vectors input to the model: the (0, 1, 2) part identifies the frame order of the video vectors (F, F, F), and the (1, 2, 3) part identifies the order of the three text tokens of the title vector (我, 爱, 你).

S706, inputting the extracted video vectors into the Transformer model to obtain the output intermediate vectors;

Specifically, after the position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector shown in FIG. 4 are input into the Transformer module for the first time, the intermediate latent vector outputs are obtained: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_1^{dec}$ (not shown in the figure).

S708, judging whether the intermediate vectors include the end vector;

That is, it is judged whether $h_1^{dec}$ is the end vector; if $h_1^{dec}$ is not the end vector, the subsequent steps are executed.

S710, parsing and computing the intermediate vectors based on the trained parameter set to obtain the score of each candidate character at step $t$;

S712, judging whether the current candidate character is the highest-scoring candidate character;

S714, determining the highest-scoring candidate character as the output character of this step;

Specifically, based on $h_n^{title}$, $h_1^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th word of the historical tag vocabulary and the score of the $n$-th word of the title vocabulary in the first parsing step are computed by:

$$s_{1,i}^{tag} = W_i^{tag} \cdot h_1^{dec} + b_i^{tag}$$

$$s_{1,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_1^{dec} + b_{dec}\right)$$

where $s_{1,i}^{tag}$ denotes the score of the $i$-th word of the historical tag vocabulary at step 1 and $s_{1,n}^{title}$ the score of the $n$-th word of the video title at step 1. The scores $s_{1,i}^{tag}$ and $s_{1,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 1. Suppose the resulting first candidate character sequence is: '我' (70%), '爱' (90%), '你' (76%), '亲' (40%), '情' (80%); the character with the highest confidence in this first candidate character sequence is '爱', so, as shown in FIG. 4, the output character of the first step is '爱'.

Since $h_1^{dec}$ is not the end vector, the second iterative calculation step is then executed as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with $h_1^{dec}$ obtained in the first step, are input into the Transformer module, yielding the intermediate latent vector outputs: the description latent vectors $h_n^{title}$ and the reference iterative latent vector $h_2^{dec}$.

Given the judgment that $h_2^{dec}$ is likewise not the end vector, and based on $h_n^{title}$, $h_2^{dec}$ and the trained parameters $W_i^{tag}$, $b_i^{tag}$, $W_{title}$, $W_{dec}$, $b_{title}$ and $b_{dec}$, the score of the $i$-th word of the historical tag vocabulary and the score of the $n$-th word of the title vocabulary in the second parsing step are computed by:

$$s_{2,i}^{tag} = W_i^{tag} \cdot h_2^{dec} + b_i^{tag}$$

$$s_{2,n}^{title} = \left(W_{title} \cdot h_n^{title} + b_{title}\right)^{\top}\left(W_{dec} \cdot h_2^{dec} + b_{dec}\right)$$

where $s_{2,i}^{tag}$ denotes the score of the $i$-th word of the historical tag vocabulary at step 2 and $s_{2,n}^{title}$ the score of the $n$-th word of the video title at step 2. The scores $s_{2,i}^{tag}$ and $s_{2,n}^{title}$ are concatenated and the highest-scoring word is selected as the output of step 2. Suppose the resulting second candidate character sequence is: '我' (60%), '爱' (20%), '你' (60%), '亲' (40%), '情' (80%); the character with the highest confidence in this second candidate character sequence is '情', so, as shown in FIG. 4, the output word of the second step is '情';

Since $h_2^{dec}$ is not the end vector, the third parsing step is then executed as follows:

The position embedding vector, paragraph embedding vector, word embedding vector and category embedding vector, together with the reference latent vector $h_2^{dec}$ obtained in the preceding step, are input into the Transformer module, yielding the intermediate latent vector outputs, among which $h_3^{dec}$ matches the preset end identification vector; it is thus determined that the parsing steps are finished and no further labels are generated.

S716, outputting the output characters determined at each step and splicing them to generate the video label.

Finally, the output word '爱' of the first step and the output word '情' of the second step are spliced, and the label of the target video titled '我爱你' ('I love you') is obtained as '爱情' ('love').

In the above embodiments of the present invention, a target video to be identified is acquired; a target candidate character library of the target video is constructed using the video description text, where the target candidate character library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting multiple historical label texts, and the description character set includes description characters obtained by segmenting the video description text; once the video features of the target video and the description features of the video description text have been extracted, N iterative calculations are performed on the video features and the description features to obtain M candidate character sequences; target characters are determined from each candidate character sequence based on the confidences, giving M target characters; and the M target characters are spliced into the target video label matching the target video. That is, in the embodiments of this application, by iteratively computing over the video features and the description text features, target characters are output character by character from the candidate library composed of the reference character set and the description character set built from the video description text, and the output target characters then form the target video label. Since the target candidate character library used for outputting characters incorporates the description character set, which includes fresh text content related to the video, fresh characters can be output in combination with the description character set and fresh labels spliced from the output fresh characters, improving the accuracy of the output video labels and solving the technical problem that video labels obtained by existing video label determination methods have relatively low accuracy.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

According to another aspect of the embodiments of the present invention, an apparatus for generating a video tag, which implements the above method for generating a video tag, is also provided. As shown in FIG. 8, the apparatus includes:

an acquisition unit 802, configured to acquire a target video to be identified, where the target video carries a video description text describing the target video;

a construction unit 804, configured to construct a target candidate word library of the target video using the video description text, where the target candidate word library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting a plurality of historical tag texts, and the description character set includes description characters obtained by segmenting the video description text;

a calculation unit 806, configured to perform N iterative computations on the video features and the description features after the video features of the target video and the description features of the video description text are extracted, to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;

a determination unit 808, configured to determine a target character from each candidate character sequence based on the confidence, to obtain M target characters; and

a splicing unit 810, configured to splice the M target characters into a target video tag matching the target video.
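How the five units of FIG. 8 might be wired together is sketched below; the class, the injected callables and their signatures are invented for illustration, with the unit internals elided:

    class VideoTagGenerator:
        # Schematic mirror of units 802-810 in FIG. 8 (not the claimed code).
        def __init__(self, acquire, construct, calculate, determine, splice):
            self.acquire = acquire      # unit 802: video + description text
            self.construct = construct  # unit 804: candidate word library
            self.calculate = calculate  # unit 806: N iterations -> M sequences
            self.determine = determine  # unit 808: target char per sequence
            self.splice = splice        # unit 810: join chars into the tag

        def run(self, video_id):
            video, description = self.acquire(video_id)
            library = self.construct(description)
            sequences = self.calculate(video, description, library)
            chars = self.determine(sequences)
            return self.splice(chars)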

Optionally, in this embodiment, for the implementation of each of the above unit modules, reference may be made to the corresponding method embodiments above; details are not repeated here.

According to yet another aspect of the embodiments of the present invention, an electronic device for implementing the above method for generating a video tag is also provided. The electronic device may be the terminal device or the server shown in FIG. 9; this embodiment takes a terminal device as an example. As shown in FIG. 9, the electronic device includes a memory 902 and a processor 904. A computer program is stored in the memory 902, and the processor 904 is configured to execute the steps in any of the above method embodiments through the computer program.

Optionally, in this embodiment, the above electronic device may be located in at least one of multiple network devices of a computer network.

Optionally, in this embodiment, the above processor may be configured to execute the following steps through the computer program:

S1: acquire a target video to be identified, where the target video carries a video description text describing the target video;

S2: construct a target candidate word library of the target video using the video description text, where the target candidate word library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting a plurality of historical tag texts, and the description character set includes description characters obtained by segmenting the video description text;

S3: after the video features of the target video and the description features of the video description text are extracted, perform N iterative computations on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;

S4: determine a target character from each candidate character sequence based on the confidence, to obtain M target characters; and

S5: splice the M target characters into a target video tag matching the target video.
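Steps S4 and S5 reduce to an argmax over each candidate character sequence followed by splicing. A minimal sketch, assuming each sequence is given as (character, confidence) pairs:

    def select_and_splice(candidate_sequences):
        # S4: highest-confidence character from each of the M sequences.
        target_chars = [max(seq, key=lambda pair: pair[1])[0]
                        for seq in candidate_sequences]
        # S5: splice the M target characters into the video tag.
        return "".join(target_chars)

    tag = select_and_splice([
        [("爱", 0.92), ("搞", 0.05), ("我", 0.03)],
        [("情", 0.88), ("笑", 0.07), ("你", 0.05)],
    ])  # -> "爱情"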

Optionally, those of ordinary skill in the art can understand that the structure shown in FIG. 9 is only illustrative. The electronic device may also be a terminal device such as a vehicle-mounted terminal, a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 9 does not limit the structure of the above electronic device. For example, the electronic device may further include more or fewer components than those shown in FIG. 9 (such as a network interface), or have a configuration different from that shown in FIG. 9.

The memory 902 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for generating a video tag in the embodiments of the present invention. The processor 904 runs the software programs and modules stored in the memory 902, thereby executing various functional applications and data processing, that is, implementing the above method for generating a video tag. The memory 902 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory. In some examples, the memory 902 may further include memories located remotely from the processor 904, and these remote memories may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. The memory 902 may specifically be, but is not limited to being, used to store information such as the elements in the viewing-angle picture and the generation information of the video tag. As an example, as shown in FIG. 9, the memory 902 may include, but is not limited to, the acquisition unit 802, the construction unit 804, the calculation unit 806, the determination unit 808, and the splicing unit 810 of the above apparatus for generating a video tag. In addition, it may also include, but is not limited to, other module units of the above apparatus for generating a video tag, which are not repeated in this example.

Optionally, the above transmission apparatus 906 is configured to receive or send data via a network. Specific examples of the above network may include wired networks and wireless networks. In one example, the transmission apparatus 906 includes a Network Interface Controller (NIC), which may be connected to other network devices and routers through a network cable so as to communicate with the Internet or a local area network. In another example, the transmission apparatus 906 is a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.

In addition, the above electronic device further includes a display 908 and a connection bus 910 for connecting the module components of the above electronic device.

In other embodiments, the above terminal device or server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the multiple nodes through network communication. The nodes may form a peer-to-peer (P2P) network, and any form of computing device, such as a server, a terminal, or another electronic device, may become a node in the blockchain system by joining the peer-to-peer network.

According to one aspect of the present application, a computer program product is provided. The computer program product includes a computer program/instructions, and the computer program/instructions contain program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication part, and/or installed from a removable medium. When the computer program is executed by a central processing unit, the various functions provided by the embodiments of the present application are executed.

The serial numbers of the above embodiments of the present invention are for description only, and do not represent the advantages or disadvantages of the embodiments.

According to one aspect of the present application, a computer-readable storage medium is provided. A processor of a computer device reads computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above method for generating a video tag.

Optionally, in this embodiment, the above computer-readable storage medium may be configured to store a computer program for executing the following steps:

S1: acquire a target video to be identified, where the target video carries a video description text describing the target video;

S2: construct a target candidate word library of the target video using the video description text, where the target candidate word library includes a reference character set and a description character set, the reference character set includes reference characters obtained by segmenting a plurality of historical tag texts, and the description character set includes description characters obtained by segmenting the video description text;

S3: after the video features of the target video and the description features of the video description text are extracted, perform N iterative computations on the video features and the description features to obtain M candidate character sequences, where the candidate character objects in each candidate character sequence include each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence indicates the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;

S4: determine a target character from each candidate character sequence based on the confidence, to obtain M target characters; and

S5: splice the M target characters into a target video tag matching the target video.

Optionally, in this embodiment, those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments may be completed by instructing hardware related to the terminal device through a program. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

If the integrated units in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the above methods of the various embodiments of the present invention.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The apparatus embodiments described above are only illustrative. For example, the division of the above units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.

The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware, or in the form of software functional units.

The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A method for generating a video tag, comprising:
acquiring a target video to be identified, wherein the target video carries a video description text for describing the target video;
constructing a target candidate word library of the target video by using the video description text, wherein the target candidate word library comprises a reference character set and a description character set, the reference character set comprises reference characters obtained by segmenting a plurality of historical label texts, and the description character set comprises description characters obtained by segmenting the video description text;
under the condition that the video features of the target video and the description features of the video description text are extracted, performing N iterative computations on the video features and the description features to obtain M candidate character sequences, wherein the candidate character objects in each candidate character sequence comprise each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence is used for indicating the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;
determining target characters from each candidate character sequence based on the confidence degrees to obtain M target characters;
and splicing the M target characters into target video labels matched with the target videos.
2. The method of claim 1, wherein the performing N iterative computations on the video feature and the description feature to obtain M candidate character sequences comprises:
in the process of performing ith iterative computation on the video features and the description features, an ith intermediate hidden vector set is computed in a vector conversion network based on the visual features in the video features, the description features and iterative reference vectors obtained by the (i-1) th iterative computation, wherein the ith intermediate hidden vector set comprises an ith visual hidden vector matched with the visual features, an ith description hidden vector matched with the description features and iterative reference vectors obtained by the ith iterative computation, and i is a natural number which is greater than or equal to 1 and less than or equal to N;
determining, based on the iterative reference vector obtained by the ith iterative computation, the confidence with which each character in the target candidate word library matches, and determining a jth candidate character sequence according to the confidence, wherein j is a natural number greater than or equal to 0 and less than or equal to i;
executing the (i + 1) th iterative computation under the condition that the iterative reference vector obtained by the (i) th iterative computation does not reach the end condition;
and under the condition that the iteration reference vector obtained by the ith iteration calculation reaches an end condition, determining the iteration reference vector obtained by the previous i iteration calculations as N iteration reference vectors.
3. The method of claim 2, wherein the determining the confidence of each match of each character in the target candidate word library based on the iterative reference vector calculated by the ith iteration comprises:
and in a label generation network connected with the vector conversion network, determining a first confidence coefficient of each matching of each reference character in the reference character set and a second confidence coefficient of each matching of each description character in the description character set based on the iterative reference vector obtained by the ith iterative calculation.
4. The method of claim 3, wherein the determining a first confidence that each reference character in the reference character set matches and a second confidence that each description character in the description character set matches based on the iterative reference vector calculated in the ith iteration comprises:
determining a first weight coefficient and a first bias parameter which are respectively matched with each reference character in the label generation network, a second weight coefficient and a second bias parameter which are respectively matched with each description character, and a reference weight and a reference bias parameter;
determining the first confidence of each reference character based on the iterative reference vector, the first weight coefficient and the first bias parameter obtained by the ith iterative calculation;
obtaining a first intermediate result obtained by calculation based on the ith description hidden vector, the second weight coefficient and the second bias parameter, and a second intermediate result obtained by calculation based on the iteration reference vector obtained by the ith iteration calculation, the reference weight and the reference bias parameter;
determining the second confidence of each of the description characters based on the first intermediate result and the second intermediate result (one possible formulation of these confidences is sketched after the claims).
5. The method according to claim 3, wherein the determining a confidence level of each matching of each character in the target candidate word library based on the iterative reference vector calculated by the ith iteration, and determining the jth candidate character sequence according to the confidence level comprises:
and performing a weighted summation on the obtained plurality of first confidences and the plurality of second confidences through a fully connected layer in the label generation network, to obtain the confidence with which each character in the target candidate word library matches, so as to generate the jth candidate character sequence.
6. The method of claim 3, further comprising, prior to obtaining the target video to be identified:
acquiring a sample video and a sample label matched with the sample video, wherein the sample video carries a sample video description text;
constructing a sample candidate word library of the sample video by using the sample video description text, wherein the sample candidate word library comprises a reference character set and a sample description character set, the reference character set comprises reference characters obtained by segmenting a plurality of historical labels, and the sample description character set comprises description characters obtained by segmenting the sample video description text;
and training the initialized label generation network by using the sample video, the sample label and the sample candidate word stock until a training convergence condition is reached.
7. The method of claim 2, wherein in the process of performing the ith iterative computation on the video feature and the description feature, computing an ith intermediate hidden vector set in a vector conversion network based on the visual feature in the video feature, the description feature and an iterative reference vector computed in the (i-1) th iterative computation comprises:
converting the visual features and the description features into word embedding vectors corresponding to the target video under the condition that i is equal to 1; extracting a category embedding vector, a paragraph embedding vector and a position embedding vector corresponding to the target video, wherein the category embedding vector is used for indicating category information of the target video, the paragraph embedding vector comprises a first paragraph sub-vector obtained based on the visual features and a second paragraph sub-vector obtained based on the description features, and the position embedding vector is used for indicating the frame order of each video frame in the target video and the order of each text word in the video description text; and taking the word embedding vector, the category embedding vector, the paragraph embedding vector and the position embedding vector as input vectors of the vector conversion network to calculate the ith intermediate hidden vector set;
and under the condition that i is larger than 1, taking the word embedding vector, the category embedding vector, the paragraph embedding vector, the position embedding vector and an iteration reference vector obtained by the i-1 th iteration calculation as input vectors of the vector conversion network to calculate and obtain the i-th intermediate hidden vector set.
8. The method according to claim 2, wherein the calculating an ith intermediate hidden vector set in the vector conversion network based on the visual features in the video features, the description features, and the iterative reference vector obtained by the (i-1)th iterative computation comprises:
determining the ith visual hidden vector and the ith description hidden vector based on the visual features and the description features, wherein the ith visual hidden vector and the ith description hidden vector are used for indicating the context relationship of characters in the target video tag;
and determining the iteration reference vector obtained by the ith iteration calculation based on the visual feature, the description feature and the iteration reference vector obtained by the (i-1) th iteration calculation.
9. The method of claim 8, wherein the determining a target character from each of the candidate character sequences based on the confidence level comprises:
and according to the context relationship of the characters in the target video label, taking the target character determined in the jth candidate character sequence as the jth target character in the target video label.
10. The method of claim 9, further comprising, after said splicing the M target characters into a target video tag matching the target video:
and adjusting the sequence of the M target characters in the target video label based on the context semantic relationship to obtain the updated target video label.
11. An apparatus for generating a video tag, comprising:
an acquisition unit, configured to acquire a target video to be identified, wherein the target video carries a video description text for describing the target video;
a construction unit, configured to construct a target candidate word library of the target video by using the video description text, wherein the target candidate word library comprises a reference character set and a description character set, the reference character set comprises reference characters obtained by segmenting a plurality of historical tag texts, and the description character set comprises description characters obtained by segmenting the video description text;
a calculation unit, configured to perform N iterative computations on the video features and the description features to obtain M candidate character sequences under the condition that the video features of the target video and the description features of the video description text are extracted, wherein the candidate character objects in each candidate character sequence comprise each candidate character in the target candidate word library and a confidence matched with the candidate character, the confidence is used for indicating the degree of matching between the candidate character and the target video, and N and M are natural numbers greater than or equal to 1;
the determining unit is used for determining target characters from each candidate character sequence based on the confidence coefficient to obtain M target characters;
and the splicing unit is used for splicing the M target characters into target video labels matched with the target videos.
12. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 10.
13. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 10 by means of the computer program.
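Claims 4 and 5 read naturally as a pointer-generator style fusion of two confidence distributions. The LaTeX sketch below is only one plausible reading; the symbols W, b, h_t, H_desc and \lambda are introduced here for illustration and are not the patent's notation:

    % claim 4, first confidence: generation over the reference character set
    p_{\mathrm{ref}} = \operatorname{softmax}(W_1 h_t + b_1)
    % claim 4, second confidence: pointer over the description character set
    u = W_2 H_{\mathrm{desc}} + b_2, \qquad v = W_r h_t + b_r, \qquad
    p_{\mathrm{desc}} = \operatorname{softmax}(u^{\top} v)
    % claim 5: fused confidence over the whole candidate word library
    p = \lambda\, p_{\mathrm{ref}} + (1 - \lambda)\, p_{\mathrm{desc}}

Here h_t stands for the iterative reference vector of the ith iteration and H_desc for the ith description hidden vectors; \lambda plays the role of the fully connected layer's learned weighting.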
CN202210404088.8A 2022-04-18 2022-04-18 Video tag generation method and device, storage medium and electronic device Active CN115114479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404088.8A CN115114479B (en) 2022-04-18 2022-04-18 Video tag generation method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN115114479A true CN115114479A (en) 2022-09-27
CN115114479B CN115114479B (en) 2025-03-14

Family

ID=83324810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210404088.8A Active CN115114479B (en) 2022-04-18 2022-04-18 Video tag generation method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115114479B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970906A (en) * 2014-05-27 2014-08-06 百度在线网络技术(北京)有限公司 Method and device for establishing video tags and method and device for displaying video contents
US20200336802A1 (en) * 2019-04-16 2020-10-22 Adobe Inc. Generating tags for a digital video
US20210027018A1 (en) * 2019-07-22 2021-01-28 Advanced New Technologies Co., Ltd. Generating recommendation information
CN113590876A (en) * 2021-01-22 2021-11-02 腾讯科技(深圳)有限公司 Video label setting method and device, computer equipment and storage medium
US20210406553A1 (en) * 2019-08-29 2021-12-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for labelling information of video frame, device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE Chunhui: "Web Video Classification Algorithm Based on Description Text and Entity Tags", Journal of Hunan City University (Natural Science Edition), No. 03, 15 May 2018 (2018-05-15) *

Also Published As

Publication number Publication date
CN115114479B (en) 2025-03-14

Similar Documents

Publication Publication Date Title
CN108875074B (en) Answer selection method and device based on cross attention neural network and electronic equipment
CN106599226B (en) A content recommendation method and content recommendation system
CN117114063B (en) Methods for training large generative language models and for image processing tasks
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN111581510A (en) Shared content processing method and device, computer equipment and storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN113688951B (en) Video data processing method and device
CN115115914B (en) Information identification method, apparatus and computer readable storage medium
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
CN113705313A (en) Text recognition method, device, equipment and medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN110717038A (en) Object classification method and device
CN118692014B (en) Video tag identification method, device, equipment, medium and product
CN116958738A (en) Training method and device of picture recognition model, storage medium and electronic equipment
CN117390074A (en) A serialized recommendation method, device and storage medium based on long user behavior
CN118484517A (en) Text query method and device, electronic equipment and storage medium
CN117018632A (en) Game platform intelligent management method, system and storage medium
CN113704507A (en) Data processing method, computer device and readable storage medium
CN115129883B (en) Entity linking method and device, storage medium and electronic equipment
CN110866195B (en) Text description generation method and device, electronic equipment and storage medium
CN117874354A (en) Method and device for sorting search recommended words and electronic equipment
CN118075573A (en) Video title generation method and device, electronic equipment and storage medium
CN115114479B (en) Video tag generation method and device, storage medium and electronic device
CN112905884B (en) Method, apparatus, medium and program product for generating sequence annotation model
CN117033646A (en) Information query method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant