CN114372172A

CN114372172A - Method and device for generating video cover image, computer equipment and storage medium

Info

Publication number: CN114372172A
Application number: CN202210011031.1A
Authority: CN
Inventors: 刘刚
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-01-06
Filing date: 2022-01-06
Publication date: 2022-04-19
Anticipated expiration: 2042-01-06
Also published as: CN114372172B

Abstract

The application provides a method, a device, computer equipment and a storage medium for generating a video cover image, which can be applied to artificial intelligence, the traffic field or the map field and the like and are used for solving the problem of low accuracy of the generated video cover image. The method comprises the following steps: acquiring a video to be processed and an initial cover image of the video to be processed; generating a target cover image of the video to be processed based on the video to be processed and the initial cover image; wherein the target cover image is generated by adding a picture-in-picture image on the initial cover image, and the picture-in-picture image is matched with the content of the video to be processed. Therefore, the picture-in-picture image plays a role in emphasizing the content of the video to be processed on the basis of the initial cover image, and the accuracy of generating the video cover is improved.

Description

Method, device, computer device and storage medium for generating video cover image

技术领域technical field

本申请涉及计算机技术领域，尤其涉及一种生成视频封面图像的方法、装置、计算机设备及存储介质。The present application relates to the field of computer technology, and in particular, to a method, apparatus, computer device, and storage medium for generating a video cover image.

背景技术Background technique

随着自媒体科技的不断发展，越来越多的网络媒体平台可以为网络对象提供发布视频和观看视频的服务。网络媒体平台上，网络对象可以通过上传制作好的视频，达到发布视频的目的。发布的视频可以通过视频的标题和封面图像，提供视频预览，从而网络对象可以通过视频预览，点击进入感兴趣的视频，观看其他网络对象发布的视频。With the continuous development of self-media technology, more and more network media platforms can provide network objects with the service of publishing videos and watching videos. On the network media platform, the network object can achieve the purpose of publishing the video by uploading the produced video. The published video can provide a video preview through the title and cover image of the video, so that network objects can click to enter the video of interest through the video preview, and watch the videos released by other network objects.

视频的封面图像通常是从视频中选取的视频帧，然而，视频中的一个视频帧仅能够表征一个视频场景，无法准确地表征视频的主题，更无法突出视频帧表达的重点内容。从而，网络对象无法通过视频的封面图像，准确地点击进入感兴趣的视频，使得网络对象需要点击进入多个视频进行观看，才能够观看到感兴趣的视频。The cover image of a video is usually a video frame selected from the video. However, a video frame in the video can only represent a video scene, and cannot accurately represent the theme of the video, let alone highlight the key content expressed by the video frame. Therefore, the network object cannot accurately click to enter the video of interest through the cover image of the video, so that the network object needs to click to enter multiple videos for viewing before viewing the video of interest.

可见，相关技术下，生成的视频封面图像的准确性较低。It can be seen that under the related art, the accuracy of the generated video cover image is low.

发明内容SUMMARY OF THE INVENTION

本申请实施例提供了一种生成视频封面图像的方法、装置、计算机设备及存储介质，以解决生成视频封面图像的准确性较低的问题。Embodiments of the present application provide a method, apparatus, computer device, and storage medium for generating a video cover image, so as to solve the problem of low accuracy in generating a video cover image.

第一方面，提供一种生成视频封面图像的方法，包括：In a first aspect, a method for generating a video cover image is provided, including:

获取待处理视频，以及所述待处理视频的初始封面图像；Obtain the video to be processed, and the initial cover image of the video to be processed;

基于所述待处理视频和所述初始封面图像，生成所述待处理视频的目标封面图像；generating a target cover image of the video to be processed based on the video to be processed and the initial cover image;

其中，所述目标封面图像是在所述初始封面图像上，添加画中画图像生成的，所述画中画图像与所述待处理视频的内容相匹配。The target cover image is generated by adding a picture-in-picture image to the initial cover image, and the picture-in-picture image matches the content of the video to be processed.

第二方面，提供一种生成视频封面图像的装置，包括：In a second aspect, a device for generating a video cover image is provided, comprising:

获取模块：用于获取待处理视频，以及所述待处理视频的初始封面图像；Acquisition module: used to acquire the video to be processed and the initial cover image of the video to be processed;

处理模块：基于所述待处理视频和所述初始封面图像，生成所述待处理视频的目标封面图像；Processing module: based on the video to be processed and the initial cover image, generate a target cover image of the video to be processed;

可选的，所述处理模块具体用于：Optionally, the processing module is specifically used for:

基于信息提取策略，提取所述待处理视频包含的关键特征信息，其中，所述关键特征信息用于表征：所述待处理视频的主题，以及包含的关键对象；Based on the information extraction strategy, extract the key feature information contained in the video to be processed, wherein the key feature information is used to represent: the subject of the video to be processed, and the key objects contained;

获取与所述关键特征信息相匹配的各个候选素材，并从所述各个候选素材中选取出满足呈现条件的候选素材，作为目标素材；Acquire each candidate material that matches the key feature information, and select a candidate material that satisfies the presentation conditions from the various candidate materials as the target material;

将所述目标素材作为所述初始封面图像的画中画图像，合成所述目标素材与所述初始封面图像，生成所述目标封面图像。Using the target material as a picture-in-picture image of the initial cover image, synthesizing the target material and the initial cover image to generate the target cover image.

获取各个参考图像，其中，所述参考图像是从网络资源中收集的图像；obtaining each reference image, wherein the reference image is an image collected from a network resource;

提取所述待处理视频包含的各个待处理视频帧各自的视频帧特征，以及提取所述各个参考图像各自的图像特征；Extracting the respective video frame features of each to-be-processed video frame contained in the video to be processed, and extracting the respective image features of the respective reference images;

基于各个视频帧特征和各个图像特征，从所述各个待处理视频帧和所述各个参考图像中，确定与所述关键特征信息相匹配的各个候选素材。Based on each video frame feature and each image feature, from each of the video frames to be processed and each of the reference images, each candidate material matching the key feature information is determined.

可选的，所述关键特征信息包含：词特征、人脸特征和物体特征；Optionally, the key feature information includes: word features, face features, and object features;

则所述处理模块具体用于：Then the processing module is specifically used for:

分别确定所述词特征与各个视频帧特征之间，以及所述词特征与各个图像特征之间的图文相似度；Determine the graphic similarity between the word feature and each video frame feature, and between the word feature and each image feature, respectively;

分别确定所述人脸特征与各个视频帧特征之间，以及所述人脸特征与各个图像特征之间的人脸相似度；Respectively determine the facial similarity between the facial feature and each video frame feature, and between the facial feature and each image feature;

分别确定所述物体特征与各个视频帧特征之间，以及，所述物体特征与各个图像特征之间的物体相似度；respectively determining the object similarity between the object feature and each video frame feature, and between the object feature and each image feature;

基于获得的图文匹配度、各个人脸匹配度和各个物体匹配度，从所述各个待处理视频帧和所述各个参考图像中，确定与所述关键特征信息相匹配的各个候选素材。Based on the obtained image-text matching degree, each face matching degree and each object matching degree, each candidate material matching the key feature information is determined from each of the video frames to be processed and each of the reference images.

可选的，所述处理模块采用如下方法获得所述关键特征信息：Optionally, the processing module adopts the following method to obtain the key feature information:

获取所述待处理视频的发布信息和字幕文件，并提取所述发布信息和所述字幕文件包含的关键词的词特征；Obtain the release information and the subtitle file of the video to be processed, and extract the word features of the keywords contained in the release information and the subtitle file;

获取所述待处理视频中的关键视频帧，并提取所述关键视频帧包含的人脸区域的人脸特征，以及包含的物体区域的物体特征，其中，所述关键视频帧为所述待处理视频中，表征视频场景切换的视频帧；Acquiring key video frames in the video to be processed, and extracting the facial features of the face region included in the key video frames, and the object features of the object regions included, wherein the key video frames are the to-be-processed In the video, the video frame representing the video scene switching;

将获得的词特征、人脸特征和物体特征，作为所述关键特征信息。The obtained word features, face features and object features are used as the key feature information.

基于各个视频帧特征和各个图像特征，从所述各个待处理视频帧和所述各个参考图像中，确定与所述关键特征信息相匹配的各个候选图像；Based on each video frame feature and each image feature, from each of the video frames to be processed and each of the reference images, determine each candidate image that matches the key feature information;

分别确定所述各个候选图像，各自与所述关键特征信息相匹配的图像区域；Determining the respective candidate images, each of which matches the image area with the key feature information;

基于各个图像区域，分别对所述各个候选图像进行裁剪处理，获得各个候选素材。Based on the respective image regions, the respective candidate images are cropped to obtain respective candidate materials.

基于清晰度评估策略，对所述各个候选素材进行清晰度评估，确定所述各个候选素材各自的清晰度评估值；Based on the sharpness evaluation strategy, perform sharpness evaluation on the respective candidate materials, and determine the respective sharpness evaluation values of the respective candidate materials;

基于内容质量评估策略，对所述各个候选素材进行内容质量评估，确定所述各个候选素材各自的内容质量评估值；Based on the content quality evaluation strategy, perform content quality evaluation on each candidate material, and determine the respective content quality evaluation value of each candidate material;

获得各个清晰度评估值和各个内容质量评估值的加权和，并将加权和符合大于呈现阈值的候选素材，作为所述目标素材。A weighted sum of each clarity evaluation value and each content quality evaluation value is obtained, and the weighted sum conforms to the candidate material larger than the presentation threshold as the target material.

检测所述初始封面图像，确定所述初始封面图像包含的目标对象；Detecting the initial cover image, and determining the target object contained in the initial cover image;

基于所述目标对象在所述初始封面图像中的位置，将所述初始封面图像划分为目标区域和非目标区域；dividing the initial cover image into a target area and a non-target area based on the position of the target object in the initial cover image;

基于所述非目标区域的形状尺寸，调整所述目标素材的尺寸；Adjusting the size of the target material based on the shape and size of the non-target area;

将调整后的目标素材，覆盖在所述初始封面图像中所述非目标区域之上，生成所述待处理视频的目标封面图像。The adjusted target material is overlaid on the non-target area in the initial cover image to generate the target cover image of the video to be processed.

第三方面，提供一种计算机程序产品，包括计算机程序，该计算机程序被处理器执行时实现如第一方面所述的方法。In a third aspect, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the method according to the first aspect.

第四方面，提供一种计算机设备，包括：In a fourth aspect, a computer device is provided, comprising:

存储器，用于存储程序指令；memory for storing program instructions;

处理器，用于调用所述存储器中存储的程序指令，按照获得的程序指令执行如第一方面所述的方法。The processor is configured to call the program instructions stored in the memory, and execute the method according to the first aspect according to the obtained program instructions.

第五方面，提供一种计算机可读存储介质，所述存储介质存储有计算机可执行指令，所述计算机可执行指令用于使计算机执行如第一方面或所述的方法。In a fifth aspect, a computer-readable storage medium is provided, the storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to perform the method according to the first aspect or the description.

本申请实施例中，初始封面图像上添加的画中画图像，与待处理视频的内容相匹配，例如，与待处理视频的主题相关，或与待处理视频包含的关键对象相匹配，那么获得的目标封面图像，能够更加准确地表征待处理视频的内容，提高生成视频封面的准确性。In this embodiment of the present application, the picture-in-picture image added to the initial cover image matches the content of the video to be processed, for example, is related to the theme of the video to be processed, or matches the key objects contained in the video to be processed, then obtain The target cover image can more accurately represent the content of the video to be processed and improve the accuracy of generating video cover.

进一步的，画中画图像可以在初始封面图像的基础之上，起到强调待处理视频的内容的作用，例如，起到强调待处理视频的主题或包含的关键对象的作用，避免单独呈现初始封面图像时表征内容不明确的情况，进一步提高生成视频封面的准确性。Further, the picture-in-picture image can play a role in emphasizing the content of the video to be processed on the basis of the initial cover image, for example, it can play a role in emphasizing the subject of the video to be processed or the key objects contained in it, so as to avoid showing the initial image separately. When the cover image represents the unclear content, the accuracy of generating the video cover is further improved.

附图说明Description of drawings

图1a为相关技术中的生成视频封面图像的方法的一种原理示意图；1a is a schematic diagram of a method for generating a video cover image in the related art;

图1b为本申请实施例提供的生成视频封面图像的方法的一种原理示意图一；FIG. 1b is a schematic diagram 1 of a principle of a method for generating a video cover image provided by an embodiment of the application;

图1c为本申请实施例提供的生成视频封面图像的方法的一种应用场景；Fig. 1c is an application scenario of the method for generating a video cover image provided by an embodiment of the present application;

图2a为本申请实施例提供的生成视频封面图像的方法的一种流程示意图一；Fig. 2a is a schematic flowchart 1 of a method for generating a video cover image provided by an embodiment of the application;

图2b为本申请实施例提供的生成视频封面图像的方法的一种流程示意图二；2b is a second schematic flowchart of a method for generating a video cover image provided by an embodiment of the application;

图3为本申请实施例提供的生成视频封面图像的方法的一种原理示意图二；3 is a schematic diagram 2 of a principle of a method for generating a video cover image provided by an embodiment of the present application;

图4a为本申请实施例提供的生成视频封面图像的方法的一种原理示意图三；4a is a schematic diagram three of a principle of a method for generating a video cover image provided by an embodiment of the application;

图4b为本申请实施例提供的生成视频封面图像的方法的一种原理示意图四；FIG. 4b is a fourth schematic diagram of a principle of a method for generating a video cover image provided by an embodiment of the application;

图4c为本申请实施例提供的生成视频封面图像的方法的一种原理示意图五；Fig. 4c is a schematic diagram five of a principle of a method for generating a video cover image provided by an embodiment of the present application;

图5为本申请实施例提供的生成视频封面图像的方法的一种原理示意图六；FIG. 5 is a schematic sixth schematic diagram of a method for generating a video cover image provided by an embodiment of the present application;

图6a为本申请实施例提供的生成视频封面图像的方法的一种原理示意图七；6a is a seventh schematic diagram of a principle of a method for generating a video cover image provided by an embodiment of the present application;

图6b为本申请实施例提供的生成视频封面图像的方法的一种原理示意图八；FIG. 6b is a schematic eighth schematic diagram of a method for generating a video cover image provided by an embodiment of the present application;

图6c为本申请实施例提供的生成视频封面图像的方法的一种原理示意图九；6c is a schematic diagram 9 of a principle of a method for generating a video cover image provided by an embodiment of the present application;

图6d为本申请实施例提供的生成视频封面图像的方法的一种原理示意图十；FIG. 6d is a schematic schematic diagram ten of a method for generating a video cover image provided by an embodiment of the present application;

图6e为本申请实施例提供的生成视频封面图像的方法的一种原理示意图十一；FIG. 6e is an eleventh schematic schematic diagram of a principle of a method for generating a video cover image provided by an embodiment of the application;

图7为本申请实施例提供的生成视频封面图像的装置的一种结构示意图一；7 is a schematic structural diagram 1 of an apparatus for generating a video cover image provided by an embodiment of the present application;

图8为本申请实施例提供的生成视频封面图像的装置的一种结构示意图二。FIG. 8 is a second schematic structural diagram of an apparatus for generating a video cover image according to an embodiment of the present application.

具体实施方式Detailed ways

为了使本申请实施例的目的、技术方案和优点更加清楚，下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述。In order to make the purposes, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application.

下面对本申请实施例中的部分用语进行解释说明，以便于本领域技术人员理解。Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

(1)Feeds：(1) Feeds:

Feeds为消息来源，是web feed、news feed、syndicated feed的简写形式。Feeds是一种资料格式，网站透过它将最新资讯传播给用户，通常以时间轴方式排列。用户能够订阅网站的先决条件是，网站提供了消息来源。Feeds are news sources, which are short forms of web feeds, news feeds, and syndicated feeds. Feeds are a data format through which a website disseminates the latest information to users, usually arranged in a timeline. A prerequisite for a user to be able to subscribe to a website is that the website provides a source of news.

(2)短视频：(2) Short video:

短视频是一种互联网内容传播方式，一般是在互联网新媒体上传播的时长在5分钟以内的视频传播内容。随着移动终端普及和网络的提速，短平快的大流量传播内容逐渐获得各大平台、粉丝和资本的青睐。Short video is a kind of Internet content dissemination method, which is generally a video dissemination content within 5 minutes of dissemination on new Internet media. With the popularization of mobile terminals and the speeding up of the Internet, the short-term, fast-paced and large-flow dissemination of content has gradually gained the favor of major platforms, fans and capital.

本申请实施例涉及人工智能(ArtificialIntelligence，AI)领域，是基于机器学习(MachineLearning，ML)技术设计的，可以应用与云计算、智慧交通、辅助驾驶或地图等领域。The embodiments of the present application relate to the field of artificial intelligence (Artificial Intelligence, AI), are designed based on machine learning (Machine Learning, ML) technology, and can be applied to fields such as cloud computing, smart transportation, assisted driving, or maps.

人工智能是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它研究各种机器的设计原理与实现方法，企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器，使机器具有感知、推理和决策的功能。Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science, which studies the design principles and implementation methods of various machines, attempts to understand the essence of intelligence, and produces a new kind of artificial intelligence that can respond in a similar way to human intelligence. Intelligent machines enable machines to perceive, reason and make decisions.

人工智能是一门综合学科，涉及的领域广泛，既有硬件层面的技术，也有软件层面的技术。人工智能的基础技术一般包括传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作交互系统、机电一体化等技术。人工智能的软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术、机器学习/深度学习、自动驾驶、智慧交通等几大方向。随着人工智能的发展与进步，人工智能得以在多个领域中展开研究和应用，例如，常见的智能家居、智能客服、虚拟助理、智能音箱、智能营销、智能穿戴设备、无人驾驶、自动驾驶、无人机、机器人、智能医疗、车联网、自动驾驶、智慧交通等领域，相信随着未来技术的进一步发展，人工智能将在更多的领域中得到应用，发挥出越来越重要的价值。本申请实施例提供的方案，涉及人工智能的深度学习、增强现实等技术，具体通过如下实施例进一步说明。Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level technologies and software-level technologies. The basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operating interactive systems, and mechatronics. The software technologies of artificial intelligence mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, and intelligent transportation. With the development and progress of artificial intelligence, artificial intelligence can be researched and applied in many fields, such as common smart homes, smart customer service, virtual assistants, smart speakers, smart marketing, smart wearable devices, unmanned driving, automatic It is believed that with the further development of future technology, artificial intelligence will be applied in more fields and play an increasingly important role. value. The solutions provided by the embodiments of the present application relate to technologies such as deep learning of artificial intelligence and augmented reality, and are further described by the following embodiments.

机器学习是一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科，专门研究计算机通过模拟人类的学习行为，以获取新的知识或技能，重新组织已有的知识结构，使计算机不断改善自身的性能。Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. Reorganize the existing knowledge structure, so that the computer continuously improves its own performance.

机器学习是人工智能的核心，是使计算机具有智能的根本途径，其应用遍及人工智能的各个领域；而机器学习的核心则是深度学习，是实现机器学习的一种技术。机器学习通常包括深度学习、强化学习、迁移学习、归纳学习、人工神经网络、式教学习等技术，深度学习则包括卷积神经网络(Convolutional Neural Networks，CNN)、深度置信网络、递归神经网络、自动编码器、生成对抗网络等技术。Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence; the core of machine learning is deep learning, which is a technology for realizing machine learning. Machine learning usually includes deep learning, reinforcement learning, transfer learning, inductive learning, artificial neural network, teaching learning and other technologies, while deep learning includes Convolutional Neural Networks (CNN), deep belief networks, recurrent neural networks, Autoencoders, Generative Adversarial Networks, etc.

应当说明的是，本申请实施例中，涉及到用户画像、用户历史操作记录等相关的数据，当本申请以上实施例运用到具体产品或技术中时，需要获得用户许可或者同意，且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。It should be noted that, in the embodiments of this application, related data such as user portraits, user historical operation records, etc. are involved. When the above embodiments of this application are applied to specific products or technologies, it is necessary to obtain user permission or consent, and relevant data. The collection, use, and processing of data must comply with the relevant laws, regulations and standards of relevant countries and regions.

下面对本申请实施例提供的生成视频封面图像的方法的应用领域进行简单介绍。The following briefly introduces the application fields of the method for generating a video cover image provided by the embodiments of the present application.

随着自媒体科技的不断发展，越来越多的网络媒体平台可以为网络对象提供发布视频和观看视频的服务。内容生产的门槛降低，使得各种内容的发布量以指数级速度增长。内容来源来自各种内容创作机构，比如自媒体和机构的专业生产内容(Professionally-generated Content，PGC)，用户生产内容(User-generated Content，UGC)。With the continuous development of self-media technology, more and more network media platforms can provide network objects with the service of publishing videos and watching videos. The threshold for content production has been lowered, allowing the release of various content to grow exponentially. Content sources come from various content creation agencies, such as professionally-generated content (PGC) and user-generated content (UGC) from self-media and agencies.

网络媒体平台上，网络对象可以通过上传制作好的视频，达到发布视频的目的。发布的视频可以通过视频的标题和封面图像，提供视频预览，从而网络对象可以通过视频预览，点击进入感兴趣的视频，观看其他网络对象发布的视频。On the network media platform, the network object can achieve the purpose of publishing the video by uploading the produced video. The published video can provide a video preview through the title and cover image of the video, so that network objects can click to enter the video of interest through the video preview, and watch the videos released by other network objects.

网络对象观看内容最核心的因素是内容的标题，封面图像和作者等。请参考图1a，封面图像是视频给网络对象的第一印象，在消费视频内容时，封面图像的好坏很大程度上影响了网络对象的观看欲望。封面图像的好坏包含两个层面，一方面是封面图像的质量，如是否清晰，是否带有瑕疵等，另一方面是封面图像所传达的信息，内容是否有效，是否契合视频主题等。网络对象上传视频，从某种程度上是一种创作过程，通过视频传递内容，并且在视频的标题中，概括地描述出来。好的封面图像应该尽可能与标题及内容，或者视频中表达的画面相关，因为它通常最直接地反映了主题，比如某个场景，某个明星或者强调的某个点及这些点之间的相互关系。The most important factors in the viewing of content by web objects are the title, cover image and author of the content. Referring to Figure 1a, the cover image is the first impression of the video on the network object. When consuming video content, the quality of the cover image greatly affects the viewing desire of the network object. The quality of the cover image includes two aspects, one is the quality of the cover image, such as whether it is clear, whether it has flaws, etc., and the other is the information conveyed by the cover image, whether the content is effective, whether it fits the theme of the video, etc. Web objects upload video, to a certain extent, a creative process, delivering content through the video, and in the title of the video, it is generally described. A good cover image should be as relevant as possible to the title and content, or to the picture expressed in the video, because it usually most directly reflects the subject, such as a scene, a star, or a point highlighted and the gaps between those points. mutual relationship.

为了解决生成视频封面图像的准确性较低的问题，本申请提出一种生成视频封面图像的方法。该方法在获取待处理视频，以及待处理视频的初始封面图像之后，可以基于待处理视频和初始封面图像，生成待处理视频的目标封面图像。请参考图1b，目标封面图像是在初始封面图像上，添加画中画图像生成的，画中画图像与待处理视频的内容相匹配。In order to solve the problem of low accuracy in generating a video cover image, the present application proposes a method for generating a video cover image. After acquiring the video to be processed and the initial cover image of the video to be processed, the method can generate a target cover image of the video to be processed based on the video to be processed and the initial cover image. Referring to FIG. 1b, the target cover image is generated by adding a picture-in-picture image to the initial cover image, and the picture-in-picture image matches the content of the video to be processed.

下面对本申请提供的生成视频封面图像的方法的应用场景进行说明。The following describes the application scenarios of the method for generating a video cover image provided by the present application.

请参考图1c，为本申请提供的生成视频封面图像的方法的一种应用场景示意图。该应用场景中包括客户端101和服务端102。客户端101和服务端102之间可以通信，通信方式可以是采用有线通信技术进行通信，例如通过连接网线或串口线进行通信；也可以是采用无线通信技术进行通信，例如，通过蓝牙或无线保真(wireless fidelity，WIFI)等技术进行通信，具体不做限制。Please refer to FIG. 1 c , which is a schematic diagram of an application scenario of the method for generating a video cover image provided by the present application. The application scenario includes a client 101 and a server 102 . The client 101 and the server 102 can communicate, and the communication method can be by using wired communication technology, for example, by connecting a network cable or serial cable; or by using wireless communication technology, for example, by using Bluetooth or wireless security. Wireless fidelity (wireless fidelity, WIFI) and other technologies to communicate, there is no specific limitation.

客户端101泛指可以向服务端102提供待处理视频的设备，例如终端设备、终端设备可以访问的网页，或终端设备可以访问的第三方程序等。终端设备可以是智能交通设备、摄像头、手机、智能语音交互设备、智能家电、车载终端等。服务端102泛指可以对待处理视频进行处理的设备，例如，终端设备或服务器等。服务器包括但不限于云服务器、本地服务器或关联的第三方服务器等。客户端101和服务端102均可以采用云计算，以减少本地计算资源的占用；同样也可以采用云存储，以减少本地存储资源的占用。The client 101 generally refers to a device that can provide the video to be processed to the server 102 , such as a terminal device, a web page that the terminal device can access, or a third-party program that the terminal device can access. The terminal device may be an intelligent transportation device, a camera, a mobile phone, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The server 102 generally refers to a device that can process the video to be processed, such as a terminal device or a server. Servers include but are not limited to cloud servers, local servers or associated third-party servers, etc. Both the client 101 and the server 102 can use cloud computing to reduce the occupation of local computing resources; similarly, cloud storage can also be used to reduce the occupation of local storage resources.

作为一种实施例，客户端101和服务端102可以是同一个设备，具体不做限制。本申请实施例中，以客户端101和服务端102分别为不同的设备为例进行介绍。As an embodiment, the client 101 and the server 102 may be the same device, which is not specifically limited. In the embodiments of the present application, the client 101 and the server 102 are respectively different devices for description as an example.

下面基于图1c，以客户端101为客户端，服务端102为服务器为例，对本申请实施例提供的生成视频封面图像的方法进行具体介绍。Based on FIG. 1 c , the method for generating a video cover image provided by the embodiment of the present application will be described in detail by taking the client 101 as the client and the server 102 as the server as an example.

请参考图2a，为本申请实施例提供的生成视频封面图像的方法的一种流程示意图。Please refer to FIG. 2a , which is a schematic flowchart of a method for generating a video cover image provided by an embodiment of the present application.

S21，获取待处理视频，以及待处理视频的初始封面图像。S21: Acquire the video to be processed and an initial cover image of the video to be processed.

客户端可以获取待处理视频，以及待处理视频的初始封面图像。例如，客户端响应于目标对象针对显示界面的控制操作，加载待处理视频，以及待处理视频的初始封面图像。控制操作可以是上传视频的操作，也可以是拍摄视频操作，也可以是进入浏览视频缩略图的界面的操作等，具体不做限制。又例如，客户端可以接收其他设备发送的待处理视频，以及待处理视频的初始封面图像等。The client can obtain the video to be processed and the initial cover image of the video to be processed. For example, in response to the target object's control operation on the display interface, the client loads the video to be processed and the initial cover image of the video to be processed. The control operation may be an operation of uploading a video, an operation of shooting a video, or an operation of entering an interface for browsing video thumbnails, etc., which are not specifically limited. For another example, the client may receive the video to be processed sent by other devices, as well as the initial cover image of the video to be processed, and the like.

又例如，在获得待处理视频之后，客户端可以基于待处理视频，提取待处理视频的初始封面图像，初始封面图像可以是随机从待处理视频中抽取的视频帧，也可以是与下一关键视频帧之间包含的视频帧的数量最多的关键视频帧，也可以是响应于目标对象触发的选取操作时，从待处理视频中选取的指定视频帧等，具体不做限制。For another example, after obtaining the video to be processed, the client can extract the initial cover image of the video to be processed based on the video to be processed. The key video frame with the largest number of video frames included between the video frames may also be a designated video frame selected from the video to be processed in response to a selection operation triggered by the target object, etc., which is not specifically limited.

S22，基于待处理视频和初始封面图像，生成待处理视频的目标封面图像。S22, based on the video to be processed and the initial cover image, generate a target cover image of the video to be processed.

在获取待处理视频，以及待处理视频的初始封面图像之后，客户端可以基于待处理视频和初始封面图像，生成待处理视频的目标封面图像。目标封面图像可以是由客户端确定出与待处理视频的内容相匹配的画中画图像，并在初始封面图像上，添加画中画图像生成的。目标封面图像也可以是客户端向服务器发送待处理视频和初始封面图像，由服务器确定出与待处理视频的内容相匹配的画中画图像，并在初始封面图像上，添加画中画图像生成的，客户端接收服务器发送的目标封面图像之后，客户端获得目标封面图像等，具体不做限制。本申请实施例以服务器向客户端发送目标封面图像为例进行介绍。After acquiring the video to be processed and the initial cover image of the video to be processed, the client may generate a target cover image of the video to be processed based on the video to be processed and the initial cover image. The target cover image may be generated by the client determining a picture-in-picture image that matches the content of the video to be processed, and adding the picture-in-picture image to the initial cover image. The target cover image can also be the client sending the video to be processed and the initial cover image to the server, the server determines the picture-in-picture image that matches the content of the video to be processed, and adds the picture-in-picture image to the initial cover image to generate Yes, after the client receives the target cover image sent by the server, the client obtains the target cover image, etc. There is no specific limitation. The embodiments of the present application are introduced by taking the server sending the target cover image to the client as an example.

下面对服务器确定待处理视频的目标封面图像的过程进行具体介绍，客户端确定待处理视频的目标封面图像的过程类似，在此不再赘述，请参考图2b。The process of determining the target cover image of the video to be processed by the server will be described in detail below. The process of determining the target cover image of the video to be processed by the client is similar, which will not be repeated here. Please refer to FIG. 2b.

S201，基于信息提取策略，提取待处理视频包含的关键特征信息。S201, based on an information extraction strategy, extract key feature information contained in the video to be processed.

基于信息提取策略，提取待处理视频包含的关键特征信息，关键特征信息可以用于表征待处理视频的主题，也可以用于表征待处理视频包含的关键对象，也可以用于表征待处理视频的主题，以及待处理视频包含的关键对象。关键特征信息可以包含词特征、人脸特征和物体特征等。Based on the information extraction strategy, the key feature information contained in the video to be processed is extracted. The key feature information can be used to characterize the subject of the video to be processed, the key objects contained in the video to be processed, and the characteristics of the video to be processed. The subject, and the key objects that the video to process contains. The key feature information can include word features, face features, and object features.

提取待处理视频包含的关键特征信息的方法有多种，下面以其中的三种为例进行介绍。There are many methods for extracting key feature information contained in the video to be processed, and three of them are used as examples to introduce below.

方法一：method one:

获取待处理视频的发布信息和字幕文件，并提取发布信息和字幕文件包含的关键词的词特征。Obtain the release information and subtitle files of the video to be processed, and extract the word features of the keywords contained in the release information and subtitle files.

在获得待处理视频之后，服务器可以获取待处理视频的发布信息和字幕文件。待处理视频的发布信息可以包括待处理视频的发布者、发布时间、标题、标签、参与的话题等。服务器可以根据待处理视频的标题、标签和参与的话题等，确定待处理视频的发布信息包含的关键词。服务器在获得待处理视频的字幕文件之后，可以确定待处理视频的字幕文件包含的关键词。从而，关键词可以在一定程度上表征待处理视频的主题，以及包含的关键对象。After obtaining the video to be processed, the server may obtain release information and subtitle files of the video to be processed. The publishing information of the to-be-processed video may include the publisher of the to-be-processed video, the release time, the title, the tag, the topic of participation, and the like. The server may determine the keywords contained in the release information of the video to be processed according to the title, tag, and topic of participation of the video to be processed. After obtaining the subtitle file of the video to be processed, the server may determine the keywords contained in the subtitle file of the video to be processed. Thus, keywords can characterize the subject of the video to be processed and the key objects contained in it to a certain extent.

关键词可以是标题中的实体，也可以是字幕文件中，包含的数量最多的实体、句子，也可以是存在弹幕数量最多的待处理视频帧对应的字幕，也可以是标题、标签和参与的话题等包含的人名、动物或知识产权IP，也可以是字幕文件中数量最多的人名、动物或知识产权IP等。The keyword can be the entity in the title, or the entity or sentence with the largest number in the subtitle file, or the subtitle corresponding to the video frame to be processed with the largest number of barrages, or the title, tag and participation. The names of people, animals, or intellectual property IPs contained in the topic, etc., can also be the names, animals, or intellectual property IPs with the largest number of people in the subtitle file.

知识产权IP可以是电视剧名称、电影名称或动漫名称等。服务器在获得待处理视频的发布信息包含的关键词之后，可以提取关键词的词特征，词特征通过量化关键词的特征，唯一的表征关键词。The intellectual property IP can be the name of a TV series, a movie or an anime, etc. After obtaining the keywords contained in the release information of the video to be processed, the server may extract the word features of the keywords, and the word features uniquely characterize the keywords by quantifying the features of the keywords.

服务器在获得关键词的词特征之后，可以将词特征作为待处理视频包含的关键特征信息。After obtaining the word feature of the keyword, the server may use the word feature as the key feature information contained in the video to be processed.

方法二：Method Two:

获取待处理视频中的关键视频帧，并提取关键视频帧包含的人脸区域的人脸特征。Acquire key video frames in the video to be processed, and extract the face features of the face region contained in the key video frames.

在获得待处理视频之后，可以对待处理视频进行抽帧处理，获取待处理视频中的关键视频帧，关键视频帧为待处理视频中，表征视频场景切换的视频帧。在获得关键视频帧之后，服务器可以对关键视频帧进行人脸检测处理，确定关键视频帧是否包含的人脸区域。关键视频帧的数量可以是一个，也可以是多个，根据包含人脸区域的关键视频帧，可以获得人脸区域的人脸特征，例如人脸embedding。After the to-be-processed video is obtained, frame extraction can be performed on the to-be-processed video to obtain key video frames in the to-be-processed video, where the key video frames are video frames in the to-be-processed video that represent video scene switching. After obtaining the key video frame, the server may perform face detection processing on the key video frame to determine whether the key video frame contains a face area. The number of key video frames can be one or more. According to the key video frames containing the face region, the face features of the face region, such as face embedding, can be obtained.

服务器在获得关键视频帧包含的人脸区域的人脸特征之后，可以将人脸特征作为待处理视频包含的关键特征信息。After obtaining the face feature of the face region included in the key video frame, the server may use the face feature as the key feature information included in the video to be processed.

方法三：Method three:

获取待处理视频中的关键视频帧，并提取关键视频帧包含的物体区域的物体特征。Acquire key video frames in the video to be processed, and extract object features of object regions contained in the key video frames.

在根据方法二中的介绍，获得关键视频帧之后，服务器可以对关键视频帧进行物体检测处理，确定关键视频帧是否包含的物体区域，物体区域为关键视频帧中包含物体的区域，例如包含动态物体的区域，或包含静态物体的区域。关键视频帧的数量可以是一个，也可以是多个，根据包含物体区域的关键视频帧，可以获得物体区域的物体特征。According to the introduction in Method 2, after obtaining the key video frame, the server can perform object detection processing on the key video frame to determine whether the key video frame contains an object area, and the object area is the area containing objects in the key video frame, such as dynamic Areas of objects, or areas containing static objects. The number of key video frames may be one or multiple, and according to the key video frames containing the object area, the object features of the object area can be obtained.

服务器在获得关键视频帧包含的物体区域的物体特征之后，可以将物体特征作为待处理视频包含的关键特征信息。After obtaining the object feature of the object region included in the key video frame, the server may use the object feature as the key feature information included in the video to be processed.

作为一种实施例，服务器可以仅采用上述方法中的一种方法，将获得的词特征、人脸特征或物体特征，作为关键特征信息。服务器还可以结合使用上述方法，例如，将词特征和人脸特征，作为关键特征信息；又例如，将人脸特征和物体特征，作为关键特征信息；又例如，将词特征和物体特征，作为关键特征信息；又例如，将词特征、人脸特征和物体特征，作为关键特征信息。As an embodiment, the server may use only one of the above methods, and use the obtained word feature, face feature or object feature as key feature information. The server may also use the above methods in combination, for example, taking word features and face features as key feature information; another example, taking face features and object features as key feature information; another example, taking word features and object features as key feature information Key feature information; for another example, word features, face features, and object features are used as key feature information.

S202，获取与关键特征信息相匹配的各个候选素材，并从各个候选素材中选取出满足呈现条件的候选素材，作为目标素材。S202: Obtain each candidate material that matches the key feature information, and select a candidate material that satisfies the presentation condition from each candidate material as a target material.

在获得关键特征信息之后，服务器可以获取与关键特征信息相匹配的各个候选素材。服务器可以从待处理视频包含的各个待处理视频帧中，确定出与关键特征信息相匹配的各个候选素材；也可以从网络资源中收集的各个参考图像中，确定出与关键特征信息相匹配的各个候选素材；也可以从各个待处理视频帧和各个参考图像中，确定出与关键特征信息相匹配的各个候选素材等，具体不做限制。After obtaining the key feature information, the server may obtain each candidate material matching the key feature information. The server can determine each candidate material that matches the key feature information from each to-be-processed video frame contained in the video to be processed; it can also determine each reference image that matches the key feature information from each reference image collected from network resources. Each candidate material; each candidate material that matches the key feature information may also be determined from each to-be-processed video frame and each reference image, which is not specifically limited.

下面以服务器从各个视频帧和各个参考图像中，确定出与关键特征信息相匹配的各个候选素材为例进行介绍。The following description will be given by taking as an example that the server determines each candidate material matching the key feature information from each video frame and each reference image.

服务器可以实时从获取网络资源中收集各个参考图像，也可以以预设时长为周期收集各个参考图像，也可以接收其他设备发送的网络资源中的各个参考图像等，具体不做限制。参考图像可以是需要采购版权内容的图像，服务器通过采购版权内容，获得参考图像。The server can collect each reference image from the acquired network resources in real time, or collect each reference image in a preset period of time, and can also receive each reference image from the network resources sent by other devices, etc., which is not specifically limited. The reference image may be an image for which copyright content needs to be purchased, and the server obtains the reference image by purchasing the copyright content.

服务器可以提取待处理视频包含的各个待处理视频帧各自的视频帧特征，以及提取各个参考图像各自的图像特征。提取视频帧特征和图像特征的方式可以有多种，例如，采用CLIP模型提取；又例如，采用VGG16、Inception系列模型、ResNet等经典模型提取；又例如，采用one-stage的人脸检测网络Retinaface，结合标注任务当中经常出现的亚洲人脸和任务场景当中明星人脸训练得到的Arcface，以及Resnet101模型提取等。The server may extract respective video frame features of each to-be-processed video frame contained in the to-be-processed video, and extract respective image features of each reference image. There are many ways to extract video frame features and image features, for example, using CLIP model extraction; another example, using VGG16, Inception series models, ResNet and other classic models to extract; another example, using one-stage face detection network Retinaface , combined with the Asian face that often appears in the labeling task and the Arcface trained on the star face in the task scene, as well as the Resnet101 model extraction.

在获得各个视频帧特征和各个图像特征之后，服务器可以基于各个视频帧特征和各个图像特征，从各个待处理视频帧和各个参考图像中，确定与关键特征信息相匹配的各个候选素材。服务器可以基于视频帧特征与关键特征信息之间的相似度大于相似度阈值的待处理视频帧，以及基于图像特征与关键特征信息之间的相似度大于相似度阈值的待处理视频帧，确定与关键特征信息相匹配的各个候选素材。After obtaining each video frame feature and each image feature, the server may determine each candidate material matching the key feature information from each video frame to be processed and each reference image based on each video frame feature and each image feature. The server may, based on the to-be-processed video frames whose similarity between the video frame features and the key feature information is greater than the similarity threshold, and the to-be-processed video frames whose similarity between the image features and the key feature information is greater than the similarity threshold, determine the Each candidate material that matches the key feature information.

服务器还可以基于视频帧特征与关键特征信息之间的相似度，按照相似度由大到小的顺序，排列每个待处理视频帧，以及基于图像特征与关键特征信息之间的相似度，按照相似度由大到小的顺序，排列每个参考图像。服务器基于排列序号排在指定序号之前的待处理视频帧和参考图像，确定与关键特征信息相匹配的各个候选素材。The server can also arrange each video frame to be processed based on the similarity between the video frame features and the key feature information in descending order of similarity, and based on the similarity between the image features and the key feature information, according to Arrange each reference image in descending order of similarity. The server determines each candidate material matching the key feature information based on the video frames to be processed and the reference images whose sequence numbers are arranged before the specified sequence numbers.

例如，以关键特征信息包含词特征、人脸特征和物体特征为例，服务器可以分别确定词特征与各个视频帧特征之间图文相似度，以及词特征与各个图像特征之间的图文相似度。图文相似度可以采用CLIP模型进行计算，实现从各个待处理视频帧和各个参考图像中，检索出与词特征相匹配的候选素材。For example, taking the key feature information including word features, face features, and object features as an example, the server can determine the graphic similarity between the word feature and each video frame feature, as well as the graphic similarity between the word feature and each image feature. Spend. The similarity between pictures and texts can be calculated by using the CLIP model, so as to retrieve candidate materials that match the word features from each video frame to be processed and each reference image.

服务器还可以分别确定人脸特征与各个视频帧特征之间的人脸相似度，以及人脸特征与各个图像特征之间的人脸相似度，由于人脸之间的差异性相对较小，可以通过多个模型结合来确定人脸相似度，提高检索精度，例如采用one-stage的人脸检测网络Retinaface，结合标注任务当中经常出现的亚洲人脸和任务场景当中明星人脸训练得到的Arcface，以及Resnet101模型确定人脸相似度，实现从各个待处理视频帧和各个参考图像中，检索出与人脸特征相匹配的候选素材。The server can also determine the face similarity between the face feature and each video frame feature, as well as the face similarity between the face feature and each image feature. Since the difference between the faces is relatively small, it can be The similarity of faces is determined by the combination of multiple models, and the retrieval accuracy is improved. For example, the one-stage face detection network Retinaface is used, and Arcface is trained by combining Asian faces that often appear in labeling tasks and star faces in task scenes. And the Resnet101 model determines the similarity of the face, and realizes the retrieval of candidate materials that match the face features from each video frame to be processed and each reference image.

服务器还可以分别确定物体特征与各个视频帧特征之间的物体相似度，以及，物体特征与各个图像特征之间的物体相似度。由于物体之间的差异性相对较大，因此可以采用传统的神经网络模型确定物体相似度，实现从各个待处理视频帧和各个参考图像中，检索出与物体特征相匹配的候选素材。请参考图3，服务器可以将物体特征作为查询条件，通过神经网络模型，确定物体特征与各个视频帧特征之间的物体相似度，以及，物体特征与各个图像特征之间的物体相似度，相似度大于阈值则输出为1，相似度小于阈值则输出为0，得到包含0或1的输出向量。输出向量中包含多个元素位置，多个元素位置分别对应各个待处理视频帧和各个参考图像，元素位置上的元素为1，表示对应的待处理视频帧的视频帧特征或参考图像的图像特征，与物体特征之间的相似度大于阈值；元素位置上的元素为0，表示对应的待处理视频帧的视频帧特征或参考图像的图像特征，与物体特征之间的相似度不大于阈值。从而，可以基于输出向量中1对应的待处理视频帧或参考图像，获得候选素材。The server may further determine the object similarity between the object feature and each video frame feature, and the object similarity between the object feature and each image feature. Because the differences between objects are relatively large, the traditional neural network model can be used to determine the similarity of objects, so as to retrieve candidate materials that match the characteristics of objects from each video frame to be processed and each reference image. Please refer to Figure 3, the server can use the object feature as a query condition, through the neural network model, determine the object similarity between the object feature and each video frame feature, and the object similarity between the object feature and each image feature, the similarity If the degree is greater than the threshold, the output is 1, and if the similarity is less than the threshold, the output is 0, and an output vector containing 0 or 1 is obtained. The output vector contains multiple element positions. The multiple element positions correspond to each video frame to be processed and each reference image. The element in the element position is 1, which represents the video frame feature of the corresponding video frame to be processed or the image feature of the reference image. , the similarity with the object feature is greater than the threshold; the element at the element position is 0, indicating that the video frame feature of the corresponding video frame to be processed or the image feature of the reference image, and the similarity between the object feature is not greater than the threshold. Therefore, the candidate material can be obtained based on the to-be-processed video frame or reference image corresponding to 1 in the output vector.

服务器在获得各个图文相似度、各个人脸相似度和各个物体相似度之后，可以基于获得的各个图文相似度、各个人脸相似度和各个物体相似度，从各个待处理视频帧和各个参考图像中，确定与关键特征信息相匹配的各个候选素材；也可以基于相应图文相似度、人脸相似度和物体相似度的加权和，从各个待处理视频帧和各个参考图像中，确定与关键特征信息相匹配的各个候选素材。服务器可以将加权和最大的图像作为候选素材，也可以将加权和最大的几张图像均作为候选素材等。例如，采用针对聚类和相似性的Faiss库，为稠密向量提供高效相似度搜索和聚类，支持十亿级别向量的搜索，可以非常高效实现向量的检索和匹配。又例如，采用01向量进行相关性查询，相似记为1，不相似记为0，从而可以检索或匹配出与关键特征信息相匹配的各个候选素材。After the server obtains the similarity of each image and text, the similarity of each face, and the similarity of each object, it can select the similarity of each image and text, the similarity of each face and the similarity of each object based on the obtained similarity of each image and text, the similarity of each face and the similarity of each object. In the reference image, each candidate material that matches the key feature information is determined; it can also be determined from each video frame to be processed and each reference image based on the weighted sum of the corresponding graphic similarity, face similarity, and object similarity. Each candidate material that matches the key feature information. The server may use the image with the largest weighted sum as the candidate material, or may use several images with the largest weighted sum as the candidate material. For example, the Faiss library for clustering and similarity is used to provide efficient similarity search and clustering for dense vectors, support the search of billion-level vectors, and can very efficiently achieve vector retrieval and matching. For another example, the 01 vector is used to perform a correlation query, the similarity is marked as 1, and the dissimilarity is marked as 0, so that each candidate material matching the key feature information can be retrieved or matched.

作为一种实施例，为了减少各个特征占用的存储空间，可以对视频帧特征和图像特征进行降维处理，例如，视频帧特征和图像特征是向量形式，那么可以从浮点数向量转换为01Bit位的01向量，达到降维的目的，减少各个特征占用的存储空间。As an example, in order to reduce the storage space occupied by each feature, dimensionality reduction processing can be performed on the video frame features and image features. For example, if the video frame features and image features are in the form of vectors, they can be converted from floating point vectors to 01Bit bits. 01 vector to achieve the purpose of dimensionality reduction and reduce the storage space occupied by each feature.

作为一种实施例，服务器可以将各个待处理视频帧和各个参考图像中，与关键特征信息相匹配的候选图像，作为候选素材；也可以对与关键特征信息相匹配的候选图像进行裁剪，将裁剪后的候选图像，作为候选素材等。As an embodiment, the server may use candidate images matching the key feature information in each video frame to be processed and each reference image as candidate materials; it may also crop the candidate images matching the key feature information, and Cropped candidate images, as candidate materials, etc.

服务器在基于各个视频帧特征和各个图像特征，从各个待处理视频帧和各个参考图像中，确定出与关键特征信息相匹配的各个候选图像之后，可以分别确定各个候选图像，各自与关键特征信息相匹配的图像区域。服务器可以通过目标检测算法，采用矩形框确定出与关键特征信息相匹配的图像区域；服务器也可以通过边缘识别算法，对与关键特征信息相匹配的目标进行边缘识别，根据目标的边缘确定出与关键特征信息相匹配的图像区域。After the server determines each candidate image that matches the key feature information from each to-be-processed video frame and each reference image based on each video frame feature and each image feature, the server can determine each candidate image, each with the key feature information. matching image area. The server can use the target detection algorithm to determine the image area that matches the key feature information by using a rectangular frame; the server can also use the edge recognition algorithm to perform edge recognition on the target matching the key feature information, and determine the target according to the edge of the target. The key feature information is matched to the image area.

在确定出各个与关键特征信息相匹配的图像区域之后，服务器可以基于各个图像区域，分别对各个候选图像进行裁剪处理，获得各个候选素材。通过去除候选图像中不必要的元素，使得获得的候选素材能够更加精准的表达待处理视频的主题、待处理视频包含的关键对象。After each image area matching the key feature information is determined, the server may, based on each image area, perform cropping processing on each candidate image to obtain each candidate material. By removing unnecessary elements in the candidate image, the obtained candidate material can more accurately express the subject of the video to be processed and the key objects contained in the video to be processed.

在获得与关键特征信息相匹配的各个候选素材之后，服务器可以从各个候选素材中选取出满足呈现条件的候选素材，作为目标素材。从各个候选素材中选取出满足呈现条件的候选素材的方法有多种，例如，服务器可以基于清晰度评估策略，对各个候选素材进行清晰度评估，确定各个候选素材各自的清晰度评估值。服务器可以将清晰度评估值大于清晰度阈值的候选素材，作为目标素材；服务器也可以基于清晰度评估值，对各个候选素材进行排序，将排在指定序号之前的候选素材作为目标素材等。After obtaining each candidate material matching the key feature information, the server may select a candidate material that satisfies the presentation condition from the various candidate materials as the target material. There are various methods for selecting candidate materials that meet the presentation conditions from each candidate material. For example, the server may perform a definition evaluation on each candidate material based on a definition evaluation strategy, and determine the respective definition evaluation value of each candidate material. The server may use candidate materials whose definition evaluation value is greater than the definition threshold as the target material; the server may also sort each candidate material based on the definition evaluation value, and use the candidate material before the specified serial number as the target material, etc.

又例如，基于内容质量评估策略，对各个候选素材进行内容质量评估，确定各个候选素材各自的内容质量评估值。服务器可以将内容质量评估值大于内容质量阈值的候选素材，作为目标素材；服务器也可以基于内容质量评估值，对各个候选素材进行排序，将排在指定序号之前的候选素材作为目标素材等。对各个候选素材进行内容质量评估可以将涉及广告推销的内容、违法违规的内容等评估为较低的内容质量评估值，使得根据内容质量评估值确定出的目标素材中，不存在涉及广告推销的内容、违法违规的内容。For another example, based on the content quality evaluation strategy, the content quality evaluation is performed on each candidate material, and the respective content quality evaluation value of each candidate material is determined. The server may use the candidate materials whose content quality evaluation value is greater than the content quality threshold as the target material; the server may also sort each candidate material based on the content quality evaluation value, and use the candidate material before the specified serial number as the target material, etc. The content quality evaluation of each candidate material can evaluate the content related to advertising promotion, illegal content, etc. as a lower content quality evaluation value, so that among the target materials determined according to the content quality evaluation value, there is no content related to advertising promotion. Content, illegal content.

作为一种实施例，服务器可以在获得各个清晰度评估值和各个内容质量评估值之后，对相应的清晰度评估值和内容质量评估值进行加权求和，获得各个加权和。服务器在获得各个加权和之后，可以将加权和符合大于呈现阈值的候选素材，作为目标素材；服务器也可以基于加权和对各个候选素材进行排序，将排在指定序号之前的候选素材，作为目标素材。As an embodiment, after obtaining each definition evaluation value and each content quality evaluation value, the server may perform a weighted sum on the corresponding definition evaluation value and content quality evaluation value to obtain each weighted sum. After obtaining each weighted sum, the server can take the candidate material whose weighted sum is greater than the presentation threshold as the target material; the server can also sort each candidate material based on the weighted sum, and take the candidate material before the specified serial number as the target material .

S203，将目标素材作为初始封面图像的画中画图像，合成目标素材与初始封面图像，生成待处理视频的目标封面图像。S203, using the target material as a picture-in-picture image of the initial cover image, and synthesizing the target material and the initial cover image to generate a target cover image of the video to be processed.

服务器在获得目标素材之后，可以将目标素材作为初始封面图像的画中画图像，合成目标素材与初始封面图像，生成待处理视频的目标封面图像。请参考图4a，为待处理视频的初始封面图像。该初始封面图像仅包含一个女子正在说话的场景，无法准确地表征该待处理视频的看点等。网络对象根据初始封面图像，无法得知待处理视频的主题或其中包含的关键对象，很大可能不会点击进入该待处理视频进行观看，从而错过感兴趣的视频，也有可能网络对象点击进入该待处理视频进行观看，在观看视频后，才得知该待处理视频为不感兴趣的视频，从而网络对象需要点击进入多个视频进行观看，才能够观看到感兴趣的视频。After obtaining the target material, the server may use the target material as a picture-in-picture image of the initial cover image, synthesize the target material and the initial cover image, and generate the target cover image of the video to be processed. Please refer to FIG. 4a, which is the initial cover image of the video to be processed. The initial cover image only contains a scene where a woman is speaking, and cannot accurately represent the point of view of the video to be processed. According to the initial cover image, the network object cannot know the subject of the video to be processed or the key objects contained in it. It is very likely that the network object will not click to enter the pending video for viewing, thus missing the video of interest. It is also possible that the network object clicks to enter the video. Watch the video to be processed. After watching the video, it is known that the video to be processed is a video of no interest. Therefore, the network object needs to click to enter multiple videos for viewing before watching the video of interest.

在将目标素材作为初始封面图像的画中画图像，合成目标素材与初始封面图像，生成待处理视频的目标封面图像之后，请参考图4b，目标封面图像包含一个女子正在说话的场景，以及明星A说长得不像的目标素材，那么目标封面图像可以表征该待处理视频中，女子说了一件与她外貌不相符的事情，同时，还可以表征该视频中，有明星A的参与。从而，网络对象可以准确地获得该待处理视频所表达的主题，以及待处理视频中包含的关键对象，使得网络对象可以更加准确地通过目标封面图像，获取到感兴趣的视频进行观看。After taking the target material as the picture-in-picture image of the initial cover image, synthesizing the target material and the initial cover image, and generating the target cover image of the video to be processed, please refer to Figure 4b, the target cover image contains a scene where a woman is talking, and a star A said that the target material does not look alike, then the target cover image can represent that in the to-be-processed video, the woman said something that does not match her appearance, and at the same time, it can also represent the participation of star A in the video. Therefore, the network object can accurately obtain the subject expressed by the to-be-processed video and the key objects contained in the to-be-processed video, so that the network object can more accurately obtain the video of interest for viewing through the target cover image.

作为一种实施例，将目标素材作为初始封面图像的画中画图像，合成目标素材与初始封面图像的方法有多种，例如，将目标素材覆盖在初始封面图像中的指定位置之上，合成目标素材与初始封面图像；又例如，将目标素材覆盖在初始封面图像中不包含目标对象的位置之上，合成目标素材与初始封面图像；又例如，获取初始封面图像包含的目标对象，将目标素材和目标对象按照预存的封面模板进行重新排版，合成目标素材与初始封面图像等，具体不做限制。As an example, using the target material as the picture-in-picture image of the initial cover image, there are various methods for synthesizing the target material and the initial cover image. The target material and the initial cover image; another example, overlay the target material on the position that does not contain the target object in the initial cover image, and synthesize the target material and the initial cover image; another example, obtain the target object contained in the initial cover image, The material and target objects are rearranged according to the pre-stored cover template, and the target material and the initial cover image are synthesized, and there are no specific restrictions.

下面以将目标素材覆盖在初始封面图像中不包含目标对象的位置之上，合成目标素材与初始封面图像的过程为例进行介绍。The following is an example of the process of synthesizing the target material and the initial cover image by covering the target material on the position that does not contain the target object in the initial cover image.

服务器可以检测初始封面图像，确定初始封面图像包含的目标对象，例如服务器将初始封面图像作为已训练的目标检测模型的输入，获得目标检测模型输出的初始封面图像包含的目标对象。目标检测模型输出的初始封面图像包含的目标对象可以是以矩形框进行标注，也可以是以目标对象的边缘进行标注等，具体不做限制。The server may detect the initial cover image and determine the target object contained in the initial cover image. For example, the server uses the initial cover image as the input of the trained target detection model, and obtains the target object contained in the initial cover image output by the target detection model. The target object included in the initial cover image output by the target detection model may be marked with a rectangular frame, or marked with the edge of the target object, etc., which is not specifically limited.

服务器可以基于目标对象在初始封面图像中的位置，将初始封面图像划分为目标区域和非目标区域。目标区域为初始封面图像中包含目标对象的区域，可以是矩形区域，也可以是以目标对象的边缘围成的区域。非目标区域为初始封面图像中，除了目标区域以外的区域。The server may divide the initial cover image into a target area and a non-target area based on the position of the target object in the initial cover image. The target area is the area containing the target object in the initial cover image, which may be a rectangular area or an area enclosed by the edges of the target object. The non-target area is the area other than the target area in the original cover image.

服务器可以基于非目标区域的形状尺寸，调整目标素材的尺寸。服务器可以以目标素材占满非目标区域为目标，调整目标素材的尺寸，例如，非目标区域为矩形区域，目标素材的形状为圆形，那么可以以矩形区域的长边，作为圆形的直径，调整目标素材的尺寸。The server can adjust the size of the target material based on the shape and size of the non-target area. The server can adjust the size of the target material with the target material filling the non-target area. For example, if the non-target area is a rectangular area and the shape of the target material is a circle, the long side of the rectangular area can be used as the diameter of the circle. , adjust the size of the target material.

如果目标素材包含的内容，与目标区域包含的目标对象为相同的对象，那么服务器可以以相对于目标区域包含的目标对象放大指定倍数为目标，调整目标素材的尺寸。请参考图4c，目标素材为目标区域包含的一个目标对象的人脸区域，那么可以将目标素材放大为原来的2倍，以突出该目标对象的面部表情。由于该场景为一个搞笑场景，通过突出该目标对象的面部表情，可以达到增强喜剧效果的目的，提高代入感，使得目标封面图像能够准确地表征该待处理视频的搞笑主题。If the content contained in the target material is the same object as the target object contained in the target area, the server may adjust the size of the target material by aiming to enlarge the target object by a specified multiple relative to the target object contained in the target area. Referring to Fig. 4c, the target material is the face area of a target object contained in the target area, then the target material can be enlarged to 2 times to highlight the facial expression of the target object. Since the scene is a funny scene, by highlighting the facial expressions of the target object, the purpose of enhancing the comedy effect and the sense of substitution can be achieved, so that the target cover image can accurately represent the funny theme of the video to be processed.

服务器还可以先确定目标区域占初始封面图像的比例，再调整目标素材的尺寸，使得目标素材占初始封面图像的比例，与目标区域占初始封面图像的比例相同等。The server may also first determine the proportion of the target area to the initial cover image, and then adjust the size of the target material so that the proportion of the target material to the initial cover image is the same as the proportion of the target area to the initial cover image.

在调整目标素材的尺寸之后，获得调整后的目标素材。服务器可以将调整后的目标素材，覆盖在初始封面图像中非目标区域之上，生成待处理视频的目标封面图像。After adjusting the size of the target material, the adjusted target material is obtained. The server may overlay the adjusted target material on the non-target area in the initial cover image to generate the target cover image of the video to be processed.

作为一种实施例，服务器确定初始封面图像包含的目标对象之后，可以对初始封面图像进行裁剪处理，获得包含目标对象的目标对象素材。服务器可以结合预存的封面模板，以及获得的目标素材和目标对象素材，生成目标封面图像。在一些情况下，初始封面图像的背景较为复杂，容易造成获得的目标封面图像内容混乱的问题，因此，可以将初始封面图像中的目标对象提取出来，结合清晰简洁的封面模板，再将目标素材作为画中画图像，叠加封面模板之上，生成目标封面图像。As an embodiment, after determining the target object included in the initial cover image, the server may perform cropping processing on the initial cover image to obtain target object material including the target object. The server may generate a target cover image by combining the pre-stored cover template, and the obtained target material and target object material. In some cases, the background of the initial cover image is complex, which may easily cause confusion in the content of the obtained target cover image. Therefore, the target object in the initial cover image can be extracted, combined with a clear and concise cover template, and then the target material can be extracted. As a picture-in-picture image, superimposed on top of the cover template to generate the target cover image.

本申请实施例中，可以通过分布式的各个系统，调用各个服务，协同工作，实现本申请实施例提供的生成视频封面图像的方法，请参考图5。In the embodiments of the present application, various distributed systems can be used to invoke various services and work together to implement the method for generating a video cover image provided by the embodiments of the present application, please refer to FIG. 5 .

网络对象可以通过内容提供端上传待处理视频，进行视频的发布。内容提供端包括PGC、UGC、多频道网络(Multi-Channel Network，MCN)，或专业用户生产内容(Professional Generated Content+User Generated Content，PUGC)等内容生产形式。网络对象通过移动终端，或通过客户端调用后端应用程序编程接口(ApplicationProgramming Interface，API)系统，上传待处理视频，向服务器提供本地或者拍摄的视频内容或者撰写的自媒体文章或者图集等。网络对象可以在上传待处理视频的同时，上传对应的封面图像，也可以通过服务器在待处理视频中选取初始封面图像。The network object can upload the video to be processed through the content provider to publish the video. Content providers include PGC, UGC, Multi-Channel Network (MCN), or professional user-generated content (Professional Generated Content+User Generated Content, PUGC) and other content production forms. The network object invokes the back-end Application Programming Interface (API) system through the mobile terminal or the client, uploads the video to be processed, and provides the server with local or captured video content or self-media articles or atlases written. The network object can upload the corresponding cover image while uploading the video to be processed, or can select the initial cover image from the video to be processed through the server.

内容提供端通过和上下行内容接口服务的通讯，先获取上传服务器接口地址，然后上传待处理视频，调用内容存储服务，将待处理视频存储在内容数据库中。上下行内容接口服务还可以从内容提供端获取待处理视频的标题，发布者，摘要，封面图，发布时间等。在调用内容存储服务时，还可以将待处理视频的元信息，比如视频文件大小，封面图链接，码率，文件格式，标题，发布时间，作者，是否原创的标记，以及是否首发等信息存储在内容数据库中。上下行内容接口服务可以将在内容数据库中存储的数据，提交给调度中心服务，以便调度中心服务进行后续的内容处理和流转。The content provider first obtains the interface address of the upload server through communication with the upstream and downstream content interface services, then uploads the video to be processed, calls the content storage service, and stores the video to be processed in the content database. The upstream and downstream content interface services can also obtain the title, publisher, abstract, cover image, publishing time, etc. of the video to be processed from the content provider. When calling the content storage service, you can also store the meta information of the video to be processed, such as video file size, cover image link, bit rate, file format, title, release time, author, whether it is an original mark, and whether it is first published. in the content database. The upstream and downstream content interface services can submit the data stored in the content database to the dispatch center service, so that the dispatch center service can perform subsequent content processing and circulation.

从而，网络对象在搜索视频时，内容消费端可以和内容分发出口服务通讯，获取被搜索视频对应的索引信息。通过和内容存储服务通讯，从内容数据库下载索引信息对应的被搜索视频的流媒体文件。从而，可以通过本地播放器播放流媒体文件，或与边缘部署的CDN服务通讯，呈现图文数据。Therefore, when a network object searches for a video, the content consumer can communicate with the content distribution export service to obtain index information corresponding to the searched video. By communicating with the content storage service, the streaming media file of the searched video corresponding to the index information is downloaded from the content database. Thus, streaming media files can be played through local players, or communicated with CDN services deployed at the edge to present graphic and text data.

内容消费端可以将上传和下载过程当中，网络对象浏览视频的行为数据，阅读速度，完成率，阅读时间，卡顿，加载时间，播放点击等上报给服务器，以使服务器可以基于上报的数据为网络对象提供更加人性化的服务。The content consumer can report the behavior data, reading speed, completion rate, reading time, freeze, loading time, play clicks, etc. of network objects browsing videos to the server during the uploading and downloading process, so that the server can be based on the reported data. Network objects provide more user-friendly services.

内容消费端可以通过Feeds流方式浏览视频，针对低质量内容提供直接举报和反馈的入口，该入口直接和人工审核系统对接，操作人员通过人工审核系统进行确认和复核，可以作为后续针对网络对象上传的视频进行过滤的过程中，机器模型质量过滤的样本数据。The content consumer can browse videos through Feeds streaming, and provide a direct report and feedback portal for low-quality content. This portal is directly connected to the manual review system, and operators can confirm and review through the manual review system, which can be used as a follow-up upload for network objects The sample data for the quality filtering of the machine model during the video filtering process.

调度中心服务主要包括机器处理和前述的人工审核处理，其中，机器处理可以进行各种质量判断，比如低质量内容过滤；还可以进行标记内容标签，比如内容分类，话题信息等；还可以进行内容排重等。调度中心服务获得的处理结果可以写入内容数据库中。The dispatch center service mainly includes machine processing and the aforementioned manual review processing. Among them, machine processing can perform various quality judgments, such as low-quality content filtering; it can also mark content tags, such as content classification, topic information, etc.; Rearrangement, etc. The processing results obtained by the dispatch center service can be written into the content database.

人工审核处理可以通过人工审核系统调用人工审核服务实现。在人工审核过程当中，人工审核系统会读取内容数据库当中的信息，同时人工审核的结果和状态也会回传进入内容数据库中。因此，内容数据库还包括人工审核过程中对内容的分类，包括一，二，三级别分类和标签信息，比如，一个讲解品牌手机的视频，一级分科是科技，二级分类是智能手机，三级分类是国内手机，标签信息是品牌，型号等。调度中心服务可以依照内容数据库中各个视频的一级分类的标注信息，进行不同调性增强模板策略指定等。The manual review process can be implemented by calling the manual review service through the manual review system. During the manual review process, the manual review system will read the information in the content database, and the results and status of the manual review will also be returned to the content database. Therefore, the content database also includes the classification of content during the manual review process, including first, second, and third-level classification and label information, for example, a video explaining a brand mobile phone, the first category is technology, the second category is smartphones, and the third category is smartphones. The class classification is the domestic mobile phone, and the label information is the brand, model, etc. The dispatch center service can specify different tonal enhancement template policies according to the label information of the first-level classification of each video in the content database.

人工审核服务可以是一个WEB系统，在链路上将机器过滤的结果作为输入，对机器过滤的结果进行人工确认和复核，将复核的结果写入内容数据库记录下来，同时可以通过这里人工复核的结果来在线评估机器过滤模型的实际效果。The manual review service can be a WEB system that takes the results of machine filtering as input on the link, manually confirms and reviews the results of machine filtering, and writes the review results into the content database for records. The results are used to evaluate the actual effect of the machine filtering model online.

调度中心服务在进行内容排重时，可以调用内容排重服务，内容排重服务主要包括标题去重，封面图的图片去重，内容正文去重，以及视频指纹和音频指纹去重。去重的过程通常是将图文内容标题和正文向量化，采用simmhash及BERT正文向量。在对图片向量去重时，对于视频内容抽取视频指纹和音频指纹构建向量，然后计算向量之间的距离，比如欧式距离，来确定是否重复。内容排重可以减少内容的审核量和确保同样内容在推荐分发池只有一份，保障用户体验。The scheduling center service can call the content reordering service when performing content reordering. The content reordering service mainly includes title removal, cover image removal, content text removal, and video fingerprint and audio fingerprint removal. The process of deduplication is usually to vectorize the title and text of the graphic content, using simmhash and BERT text vector. When deduplicating the image vector, the video fingerprint and the audio fingerprint are extracted from the video content to construct a vector, and then the distance between the vectors, such as the Euclidean distance, is calculated to determine whether it is repeated. Content sorting can reduce the amount of content review and ensure that there is only one copy of the same content in the recommended distribution pool, ensuring user experience.

调度中心服务主要负责视频和图文内容流转的整个调度过程，通过上下行内容接口服务接收上传的待处理视频，然后从内容数据库中获取视频的元信息。调度中心服务可以在接收上传的待处理视频时，调用画中画服务，为待处理视频生成目标封面图像，并存储与内容数据库中；也可以在内容消费端搜索到某一待处理视频时，调用画中画服务，为待处理视频生成目标封面图像，在通过内容消费端进行呈现等。The scheduling center service is mainly responsible for the entire scheduling process of video and graphic content flow. It receives uploaded videos to be processed through the upstream and downstream content interface services, and then obtains the meta information of the videos from the content database. The dispatch center service can call the picture-in-picture service when receiving the uploaded video to be processed, generate the target cover image for the video to be processed, and store it in the content database; Call the picture-in-picture service to generate the target cover image for the video to be processed, and present it through the content consumer.

画中画服务可以调用画面素材提取服务，基于待处理视频的关键特征信息，确定与关键特征信息相匹配的候选图像，调用选择和截取封面图服务，其中，封面图截取服务主要通过原图尺寸和裁剪后的目标尺寸，利用对候选图像的人物检测，主体检测，OCR文字识别，进行相应的截图，获得候选素材。The picture-in-picture service can call the picture material extraction service, determine the candidate images that match the key feature information based on the key feature information of the video to be processed, and call the selection and capture service of the cover image, where the cover image capture service mainly uses the original image size. and the cropped target size, use the person detection, subject detection, and OCR text recognition of the candidate image to take corresponding screenshots to obtain candidate materials.

在获得候选素材之后，可以调用选择和截取封面图服务，其中，封面图选择服务主要是对候选素材按照基础的质量特性比如清晰度，美观度，不适合图片，马赛克，低俗色情等进行过滤和筛选，去掉一些不适合做封面的低质量候选素材。After obtaining the candidate material, you can call the service of selecting and intercepting the cover image. The cover image selection service mainly filters and selects the candidate material according to basic quality characteristics such as clarity, aesthetics, unsuitable for pictures, mosaics, vulgar pornography, etc. Screen and remove some low-quality candidates that are not suitable for cover.

在基于各个候选素材，选取满足呈现条件的目标素材之后，获得的目标素材可以存储于增强图片素材库中，以便用于后续视频的封面图像。增强图片素材库用于保存图片增强的候选集，包括视频内容的抽帧的内容和采购版权内容构成。在获得目标素材之后，可以调用模板数据库，选取目标模板。画中画服务可以基于目标模板，合成目标素材和待处理视频的初始封面图像，生成待处理视频的目标封面图像。初始封面图像可以是内容提供端上传的封面图像，也可以是调用选择和截取封面图服务从待处理视频的各个视频帧中选取的封面图像等，具体不做限制。After selecting the target material that meets the presentation conditions based on each candidate material, the obtained target material can be stored in the enhanced picture material library, so as to be used for the cover image of the subsequent video. The enhanced image material library is used to save a candidate set for image enhancement, including the content of frame extraction of video content and the composition of purchased copyright content. After obtaining the target material, the template database can be called to select the target template. The picture-in-picture service can generate the target cover image of the video to be processed by synthesizing the target material and the initial cover image of the video to be processed based on the target template. The initial cover image may be the cover image uploaded by the content provider, or the cover image selected from each video frame of the video to be processed by invoking the selection and interception cover image service, which is not specifically limited.

模板数据库核心合成原则为，画面主体元素不应该被遮挡住，可以利用主体目标检测确定几个非目标区域，比如左侧，右侧或者上侧还是下侧策略，具体可以由实际情况来确定。模板数据库可以智能增强服务通讯，提供策略展现，如果有文字，还包括合成文字的字体和样式配置策略等。The core synthesis principle of the template database is that the main elements of the screen should not be blocked. The main target detection can be used to determine several non-target areas, such as the left, right or upper or lower strategies, which can be determined by the actual situation. The template database can intelligently enhance service communication, provide policy presentation, and if there is text, it also includes font and style configuration policies for synthesized text.

智能画中画服务可以和调度中心服务通讯，完成本申请实施例提供的生成视频封面图像的方法。调度中心服务包括提取标题关键词和实体词，然后是通过和画面素材提取服务通讯，完成目标素材的筛选和匹配，最后是生成目标封面图像，通过目标素材达到输出增强效果。The intelligent picture-in-picture service can communicate with the dispatch center service to complete the method for generating a video cover image provided by the embodiment of the present application. The dispatch center service includes extracting title keywords and entity words, then communicating with the screen material extraction service to complete the screening and matching of the target material, and finally generating the target cover image to achieve the output enhancement effect through the target material.

下面对本申请实施例提供的生成视频封面图像的方法进行示例介绍，请参考图6a。The following describes an example of the method for generating a video cover image provided by the embodiment of the present application, please refer to FIG. 6a.

客户端在获得待处理视频，以及待处理视频的初始封面图像之后，可以基于待处理视频，以及待处理视频的发布信息、字幕文件，或发布信息和字幕文件，提取待处理视频包含的关键特征信息。关键特征信息可以包括词特征、人脸特征或物体特征中的一种或多种，例如关键特征信息是文字、人名、动物等实体的特征。After obtaining the video to be processed and the initial cover image of the video to be processed, the client can extract the key features contained in the video to be processed based on the video to be processed and the release information and subtitle file of the video to be processed, or the release information and subtitle file information. The key feature information may include one or more of word features, face features, or object features. For example, the key feature information is the features of entities such as words, names, and animals.

在获得词特征、人脸特征或物体特征中的一种或多种之后，客户端可以从各个参考图像，以及各个待处理视频帧中，确定与关键特征信息相匹配的各个候选图像，并对各个候选图像进行目标识别处理，将各个候选图像中的目标裁剪出来，获得各个候选素材。After obtaining one or more of word features, face features or object features, the client can determine each candidate image matching the key feature information from each reference image and each video frame to be processed, and apply Target recognition processing is performed on each candidate image, and the target in each candidate image is cropped to obtain each candidate material.

在获得与关键特征信息相匹配的各个候选素材之后，客户端可以从各个候选素材中选取出满足呈现条件的候选素材，作为目标素材。例如，依次基于清晰度评估策略和内容质量评估策略，对各个候选素材进行清晰度评估以及内容质量评估，选取其中较为清晰，且内容质量较高的候选素材。客户端在获得选取出的较为清晰，且内容质量较高的候选素材之后，还可以对这些候选素材进行一些其他后处理，获得目标素材，例如，根据初始封面图像中，不包含目标对象的区域，调整这些候选素材的尺寸，获得目标素材，使得在将目标素材作为画中画图像，添加到初始封面图像上时，可以既不影响初始封面图像所表达的内容，又可以进一步表达待处理视频的主题或包含的关键对象，并对待处理视频的主题或包含的关键对象起到强调的作用。After obtaining each candidate material that matches the key feature information, the client can select a candidate material that satisfies the presentation condition from each candidate material as a target material. For example, based on the definition evaluation strategy and the content quality evaluation strategy, the definition evaluation and the content quality evaluation are performed on each candidate material, and the candidate material with relatively clear and high content quality is selected. After obtaining the selected candidate materials with relatively clear and high content quality, the client can also perform some other post-processing on these candidate materials to obtain the target materials. For example, according to the area of the initial cover image that does not contain the target object , adjust the size of these candidate materials to obtain the target material, so that when the target material is added as a picture-in-picture image to the initial cover image, it can not affect the content expressed by the initial cover image, and can further express the video to be processed The subject or key objects contained in the video are emphasized, and the subject or key subject contained in the video to be processed plays a role of emphasis.

目标素材的数量可以是一个，也可以是多个，目标素材作为画中画图像，添加到初始封面图像上的方式有多种，可以是随机添加，也可以是以占满初始封面图像中的非目标区域为目标进行添加，也可以将多个目标素材作为一个整体，添加到初始封面图像中，也可以通过获取预存的添加模板，将相应的目标素材添加到模板指定的位置上等，具体不做限制。The number of target materials can be one or more. The target material, as a picture-in-picture image, can be added to the initial cover image in various ways. It can be added randomly, or it can be filled in the initial cover image. The non-target area is added for the target, or multiple target materials can be added as a whole to the initial cover image, or the corresponding target material can be added to the position specified by the template by acquiring the pre-stored addition template, etc. No restrictions.

下面以一个短视频为例，对本申请实施例提供的生成视频封面图像的方法进行示例介绍。The following takes a short video as an example to introduce an example of the method for generating a video cover image provided by the embodiment of the present application.

例如，一个短视频通过标题介绍其看点，“小浣熊在沙发上吃葡萄，发现碗里没有之后，它的反应我能笑一年”，标题为了完整准确地表达语义，一般字数较多，且通常位于不显眼的位置，请参考图6b，通过短视频包含的一个视频帧，仅能够表达小浣熊在沙发上吃葡萄的场景，无法直观地准确地向网络对象传达短视频的主题，即小浣熊吃葡萄时，以及吃完葡萄时反应对比。For example, a short video introduces its point of view through the title, "The little raccoon eats grapes on the sofa and finds that there are no grapes in the bowl. I can laugh for a year." And usually located in an inconspicuous position, please refer to Figure 6b, a video frame contained in the short video can only express the scene of the little raccoon eating grapes on the sofa, and cannot intuitively and accurately convey the subject of the short video to the network object, namely A comparison of the reactions of raccoons when they ate grapes and when they finished eating grapes.

服务器在获得该短视频，确定该短视频的初始封面图像之后，请参考图6c，可以基于该短视频的标题、该短视频包含的视频帧，提取该短视频的关键特征信息。关键特征信息可以包括关键词的词特征，即标题中的实体“小浣熊”的词特征，还可以包括关键对象的物体特征，即视频帧中的小浣熊的物体区域的物体特征。After obtaining the short video and determining the initial cover image of the short video, referring to FIG. 6c, the server can extract the key feature information of the short video based on the title of the short video and the video frames contained in the short video. The key feature information may include the word feature of the keyword, that is, the word feature of the entity "little raccoon" in the title, and the object feature of the key object, that is, the object feature of the object area of the little raccoon in the video frame.

服务器可以以关键词“小浣熊”的词特征，以及关键对象小浣熊的物体区域作为查询主体，在该短视频包含的视频帧中，确定与关键词相匹配的候选图像，以及，与关键对象相匹配的候选图像，获得各个候选图像。The server can use the word feature of the keyword "little raccoon" and the object area of the key object "little raccoon" as the subject of the query, and in the video frames included in the short video, determine candidate images that match the keyword, and the key object. Each candidate image is obtained by matching the candidate images.

在获得各个候选图像之后，请参考图6d，服务器可以对各个候选图像进行去重处理，并根据各个候选图像各自与关键词相匹配的图像区域，或各自与关键对象相匹配的图像区域，对各个候选图像进行裁剪处理，获得各个候选素材。After obtaining each candidate image, please refer to FIG. 6d, the server can perform deduplication processing on each candidate image, and according to the image area of each candidate image that matches the keyword, or the image area that matches the key object, the Each candidate image is cropped to obtain each candidate material.

服务器可以基于清晰度评估策略，对各个候选素材进行清晰度评估，确定各个候选素材各自的清晰度评估值，并基于内容质量评估策略，对各个候选素材进行内容质量评估，确定各个候选素材各自的内容质量评估值。最后基于相应清晰度评估值和内容质量评估值的加权和，选取加权和最大的候选素材，作为目标素材。The server may perform a definition evaluation on each candidate material based on the definition evaluation strategy, determine the respective definition evaluation value of each candidate material, and based on the content quality evaluation strategy, perform content quality evaluation on each candidate material, and determine the respective quality of each candidate material. Content quality evaluation value. Finally, based on the weighted sum of the corresponding clarity evaluation value and the content quality evaluation value, the candidate material with the largest weighted sum is selected as the target material.

服务器在获得初始封面图像和目标素材之后，请参考图6e，可以对初始封面图像进行目标检测处理，确定初始封面图像中不包含目标对象的非目标区域，即初始封面图像中，小浣熊所在区域以外的区域。After the server obtains the initial cover image and the target material, referring to Figure 6e, the server can perform target detection processing on the initial cover image to determine the non-target area that does not contain the target object in the initial cover image, that is, the area where the raccoon is located in the initial cover image outside the area.

服务器可以将目标素材覆盖在非目标区域之上，并结合预存的封面模板，合成目标素材和初始封面图像，生成目标封面图像。The server can overlay the target material on the non-target area, and combine the pre-stored cover template to synthesize the target material and the initial cover image to generate the target cover image.

基于同一发明构思，本申请实施例提供一种生成视频封面图像的装置，能够实现前述的生成视频封面图像的方法对应的功能。请参考图7，该装置包括获取模块701和处理模块702，其中：Based on the same inventive concept, an embodiment of the present application provides an apparatus for generating a video cover image, which can implement the functions corresponding to the foregoing method for generating a video cover image. Referring to FIG. 7, the apparatus includes an acquisition module 701 and a processing module 702, wherein:

获取模块701：用于获取待处理视频，以及待处理视频的初始封面图像；Obtaining module 701: used to obtain the video to be processed and the initial cover image of the video to be processed;

处理模块702：基于待处理视频和初始封面图像，生成待处理视频的目标封面图像；Processing module 702: Based on the video to be processed and the initial cover image, generate a target cover image of the video to be processed;

其中，目标封面图像是在初始封面图像上，添加画中画图像生成的，画中画图像与待处理视频的内容相匹配。The target cover image is generated by adding a picture-in-picture image to the initial cover image, and the picture-in-picture image matches the content of the video to be processed.

在一种可能的实施例中，处理模块702具体用于：In a possible embodiment, the processing module 702 is specifically configured to:

基于信息提取策略，提取待处理视频包含的关键特征信息，其中，关键特征信息用于表征：待处理视频的主题，以及包含的关键对象；Based on the information extraction strategy, extract the key feature information contained in the video to be processed, wherein the key feature information is used to characterize: the subject of the video to be processed, and the key objects contained;

获取与关键特征信息相匹配的各个候选素材，并从各个候选素材中选取出满足呈现条件的候选素材，作为目标素材；Obtain each candidate material that matches the key feature information, and select the candidate material that meets the presentation conditions from each candidate material as the target material;

将目标素材作为初始封面图像的画中画图像，合成目标素材与初始封面图像，生成目标封面图像。The target material is used as the picture-in-picture image of the initial cover image, and the target material and the initial cover image are synthesized to generate the target cover image.

获取各个参考图像，其中，参考图像是从网络资源中收集的图像；obtaining each reference image, wherein the reference image is an image collected from network resources;

提取待处理视频包含的各个待处理视频帧各自的视频帧特征，以及提取各个参考图像各自的图像特征；Extracting the respective video frame features of each to-be-processed video frame contained in the video to be processed, and extracting the respective image features of each reference image;

基于各个视频帧特征和各个图像特征，从各个待处理视频帧和各个参考图像中，确定与关键特征信息相匹配的各个候选素材。Based on each video frame feature and each image feature, each candidate material matching the key feature information is determined from each video frame to be processed and each reference image.

在一种可能的实施例中，关键特征信息包含：词特征、人脸特征和物体特征；In a possible embodiment, the key feature information includes: word feature, face feature and object feature;

则处理模块702具体用于：Then the processing module 702 is specifically used for:

分别确定词特征与各个视频帧特征之间，以及词特征与各个图像特征之间的图文相似度；Determine the graphic similarity between the word feature and each video frame feature, as well as between the word feature and each image feature;

分别确定人脸特征与各个视频帧特征之间，以及人脸特征与各个图像特征之间的人脸相似度；Determine the face similarity between the face feature and each video frame feature, and between the face feature and each image feature;

分别确定物体特征与各个视频帧特征之间，以及，物体特征与各个图像特征之间的物体相似度；Determine the object similarity between the object feature and each video frame feature, and between the object feature and each image feature;

基于获得的图文匹配度、各个人脸匹配度和各个物体匹配度，从各个待处理视频帧和各个参考图像中，确定与关键特征信息相匹配的各个候选素材。Based on the obtained image-text matching degree, each face matching degree and each object matching degree, each candidate material matching the key feature information is determined from each to-be-processed video frame and each reference image.

在一种可能的实施例中，处理模块702采用如下方法获得关键特征信息：In a possible embodiment, the processing module 702 uses the following method to obtain the key feature information:

获取待处理视频的发布信息和字幕文件，并提取发布信息和字幕文件包含的关键词的词特征；Obtain the release information and subtitle files of the video to be processed, and extract the word features of the keywords contained in the release information and subtitle files;

获取待处理视频中的关键视频帧，并提取关键视频帧包含的人脸区域的人脸特征，以及包含的物体区域的物体特征，其中，关键视频帧为待处理视频中，表征视频场景切换的视频帧；Acquire key video frames in the video to be processed, and extract the face features of the face region contained in the key video frames and the object features of the object regions contained in the key video frames, wherein the key video frames are in the video to be processed and represent the video scene switching. video frame;

将获得的词特征、人脸特征和物体特征，作为关键特征信息。The obtained word features, face features and object features are used as key feature information.

基于各个视频帧特征和各个图像特征，从各个待处理视频帧和各个参考图像中，确定与关键特征信息相匹配的各个候选图像；Based on each video frame feature and each image feature, from each to-be-processed video frame and each reference image, determine each candidate image that matches the key feature information;

分别确定各个候选图像，各自与关键特征信息相匹配的图像区域；Determine each candidate image separately, and each image area that matches the key feature information;

基于各个图像区域，分别对各个候选图像进行裁剪处理，获得各个候选素材。Based on each image area, each candidate image is cropped to obtain each candidate material.

基于清晰度评估策略，对各个候选素材进行清晰度评估，确定各个候选素材各自的清晰度评估值；Based on the sharpness evaluation strategy, evaluate the sharpness of each candidate material, and determine the respective sharpness evaluation value of each candidate material;

基于内容质量评估策略，对各个候选素材进行内容质量评估，确定各个候选素材各自的内容质量评估值；Based on the content quality evaluation strategy, perform content quality evaluation on each candidate material, and determine the content quality evaluation value of each candidate material;

获得各个清晰度评估值和各个内容质量评估值的加权和，并将加权和符合大于呈现阈值的候选素材，作为目标素材。A weighted sum of each clarity evaluation value and each content quality evaluation value is obtained, and the weighted sum conforms to the candidate material greater than the presentation threshold as the target material.

检测初始封面图像，确定初始封面图像包含的目标对象；Detect the initial cover image, and determine the target object contained in the initial cover image;

基于目标对象在初始封面图像中的位置，将初始封面图像划分为目标区域和非目标区域；Divide the initial cover image into target area and non-target area based on the position of the target object in the initial cover image;

基于非目标区域的形状尺寸，调整目标素材的尺寸；Adjust the size of the target material based on the shape and size of the non-target area;

将调整后的目标素材，覆盖在初始封面图像中非目标区域之上，生成待处理视频的目标封面图像。The adjusted target material is overlaid on the non-target area in the initial cover image to generate the target cover image of the video to be processed.

请参照图8，上述生成视频封面图像的装置可以运行在计算机设备800上，数据存储程序的当前版本和历史版本以及数据存储程序对应的应用软件可以安装在计算机设备800上，该计算机设备800包括处理器880以及存储器820。在一些实施例中，该计算机设备800可以包括显示单元840，显示单元840包括显示面板841，用于显示由用户交互操作界面等。Please refer to FIG. 8 , the above-mentioned device for generating a video cover image can run on a computer device 800, and the current version and historical version of the data storage program and application software corresponding to the data storage program can be installed on the computer device 800, and the computer device 800 includes Processor 880 and memory 820. In some embodiments, the computer device 800 may include a display unit 840 including a display panel 841 for displaying an interface and the like for user interaction.

在一种可能的实施例中，可以采用液晶显示器(Liquid Crystal Display，LCD)或有机发光二极管OLED(Organic Light-Emitting Diode)等形式来配置显示面板841。In a possible embodiment, the display panel 841 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an organic light-emitting diode (OLED) (Organic Light-Emitting Diode).

处理器880用于读取计算机程序，然后执行计算机程序定义的方法，例如处理器880读取数据存储程序或文件等，从而在该计算机设备800上运行数据存储程序，在显示单元840上显示对应的界面。处理器880可以包括一个或多个通用处理器，还可包括一个或多个DSP(Digital Signal Processor，数字信号处理器)，用于执行相关操作，以实现本申请实施例所提供的技术方案。The processor 880 is used to read the computer program, and then execute the method defined by the computer program, for example, the processor 880 reads the data storage program or file, etc., so as to run the data storage program on the computer device 800, and display the corresponding data on the display unit 840. interface. The processor 880 may include one or more general-purpose processors, and may further include one or more DSPs (Digital Signal Processors, digital signal processors), which are configured to perform related operations to implement the technical solutions provided by the embodiments of the present application.

存储器820一般包括内存和外存，内存可以为随机存储器(RAM)，只读存储器(ROM)，以及高速缓存(CACHE)等。外存可以为硬盘、光盘、USB盘、软盘或磁带机等。存储器820用于存储计算机程序和其他数据，该计算机程序包括各客户端对应的应用程序等，其他数据可包括操作系统或应用程序被运行后产生的数据，该数据包括系统数据(例如操作系统的配置参数)和用户数据。本申请实施例中程序指令存储在存储器820中，处理器880执行存储器820中的程序指令，实现前文图论述的任意的一种生成视频封面图像的方法。The memory 820 generally includes internal memory and external memory. The internal memory can be random access memory (RAM), read only memory (ROM), and cache memory (CACHE). The external storage can be a hard disk, an optical disk, a USB disk, a floppy disk or a tape drive. The memory 820 is used to store computer programs and other data, the computer programs include application programs corresponding to each client, and the other data may include data generated after the operating system or the application program is executed, and the data includes system data (such as operating system data). configuration parameters) and user data. In this embodiment of the present application, the program instructions are stored in the memory 820, and the processor 880 executes the program instructions in the memory 820 to implement any method for generating a video cover image discussed in the preceding figures.

上述显示单元840用于接收输入的数字信息、字符信息或接触式触摸操作/非接触式手势，以及产生与计算机设备800的用户设置以及功能控制有关的信号输入等。具体地，本申请实施例中，该显示单元840可以包括显示面板841。显示面板841例如触摸屏，可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在显示面板841上或在显示面板841的操作)，并根据预先设定的程式驱动相应的连接装置。The above-mentioned display unit 840 is used for receiving input digital information, character information or contact touch operation/non-contact gesture, and generating signal input related to user settings and function control of the computer device 800 . Specifically, in this embodiment of the present application, the display unit 840 may include a display panel 841 . The display panel 841 is, for example, a touch screen, which can collect the user's touch operations on or near it (such as the user's operations on the display panel 841 or on the display panel 841 using a finger, a stylus, and any other suitable objects or accessories), and according to preset The specified program drives the corresponding connection device.

在一种可能的实施例中，显示面板841可包括触摸检测装置和触摸控制器两个部分。其中，触摸检测装置检测玩家的触摸方位，并检测触摸操作带来的信号，将信号传送给触摸控制器；触摸控制器从触摸检测装置上接收触摸信息，并将它转换成触点坐标，再送给处理器880，并能接收处理器880发来的命令并加以执行。In a possible embodiment, the display panel 841 may include two parts, a touch detection device and a touch controller. Among them, the touch detection device detects the touch orientation of the player, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and sends it to the touch controller. To the processor 880, and can receive the command sent by the processor 880 and execute it.

其中，显示面板841可以采用电阻式、电容式、红外线以及表面声波等多种类型实现。除了显示单元840，在一些实施例中，计算机设备800还可以包括输入单元830，输入单元830可以包括图像输入设备831和其他输入设备832，其中其他输入设备可以但不限于包括物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。Among them, the display panel 841 can be realized by various types such as resistive type, capacitive type, infrared ray and surface acoustic wave. In addition to the display unit 840, in some embodiments, the computer device 800 may also include an input unit 830, and the input unit 830 may include an image input device 831 and other input devices 832, wherein the other input devices may include, but are not limited to, a physical keyboard, function keys One or more of (such as volume control buttons, switch buttons, etc.), trackball, mouse, joystick, etc.

除以上之外，计算机设备800还可以包括用于给其他模块供电的电源890、音频电路860、近场通信模块870和RF电路810。计算机设备800还可以包括一个或多个传感器850，例如加速度传感器、光传感器、压力传感器等。音频电路860具体包括扬声器861和麦克风862等，例如计算机设备800可以通过麦克风862采集用户的声音，进行相应的操作等。In addition to the above, computer device 800 may also include power supply 890 for powering other modules, audio circuitry 860, near field communication module 870, and RF circuitry 810. Computer device 800 may also include one or more sensors 850, such as acceleration sensors, light sensors, pressure sensors, and the like. The audio circuit 860 specifically includes a speaker 861, a microphone 862, and the like. For example, the computer device 800 can collect the user's voice through the microphone 862, and perform corresponding operations.

作为一种实施例，处理器880的数量可以是一个或多个，处理器880和存储器820可以是耦合设置，也可以是相对独立设置。As an embodiment, the number of processors 880 may be one or more, and the processors 880 and the memory 820 may be coupled or relatively independent.

作为一种实施例，图8中的处理器880可以用于实现如图7中的获取模块701和处理模块702的功能。As an embodiment, the processor 880 in FIG. 8 may be used to implement the functions of the acquiring module 701 and the processing module 702 in FIG. 7 .

作为一种实施例，图8中的处理器880可以用于实现前文论述的服务器或终端设备对应的功能。As an embodiment, the processor 880 in FIG. 8 may be used to implement the functions corresponding to the server or the terminal device discussed above.

本领域普通技术人员可以理解：实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成，前述的程序可以存储于一计算机可读取存储介质中，该程序在执行时，执行包括上述方法实施例的步骤；而前述的存储介质包括：移动存储设备、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。Those of ordinary skill in the art can understand that all or part of the steps of implementing the above method embodiments can be completed by program instructions related to hardware, the aforementioned program can be stored in a computer-readable storage medium, and when the program is executed, execute Including the steps of the above-mentioned method embodiment; and the aforementioned storage medium includes: a mobile storage device, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disk and other various A medium on which program code can be stored.

或者，本发明上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时，也可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，例如，通过计算机程序产品体现，该计算机程序产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备执行本发明各个实施例所述方法的全部或部分。而前述的存储介质包括：移动存储设备、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。Alternatively, if the above-mentioned integrated unit of the present invention is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of software products in essence or in the parts that make contributions to the prior art, for example, embodied in a computer program product, and the computer program product is stored in a storage The medium includes several instructions for causing a computer device to perform all or part of the methods described in various embodiments of the present invention. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk and other mediums that can store program codes.

显然，本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样，倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内，则本申请也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the spirit and scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include these modifications and variations.

Claims

1. a method for generating a video cover image, is characterized in that, comprising:

Obtain the video to be processed, and the initial cover image of the video to be processed;

generating a target cover image of the video to be processed based on the video to be processed and the initial cover image;

The target cover image is generated by adding a picture-in-picture image to the initial cover image, and the picture-in-picture image matches the content of the video to be processed.

2. The method according to claim 1, wherein generating a target cover image of the video to be processed based on the video to be processed and the initial cover image, comprising:

Based on the information extraction strategy, extract the key feature information contained in the video to be processed, wherein the key feature information is used to represent: the subject of the video to be processed, and the key objects contained;

Acquire each candidate material that matches the key feature information, and select a candidate material that satisfies the presentation conditions from the various candidate materials as the target material;

Using the target material as a picture-in-picture image of the initial cover image, synthesizing the target material and the initial cover image to generate the target cover image.

3. The method according to claim 2, wherein acquiring each candidate material matching the key feature information comprises:

obtaining each reference image, wherein the reference image is an image collected from a network resource;

Extracting the respective video frame features of each to-be-processed video frame contained in the video to be processed, and extracting the respective image features of the respective reference images;

Based on each video frame feature and each image feature, from each of the video frames to be processed and each of the reference images, each candidate material matching the key feature information is determined.

4. The method according to claim 3, wherein the key feature information comprises: word feature, face feature and object feature;

Then, based on each video frame feature and each image feature, from the video frames to be processed and the reference images, determine each candidate material that matches the key feature information, including:

Determine the graphic similarity between the word feature and each video frame feature, and between the word feature and each image feature, respectively;

Respectively determine the facial similarity between the facial feature and each video frame feature, and between the facial feature and each image feature;

respectively determining the object similarity between the object feature and each video frame feature, and between the object feature and each image feature;

Based on the obtained image-text matching degree, each face matching degree and each object matching degree, each candidate material matching the key feature information is determined from each of the video frames to be processed and each of the reference images.

5. The method according to claim 4, wherein the key feature information is obtained by adopting the following method:

Obtain the release information and the subtitle file of the video to be processed, and extract the word features of the keywords contained in the release information and the subtitle file;

Acquiring key video frames in the video to be processed, and extracting the facial features of the face region included in the key video frames, and the object features of the object regions included, wherein the key video frames are the to-be-processed In the video, the video frame representing the video scene switching;

The obtained word features, face features and object features are used as the key feature information.

6 . The method according to claim 3 , wherein, based on each video frame feature and each image feature, from the video frames to be processed and the reference images, it is determined that the key feature information matches the key feature information. 7 . candidate materials, including:

Based on each video frame feature and each image feature, from each of the video frames to be processed and each of the reference images, determine each candidate image that matches the key feature information;

Determining the respective candidate images, each of which matches the image area with the key feature information;

Based on the respective image regions, the respective candidate images are cropped to obtain respective candidate materials.

7. The method according to claim 2, characterized in that, selecting candidate materials that meet the presentation conditions from the respective candidate materials as target materials, comprising:

Based on the sharpness evaluation strategy, perform sharpness evaluation on the respective candidate materials, and determine the respective sharpness evaluation values of the respective candidate materials;

Based on the content quality evaluation strategy, perform content quality evaluation on each candidate material, and determine the respective content quality evaluation value of each candidate material;

A weighted sum of each clarity evaluation value and each content quality evaluation value is obtained, and a candidate material whose weighted sum is greater than the presentation threshold is used as the target material.

8 . The method according to claim 2 , wherein the target material is used as a picture-in-picture image of the initial cover image, and the target material and the initial cover image are synthesized to generate the target cover image. 9 . ,include:

Detecting the initial cover image, and determining the target object contained in the initial cover image;

dividing the initial cover image into a target area and a non-target area based on the position of the target object in the initial cover image;

Adjusting the size of the target material based on the shape and size of the non-target area;

The adjusted target material is overlaid on the non-target area in the initial cover image to generate the target cover image.

9. A device for generating a video cover image, comprising:

Acquisition module: used to acquire the video to be processed and the initial cover image of the video to be processed;

Processing module: based on the video to be processed and the initial cover image, generate a target cover image of the video to be processed;

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.

11. A computer equipment, characterized in that, comprising:

memory for storing program instructions;

The processor is configured to call the program instructions stored in the memory, and execute the method according to any one of claims 1 to 8 according to the obtained program instructions.

12. A computer-readable storage medium, wherein the storage medium stores computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the method according to any one of claims 1 to 8 .