CN114302174A

CN114302174A - Video editing method and device, computing equipment and storage medium

Info

Publication number: CN114302174A
Application number: CN202111679091.2A
Authority: CN
Inventors: 张云栋; 刘程
Original assignee: Shanghai IQIYI New Media Technology Co Ltd
Current assignee: Shanghai IQIYI New Media Technology Co Ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-04-08
Anticipated expiration: 2041-12-31
Also published as: CN114302174B

Abstract

The application discloses a video clipping method, a video clipping device, a computing device and a storage medium, wherein the method comprises the following steps: the method comprises the steps of obtaining an original video to be processed, identifying a plurality of key positions and transition positions in the original video, wherein the key positions are used for indicating video clips in the original video, so that the plurality of video clips can be obtained by segmenting the original video according to the plurality of key positions and the transition positions, and splicing the video clips to obtain a target video based on the plurality of video clips, wherein the playing time length of the generated target video is less than the playing time length of the original video. Since the target video is generated according to the transition position, the video content of each clip segment in the target video at the start position and/or the end position can be generally made complete. Therefore, when the user watches the target video, the user usually considers that the consistency of the appearance is high because the content of the video is complete, so that the experience of watching the clipped target video by the user can be improved.

Description

Video editing method, device, computing device and storage medium

技术领域technical field

本申请涉及视频处理技术领域，特别是涉及一种视频剪辑方法、装置、计算设备及存储介质。The present application relates to the technical field of video processing, and in particular, to a video editing method, apparatus, computing device and storage medium.

背景技术Background technique

实际应用场景中，针对播放时长较长的视频，通常可以对该视频进行剪辑，以生成播放时长相对较短、包含核心视频内容的剪辑视频。比如，在互联网视频网站的脱口秀类综艺正片下方，通常会发布一些根据正片所剪辑得到的笑点集锦视频片段，以供观众快速观看全篇笑点片段。In practical application scenarios, for a video with a long playback duration, the video can usually be edited to generate a clipped video with a relatively short playback duration and including core video content. For example, under the feature film of a talk show variety show on an Internet video website, some video clips of the laughing point collection obtained by editing the feature film are usually released, so that the audience can quickly watch the whole laughing point clip.

由于采用人工剪辑的方式生成剪辑视频，不仅会使得人力成本较高，而且人工剪辑视频的效率通常也较低。因此，可以采用人工智能(Artificial Intelligence，AI)剪辑的方式，自动生成剪辑视频。但是，基于AI所生成的剪辑视频，通常存在观感不连贯的问题，如剪辑视频中存在某个人物一句话没有说完就被截断等，影响用户对于剪辑视频的观看体验。Since the edited video is generated by manual editing, not only the labor cost is high, but also the efficiency of manually editing the video is usually low. Therefore, an artificial intelligence (Artificial Intelligence, AI) editing method can be used to automatically generate a clipped video. However, the video clips generated based on AI usually have the problem of incoherent look and feel. For example, there is a character in the clip video that is cut off without finishing a sentence, which affects the user's viewing experience of the clip video.

发明内容SUMMARY OF THE INVENTION

本申请实施例提供了一种视频剪辑方法、装置、计算设备及存储介质，旨提高自动生成的剪辑视频的观感连贯性，以此提高用户对于剪辑视频的观看体验。Embodiments of the present application provide a video editing method, apparatus, computing device, and storage medium, which aim to improve the look-and-feel consistency of the automatically generated edited video, thereby improving the user's viewing experience of the edited video.

第一方面，本申请实施例提供了一种视频剪辑方法，所述方法包括：In a first aspect, an embodiment of the present application provides a video editing method, and the method includes:

获取待处理的原始视频；Get the raw video to be processed;

识别所述原始视频中的多个关键位置以及转场位置，所述关键位置用于指示所述原始视频中的视频片段；Identifying multiple key positions and transition positions in the original video, where the key positions are used to indicate video clips in the original video;

根据所述多个关键位置以及所述转场位置，从所述原始视频中切分得到多个视频片段；According to the plurality of key positions and the transition positions, a plurality of video clips are obtained by segmenting the original video;

基于所述多个视频片段，拼接得到目标视频，所述目标视频的播放时长小于所述原始视频的播放时长。Based on the plurality of video clips, a target video is obtained by splicing, and the playback duration of the target video is shorter than the playback duration of the original video.

在一种可能的实施方式中，所述识别所述原始视频中的转场位置，包括：In a possible implementation manner, the identifying the transition position in the original video includes:

计算所述原始视频中相邻的第一帧图像与第二帧图像之间的相似度；Calculate the similarity between the adjacent first frame images and the second frame images in the original video;

当所述第一帧图像与所述第二帧图像之间的相似度小于预设阈值时，将所述第一帧图像或所述第二帧图像在所述原始视频中的位置确定为所述转场位置。When the similarity between the first frame image and the second frame image is less than a preset threshold, the position of the first frame image or the second frame image in the original video is determined as the the transition location.

在一种可能的实施方式中，所述根据所述多个关键位置以及所述转场位置，从所述原始视频中切分得到多个视频片段，包括：In a possible implementation manner, according to the multiple key positions and the transition positions, multiple video clips are obtained by segmenting the original video, including:

根据所述原始视频中的多个关键位置，确定所述原始视频中多个候选视频片段对应的起始分割点以及终止分割点；According to a plurality of key positions in the original video, determine the start division point and the end division point corresponding to the plurality of candidate video segments in the original video;

确定所述原始视频中与目标候选视频片段的起始分割点之间的距离不超过第一预设距离的多帧第一视频图像中是否包括转场位置，以及所述原始视频中与所述目标候选视频片段的终止分割点之间的距离不超过第二预设距离的多帧第二视频图像中是否包括转场位置，所述目标候选视频片段为所述多个候选视频片段中的任一候选视频片段；Determine whether a transition position is included in the multi-frame first video images whose distance from the original video and the starting segmentation point of the target candidate video segment does not exceed a first preset distance, and whether the original video is different from the Whether the distance between the termination points of the target candidate video segment does not exceed the second preset distance includes transition positions in the multi-frame second video images, and the target candidate video segment is any of the plurality of candidate video segments. a candidate video segment;

当所述多帧第一视频图像中包括转场位置，和/或，所述多帧第二视频图像中包括转场位置时，根据所述多帧第一视频图像中的转场位置和/或所述多帧第二视频图像中的转场位置，从所述原始视频中切分得到所述目标候选视频片段。When the multi-frame first video images include transition positions, and/or, the multi-frame second video images include transition positions, according to the transition positions in the multi-frame first video images and/or Or at the transition positions in the second video images of the multiple frames, the target candidate video segment is obtained by segmenting the original video.

在一种可能的实施方式中，所述原始视频为第一类型的视频，所述原始视频中的关键位置通过所述原始视频中的音频特征进行识别。In a possible implementation manner, the original video is a first type of video, and key positions in the original video are identified by audio features in the original video.

在一种可能的实施方式中，所述原始视频中的关键位置对应的音频内容为笑声和/或掌声，所述识别所述原始视频中的多个关键位置，包括：In a possible implementation manner, the audio content corresponding to the key positions in the original video is laughter and/or applause, and the identifying multiple key positions in the original video includes:

将所述原始视频输入至人工智能AI模型，得到所述AI模型输出的所述原始视频中的多个关键位置，所述AI模型预先通过带有笑声标记和/或掌声标记的样本视频完成训练；Inputting the original video into an artificial intelligence AI model to obtain multiple key positions in the original video output by the AI model, and the AI model is pre-completed with sample videos marked with laughter and/or applause train;

或，将所述原始视频中的音频数据与笑声和/或掌声对应的音频数据进行声纹特征的匹配，得到声纹特征匹配的所述多个关键位置。Or, performing voiceprint feature matching between the audio data in the original video and the audio data corresponding to laughter and/or applause to obtain the multiple key positions matched by the voiceprint feature.

在一种可能的实施方式中，所述原始视频为第二类型的视频，所述原始视频中的关键位置通过所述原始视频中的图像特征进行识别。In a possible implementation manner, the original video is a second type of video, and key positions in the original video are identified by image features in the original video.

在一种可能的实施方式中，所述识别所述原始视频中的多个关键位置，包括：In a possible implementation manner, the identifying multiple key positions in the original video includes:

从所述原始视频中确定出多个初始关键位置；determining a plurality of initial key positions from the original video;

利用光学字符识别技术对所述多个初始关键位置进行调整，得到所述多个关键位置，以使得每个所述关键位置为字幕开始显示或字幕结束显示的位置。The multiple initial key positions are adjusted by using the optical character recognition technology to obtain the multiple key positions, so that each of the key positions is the position where the subtitles start to be displayed or the subtitles end to be displayed.

第二方面，本申请实施例还提供了一种视频剪辑装置，所述装置包括：In a second aspect, an embodiment of the present application further provides a video editing device, the device comprising:

获取模块，用于获取待处理的原始视频；The acquisition module is used to acquire the raw video to be processed;

位置识别模块，用于识别所述原始视频中的多个关键位置以及转场位置，所述关键位置用于指示所述原始视频中的视频片段；a position identification module for identifying multiple key positions and transition positions in the original video, where the key positions are used to indicate video clips in the original video;

切分模块，用于根据所述多个关键位置以及所述转场位置，从所述原始视频中切分得到多个视频片段；A segmentation module, configured to obtain a plurality of video clips by segmentation from the original video according to the multiple key positions and the transition positions;

拼接模块，用于基于所述多个视频片段，拼接得到目标视频，所述目标视频的播放时长小于所述原始视频的播放时长。A splicing module, configured to splicing a target video based on the plurality of video clips, where the playback duration of the target video is shorter than the playback duration of the original video.

在一种可能的实施方式中，所述位置识别模块，包括：In a possible implementation, the location identification module includes:

计算单元，用于计算所述原始视频中相邻的第一帧图像与第二帧图像之间的相似度；a calculation unit, used for calculating the similarity between the adjacent first frame images and the second frame images in the original video;

第一确定单元，用于当所述第一帧图像与所述第二帧图像之间的相似度小于预设阈值时，将所述第一帧图像或所述第二帧图像在所述原始视频中的位置确定为所述转场位置。A first determining unit, configured to, when the similarity between the first frame image and the second frame image is less than a preset threshold, place the first frame image or the second frame image in the original The position in the video is determined as the transition position.

在一种可能的实施方式中，所述切分模块，包括：In a possible implementation, the segmentation module includes:

第二确定单元，用于根据所述原始视频中的多个关键位置，确定所述原始视频中多个候选视频片段对应的起始分割点以及终止分割点；a second determining unit, configured to determine, according to a plurality of key positions in the original video, a start split point and an end split point corresponding to a plurality of candidate video segments in the original video;

第三确定单元，用于确定所述原始视频中与目标候选视频片段的起始分割点之间的距离不超过第一预设距离的多帧第一视频图像中是否包括转场位置，以及所述原始视频中与所述目标候选视频片段的终止分割点之间的距离不超过第二预设距离的多帧第二视频图像中是否包括转场位置，所述目标候选视频片段为所述多个候选视频片段中的任一候选视频片段；a third determining unit, configured to determine whether a transition position is included in the multi-frame first video images whose distances from the original video and the starting segmentation point of the target candidate video segment do not exceed the first preset distance, and Whether a transition position is included in the multi-frame second video images whose distance between the original video and the termination point of the target candidate video segment does not exceed the second preset distance, and the target candidate video segment is the multi-frame second video image. any candidate video segment among the candidate video segments;

切分单元，用于当所述多帧第一视频图像中包括转场位置，和/或，所述多帧第二视频图像中包括转场位置时，根据所述多帧第一视频图像中的转场位置和/或所述多帧第二视频图像中的转场位置，从所述原始视频中切分得到所述目标候选视频片段。The segmentation unit is configured to, when the multi-frame first video images include transition positions, and/or, the multi-frame second video images include transition positions, according to the multi-frame first video images and/or the transition position in the multi-frame second video images, the target candidate video segment is obtained by segmenting the original video.

在一种可能的实施方式中，所述原始视频中的关键位置对应的音频内容为笑声和/或掌声，所述位置识别模块，包括：In a possible implementation manner, the audio content corresponding to key positions in the original video is laughter and/or applause, and the position identification module includes:

第一识别单元，用于将所述原始视频输入至人工智能AI模型，得到所述AI模型输出的所述原始视频中的多个关键位置，所述AI模型预先通过带有笑声标记和/或掌声标记的样本视频完成训练；The first identification unit is used to input the original video into an artificial intelligence AI model, and obtain a plurality of key positions in the original video output by the AI model, and the AI model is pre-marked with laughter and/or Or applause marked sample videos to complete the training;

或，or,

第二识别单元，用于将所述原始视频中的音频数据与笑声和/或掌声对应的音频数据进行声纹特征的匹配，得到声纹特征匹配的所述多个关键位置。The second identification unit is configured to perform voiceprint feature matching between the audio data in the original video and the audio data corresponding to laughter and/or applause to obtain the multiple key positions matched by the voiceprint feature.

第四确定单元，用于从所述原始视频中确定出多个初始关键位置；a fourth determining unit, configured to determine a plurality of initial key positions from the original video;

调整单元，用于利用光学字符识别技术对所述多个初始关键位置进行调整，得到所述多个关键位置，以使得每个所述关键位置为字幕开始显示或字幕结束显示的位置。The adjustment unit is configured to adjust the multiple initial key positions by using the optical character recognition technology to obtain the multiple key positions, so that each of the key positions is the position where the subtitles start to be displayed or the subtitles end to be displayed.

第三方面，本申请实施例还提供了一种计算设备，该计算设备可以包括处理器以及存储器：In a third aspect, an embodiment of the present application further provides a computing device, and the computing device may include a processor and a memory:

所述存储器用于存储计算机程序；the memory is used to store computer programs;

所述处理器用于根据所述计算机程序执行上述第一方面以及第一方面中任一种实施方式所述的方法。The processor is configured to execute the method according to the first aspect and any one of the implementation manners of the first aspect according to the computer program.

第四方面，本申请实施例还提供了一种计算机可读存储介质，其特征在于，所述计算机可读存储介质用于存储计算机程序，所述计算机程序用于执行上述第一方面以及第一方面中任一种实施方式所述的方法。In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the first aspect and the first The method of any embodiment of the aspect.

在本申请实施例的上述实现方式中，获取待处理的原始视频，并识别该原始视频中的多个关键位置以及转场位置，该关键位置用于指示原始视频中的视频片段，从而可以根据该多个关键位置以及转场位置，从原始视频中切分得到多个视频片段，并基于该多个视频片段拼接得到目标视频，其中，所生成的目标视频的播放时长小于原始视频的播放时长。In the above implementation manner of the embodiment of the present application, the original video to be processed is acquired, and a plurality of key positions and transition positions in the original video are identified, and the key positions are used to indicate the video clips in the original video, so that the The multiple key positions and transition positions are divided into multiple video clips from the original video, and a target video is obtained by splicing based on the multiple video clips, wherein the playback duration of the generated target video is shorter than the playback duration of the original video. .

由于原始视频中的转场位置，通常表征了一段连续的视频内容在该转场位置处终结，因此，根据转场位置生成目标视频时，通常可以使得目标视频中的各个剪辑片段在起始位置和/或终止位置处的视频内容较为完整。如此，用户在观看该目标视频时，通常会因为视频内容的完整而认为观感连贯性较高，从而可以提高用户观看剪辑出的目标视频的体验。Due to the transition position in the original video, it usually indicates that a continuous video content ends at the transition position. Therefore, when the target video is generated according to the transition position, each clip in the target video can usually be made at the starting position. and/or the video content at the end position is relatively complete. In this way, when the user watches the target video, the user usually thinks that the video content is complete and has a high look and feel consistency, so that the user's experience of watching the edited target video can be improved.

附图说明Description of drawings

为了更清楚地说明本申请实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请中记载的一些实施例，对于本领域普通技术人员来讲，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments. Obviously, the drawings in the following description are only some implementations described in the present application. For example, for those skilled in the art, other drawings can also be obtained from these drawings.

图1为本申请实施例中一示例性应用场景示意图；FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;

图2为本申请实施例中一种视频剪辑方法的流程示意图；2 is a schematic flowchart of a video editing method in an embodiment of the present application;

图3为本申请实施例中一种视频剪辑装置的结构示意图；3 is a schematic structural diagram of a video editing device in an embodiment of the application;

图4为本申请实施例中一种计算设备的硬件结构示意图。FIG. 4 is a schematic diagram of a hardware structure of a computing device according to an embodiment of the present application.

具体实施方式Detailed ways

参见图1，为本申请实施例提供的一种应用场景示意图。在图1所示应用场景中，客户端101可以与计算设备102存在通信连接。并且，客户端101可以接收用户(如视频剪辑人员)提供的视频，并将其发送给计算设备102；计算设备102用于对接收到的一个或者多个视频进行AI剪辑，生成剪辑视频，并将该剪辑视频通过客户端101呈现给用户。Referring to FIG. 1 , it is a schematic diagram of an application scenario provided by an embodiment of the present application. In the application scenario shown in FIG. 1 , the client 101 may have a communication connection with the computing device 102 . In addition, the client 101 can receive the video provided by the user (such as a video editor), and send it to the computing device 102; the computing device 102 is used to perform AI editing on one or more received videos, generate a clipped video, and The clipped video is presented to the user through the client 101 .

其中，计算设备102，是指具有数据处理能力的设备，例如可以是终端、服务器等。客户端101可以应用于独立于计算设备102的物理设备中。例如当计算设备102通过服务器实现时，客户端101可以运行于用户侧的用户终端等设备。或者，客户端101也可以运行于计算设备102上。The computing device 102 refers to a device with data processing capability, such as a terminal, a server, and the like. Client 101 may be implemented in a physical device separate from computing device 102 . For example, when the computing device 102 is implemented by a server, the client 101 may run on a device such as a user terminal on the user side. Alternatively, client 101 may also run on computing device 102 .

实际应用时，计算设备102基于预设的AI算法所生成的剪辑视频，通常存在观感不连贯的问题，从而会影响用户对于剪辑视频的观感。比如，假设原始视频中包括一段对话，其中，人物A问道：“你最近都没有去健身房锻炼么”，人物B回答道：“没有”，而计算设备102基于AI算法可能会在人物A说话结束的位置处开始截断，则剪辑得到的视频片段中，仅包括人物B的回答内容(“没有”)，这使得用户在观看剪辑视频中人物B的说话内容“没有”时，会觉得不知所云，即不知道人物B是针对什么内容回答“没有”。又比如，人物A连续说了多句话，而计算设备102基于AI算法仅截取了该人物A说的部分语句的视频片段，从而导致该人物A的演讲内容存在明显缺失。In practical application, the video clip generated by the computing device 102 based on the preset AI algorithm usually has the problem of incoherent look and feel, which will affect the user's look and feel of the clip video. For example, suppose the original video includes a dialogue in which Person A asks, "Have you not been to the gym recently?" Person B replies: "No," and the computing device 102 may speak to Person A based on an AI algorithm. The truncation starts at the end position, then the edited video clip only includes the answer content (“no”) of character B, which makes the user feel confused when watching the content of character B’s speech “no” in the edited video Cloud, that is, I don't know what character B is answering "no" to. For another example, character A speaks multiple sentences continuously, but the computing device 102 only intercepts video clips of some sentences spoken by the character A based on the AI algorithm, resulting in obvious lack of speech content of the character A.

基于此，本申请实施例提供了一种视频剪辑方法，旨在提高生成的剪辑视频的观感连贯性，以此提高用户针对剪辑视频的观看体验。具体实现时，计算设备102获取待处理的原始视频，并识别该原始视频中的多个关键位置以及转场位置，该关键位置用于指示原始视频中的视频片段，从而计算设备102可以根据该多个关键位置以及转场位置，从原始视频中切分得到多个视频片段，并基于该多个视频片段拼接得到目标视频，其中，所生成的目标视频的播放时长小于原始视频的播放时长。Based on this, an embodiment of the present application provides a video editing method, which aims to improve the look-and-feel consistency of the generated edited video, so as to improve the user's viewing experience for the edited video. In specific implementation, the computing device 102 acquires the original video to be processed, and identifies multiple key positions and transition positions in the original video, where the key positions are used to indicate video segments in the original video, so that the computing device 102 can A plurality of key positions and transition positions are obtained by dividing a plurality of video clips from the original video, and a target video is obtained by splicing based on the plurality of video clips, wherein the playback duration of the generated target video is shorter than the playback duration of the original video.

由于原始视频中的转场位置，通常表征了一段连续的视频内容在该转场位置处终结，因此，根据转场位置生成目标视频时(也即生成前述的剪辑视频)，通常可以使得目标视频中的各个剪辑片段在起始位置和/或终止位置处的视频内容较为完整。如此，用户在观看该目标视频时，通常会因为视频内容的完整而认为观感连贯性较高，从而可以提高用户观看剪辑出的目标视频的体验。Due to the transition position in the original video, it usually indicates that a continuous video content ends at the transition position. Therefore, when the target video is generated according to the transition position (that is, the aforementioned clip video is generated), it is usually possible to make the target video The video content at the start position and/or the end position of each clip in the video is relatively complete. In this way, when the user watches the target video, the user usually thinks that the video content is complete and has a high look and feel consistency, so that the user's experience of watching the edited target video can be improved.

需要说明的是，本实施例中的视频，是指同时具有图像与音频内容的视频，即一份视频文件中，不仅包括连续多帧的视频图像，还包括与该视频图像同步的音频数据。It should be noted that the video in this embodiment refers to a video having both image and audio content, that is, a video file includes not only video images of multiple consecutive frames, but also audio data synchronized with the video images.

可以理解的是，图1所示的应用场景的架构仅是本申请实施例提供的一个示例，实际应用时，本申请实施例也可以应用于其它可适用的场景中，如计算设备102可以自动从互联网中获取一个或者多个视频，并通过上述实现方式自动生成各个视频对应的剪辑视频等。总之，本申请实施例可以应用于任何可适用的场景中，而不局限于上述场景示例。It can be understood that the architecture of the application scenario shown in FIG. 1 is only an example provided by the embodiment of the present application. In practical application, the embodiment of the present application can also be applied to other applicable scenarios. For example, the computing device 102 can automatically Obtain one or more videos from the Internet, and automatically generate clip videos corresponding to each video through the above implementation manner. In conclusion, the embodiments of the present application can be applied to any applicable scenario, and are not limited to the above scenario examples.

为使本申请的上述目的、特征和优点能够更加明显易懂，下面将结合附图对本申请实施例中的各种非限定性实施方式进行示例性说明。显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例，都属于本申请保护的范围。In order to make the above objects, features and advantages of the present application more clearly understood, various non-limiting implementations in the embodiments of the present application will be exemplarily described below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

参阅图2，图2示出了本申请实施例中一种视频剪辑方法流程示意图，该方法可以应用于图1所示的应用场景中，或者可以应用于其它可适用的应用场景等。为便于说明与理解，下面应用于图1所示的应用场景为例进行说明。该方法具体可以包括：Referring to FIG. 2, FIG. 2 shows a schematic flowchart of a video editing method in an embodiment of the present application. The method may be applied to the application scenario shown in FIG. 1, or may be applied to other applicable application scenarios. For ease of description and understanding, the following description is given by applying the application scenario shown in FIG. 1 as an example. Specifically, the method may include:

S201：获取待处理的原始视频。S201: Obtain the original video to be processed.

为便于区分和描述，本实施例中将需要进行剪辑的视频称之为原始视频，而对于剪辑生成的视频，称之为目标视频。For the convenience of distinction and description, in this embodiment, the video to be edited is referred to as the original video, and the video generated by the clip is referred to as the target video.

在一种可能的实施方式中，原始视频可以是由用户提供给计算设备102。具体地，客户端101可以向用户呈现视频导入界面，从而用户可以在该视频导入界面上通过执行相应的操作，将原始视频导入该客户端101。然后，客户端101可以将用户提供的原始视频，通过与计算设备102之间的网络连接，将其传输给计算设备102。In one possible implementation, the original video may be provided to computing device 102 by the user. Specifically, the client 101 can present a video import interface to the user, so that the user can import the original video into the client 101 by performing corresponding operations on the video import interface. Then, the client 101 can transmit the original video provided by the user to the computing device 102 through the network connection with the computing device 102 .

而在另一种可能的实施方式中，原始视频也可以是由计算设备102从互联网中获取。例如，用户可以通过客户端101向计算设备102发送生成剪辑视频的指令，从而计算设备102可以基于该指令，从互联网中下载特定类型的视频，如下载脱口秀类型的视频或者相声类型的视频、下载观察类综艺的视频等，并将这些视频作为原始视频，以便后续对这些原始视频进行剪辑处理。In another possible implementation, the original video may also be obtained by the computing device 102 from the Internet. For example, the user can send an instruction to generate a clip video to the computing device 102 through the client 101, so that the computing device 102 can download a specific type of video from the Internet based on the instruction, such as downloading a talk show type video or a cross talk type video, Download videos of viewing variety shows, etc., and use these videos as original videos for subsequent editing of these original videos.

值得注意的是，计算设备102所获取的原始视频，可以是一个视频，也可以是多个视频，如计算设备102可以基于多个原始视频，剪辑生成一个目标视频等，本实施例对此并不进行限定。为便于理解与说明，本实施例中，以原始视频为一个视频为例进行说明，当原始视频包括多个视频时，其实现方式与本实施例类似，其区别在于后续拼接的多个视频片段来源于多个不同的原始视频。It is worth noting that the original video acquired by the computing device 102 may be one video or multiple videos. For example, the computing device 102 may generate a target video based on multiple original videos and clipping, etc. This embodiment does not Not limited. For ease of understanding and description, in this embodiment, the original video is taken as an example for description. When the original video includes multiple videos, the implementation is similar to this embodiment, and the difference lies in the subsequent splicing of multiple video segments. Sourced from multiple different raw videos.

S202：识别原始视频中的多个关键位置以及转场位置，该关键位置用于指示原始视频中的视频片段。S202: Identify multiple key positions and transition positions in the original video, where the key positions are used to indicate video segments in the original video.

其中，关键位置，是指用于指示原始视频中的视频片段的位置，在对原始视频进行剪辑时可以基于该关键位置确定所剪辑出的视频片段的起始分割点以及终止分割点。并且，针对不同类型的原始视频，可以采用不同类别的位置作为关键位置。Wherein, the key position refers to the position used to indicate the video clip in the original video, and the start and end split points of the clipped video clip can be determined based on the key position when the original video is edited. And, for different types of original videos, different categories of locations can be used as key locations.

在一种示例中，计算设备102可以通过原始视频中的音频特征识别关键位置。比如，当原始视频为第一类型的视频时，如脱口秀、相声等搞笑类型的视频等，由于实际应用场景中，具有笑声和/或掌声的视频内容，通常会更加吸引用户，因此，计算设备102可以将该原始视频中音频内容为笑声和/或掌声的位置确定为关键位置。In one example, computing device 102 may identify key locations through audio features in the raw video. For example, when the original video is the first type of video, such as talk show, cross talk and other funny type videos, due to the actual application scenario, the video content with laughter and/or applause is usually more attractive to users. Therefore, The computing device 102 may determine a position in the original video where the audio content is laughter and/or applause as a key position.

在一种识别关键位置的实施方式中，计算设备102可以利用AI模型识别原始视频中的多个关键位置。具体实现时，计算设备102中可以预先配置有AI模型，并且该AI模型预先通过具有“笑声”和/或“掌声”标记的视频样本完成训练，从而完成训练的AI模型可以识别出视频中的“笑声”和/或“掌声”。因此，针对第一类型的原始视频，计算设备102可以将该原始视频输入至完成训练的AI模型中，得到该AI模型输出的原始视频中的多个关键位置。In one embodiment of identifying key locations, computing device 102 may utilize an AI model to identify multiple key locations in the raw video. When specifically implemented, the computing device 102 may be pre-configured with an AI model, and the AI model is pre-trained through video samples marked with "laughter" and/or "applause", so that the trained AI model can identify "Laughter" and/or "Applause". Therefore, for the original video of the first type, the computing device 102 may input the original video into the AI model that has completed the training, and obtain multiple key positions in the original video output by the AI model.

而在另一种识别关键位置的实施方式中，计算设备102可以通过比对声纹特征的方式确定原始视频中的多个关键位置。具体地，计算设备102可以获取具有“笑声”和/或“掌声”内容的音频数据，并提取出该将“笑声”和/或“掌声”的声纹特征，然后，计算设备102可以将声纹特征与原始视频中的音频数据对应的声纹特征进行逐段比对，并将声纹特征一致的音频数据位置，确定为关键位置，以此确定出原始视频中的多个关键位置。In another embodiment of identifying key positions, the computing device 102 may determine multiple key positions in the original video by comparing voiceprint features. Specifically, the computing device 102 may acquire audio data with the content of "laughter" and/or "applause", and extract the voiceprint features of the "laughter" and/or "applause", and then the computing device 102 may Compare the voiceprint feature with the voiceprint feature corresponding to the audio data in the original video segment by segment, and determine the location of the audio data with the same voiceprint feature as the key location, so as to determine multiple key locations in the original video. .

需要说明的是，上述确定关键位置的实现方式进作为一些示例性说明，实际应用时，计算设备102也可以是通过其它方式确定出原始视频中的关键位置，本实施例对此并不进行限定。It should be noted that the above-mentioned implementation manner of determining the key position is provided as some exemplary descriptions. In practical application, the computing device 102 may also determine the key position in the original video by other means, which is not limited in this embodiment. .

而在另一种示例中，计算设备102可以通过原始视频中的图像特征识别关键位置。比如，当原始视频为第二类型的视频时，如观察类综艺的视频等，可以将原始视频的其中一条字幕的字幕开始显示和/或字幕显示结束的位置确定为关键位置。当然，在其它实施例中，关键位置也可以是其它可能的实现方式，本实施例对此并不进行限定。In yet another example, computing device 102 may identify key locations through image features in the raw video. For example, when the original video is a video of the second type, such as a video of viewing variety shows, the position where the subtitles of one of the subtitles in the original video start to be displayed and/or the subtitles are displayed may end may be determined as the key position. Certainly, in other embodiments, the key position may also be other possible implementation manners, which are not limited in this embodiment.

在一种可能的实施方式中，计算设备102可以从原始视频中确定出多个初始关键位置，并通过光学字符识别(Optical Character Recognition，OCR)技术对多个初始关键位置进行调整，得到多个关键位置，以使得每个关键位置为字幕开始显示或字幕结束显示的位置。例如，计算设备102可以按照播放时长(如30秒等)随机选取两个位置作为一段视频的起始位置以及终止位置，得到该段视频对应的两个初始关键位置。然后，计算设备102可以利用OCR技术识别该起始位置处的视频图像中字幕，并将该字幕在原始视频中的开始显示位置作为关键位置。并且，计算设备102还可以利用OCR技术识别该终止位置处的视频图像中字幕，并将该字幕在原始视频中的结束显示位置作为关键位置等。如此，计算设备102可以通过上述方式确定出多个视频片段分别对应的关键位置。In a possible implementation, the computing device 102 may determine a plurality of initial key positions from the original video, and adjust the plurality of initial key positions through an optical character recognition (Optical Character Recognition, OCR) technology to obtain a plurality of initial key positions. Key positions, so that each key position is the position where the subtitles start to be displayed or the subtitles end to be displayed. For example, the computing device 102 may randomly select two positions as the start position and the end position of a video according to the playback duration (eg, 30 seconds, etc.), and obtain two initial key positions corresponding to the video. Then, the computing device 102 can identify the subtitle in the video image at the starting position by using the OCR technology, and use the starting display position of the subtitle in the original video as a key position. In addition, the computing device 102 can also use the OCR technology to identify the subtitle in the video image at the end position, and use the end display position of the subtitle in the original video as a key position and the like. In this way, the computing device 102 can determine the key positions corresponding to the multiple video clips in the above manner.

转场位置，是指原始视频中的人物类型发生切换的位置，如由表演者切换为观众或者嘉宾席等的位置，或者由表演者切换至不包括人物的场景的位置，可以是因为拍摄镜头的切换而导致拍摄的人物类型发生切换等。实际应用时，当原始视频中的人物类型发生切换时，通常表示与该人物相关的一段视频内容暂时结束，比如该人物所演讲的一个论据/论点结束时，通常可以将拍摄镜头切换至观众席以拍摄观众对于该人物演讲内容所做出的反应(如开怀大笑、点头、摇头等)，因此，计算设备102在对原始视频进行剪辑时，可以利用该转场位置确定所剪辑出的视频片段的边界点。当然，在其它可能的实施例中，转场位置，也可以是原始视频中的其它信息发生切换的位置，如场景的切换等，本实施例对此并不进行限定。The transition position refers to the position where the character type in the original video is switched, such as the position where the performer switches to the audience or the guest seat, or the position where the performer switches to a scene that does not include characters, which may be due to the shooting scene. The switching of the camera leads to the switching of the type of person to be photographed, etc. In practical applications, when the character type in the original video is switched, it usually means that a piece of video content related to the character ends temporarily. For example, when an argument/argument made by the character ends, the shooting lens can usually be switched to the auditorium. In order to capture the audience's reaction to the speech content of the character (such as laughing, nodding, shaking his head, etc.), the computing device 102 can use the transition position to determine the edited video when editing the original video. The boundary point of the fragment. Of course, in other possible embodiments, the transition position may also be a position where other information in the original video is switched, such as scene switching, which is not limited in this embodiment.

在一种可能的实施方式中，计算设备102可以通过比较相邻的两帧图像之间的差异，确定转场位置。具体实现时，计算设备102可以计算原始视频中相邻的第一帧图像与第二帧图像之间的相似度，并且，当第一帧图像与第二帧图像之间的显示度小于预设阈值时，计算设备102可以将第一帧图像或者第二帧图像在原始视频中的位置确定为转场位置。比如，第一帧图像为拍摄表演者所得到的图像，第二帧图像为拍摄观众所得到的图像等。其中，计算设备102可以通过遍历计算的方式，确定出原始视频中的所有转场位置，如计算设备102可以通过依次比较连续两个相邻的视频图像之间的图像相似度，来确定转场点。或者，计算设备102可以将关键位置附近的多帧图像进行相似度计算，以确定关键位置附近是否包括转场位置。In a possible implementation, the computing device 102 may determine the transition position by comparing the difference between two adjacent frames of images. During specific implementation, the computing device 102 may calculate the similarity between the adjacent first frame image and the second frame image in the original video, and when the display degree between the first frame image and the second frame image is smaller than the preset When the threshold is set, the computing device 102 may determine the position of the first frame image or the second frame image in the original video as the transition position. For example, the first frame of image is an image obtained by photographing a performer, and the second frame image is an image obtained by photographing an audience. The computing device 102 can determine all transition positions in the original video by traversing the calculation method. For example, the computing device 102 can determine the transition by sequentially comparing the image similarity between two consecutive adjacent video images. point. Alternatively, the computing device 102 may perform similarity calculation on multiple frames of images in the vicinity of the key position to determine whether the transition position is included in the vicinity of the key position.

示例性地，计算设备102在计算两帧图像之间相似度时，可以先将这两帧图像分别缩小至8像素*8像素的尺寸，即缩小后的每帧图像均具有64个像素。这一步的作用是为了去除图像的细节，只保留图像中的结构/明暗等基本信息，降低后续的计算量。然后，计算设备102可以对缩小后的两帧图像进行灰度处理，并分别计算出每帧图像的平均灰度值(即每帧图像中的64个灰度值的平均值)。接着，计算设备102将每帧图像中各个像素的灰度值与该帧图像对应的平均灰度值进行比较，并且，当像素的灰度值大于或者等于平均灰度值时，将该像素标记为1，而当像素的灰度值小于平均灰度值时，将该像素标记为0，从而每帧图像中的64个像素按照统一规则进行组合，可以生成64位的哈希值(由1以及0组成)，该哈希值可以作为该帧图像的指纹。这样，计算设备102可以比较两帧图像分别对应的64位哈希值，当这两个哈希值中存在差异的位数超过预设值(如5等)时，计算设备102确定这两帧图像相似度较小，而当这两个哈希值中存在差异的位数不超过该预设值时，计算设备102确定这两帧图像相似度较大。实际应用时，计算设备102也可以是通过其它方式确定两帧图像之间的相似度，本实施例对此并不进行限定。Exemplarily, when calculating the similarity between two frames of images, the computing device 102 may first reduce the two frames of images to a size of 8 pixels*8 pixels, that is, each frame of images after the reduction has 64 pixels. The function of this step is to remove the details of the image, and only retain the basic information such as structure/light and shade in the image to reduce the amount of subsequent calculations. Then, the computing device 102 may perform grayscale processing on the reduced two frames of images, and respectively calculate the average grayscale value of each frame image (ie, the average value of 64 grayscale values in each frame image). Next, the computing device 102 compares the gray value of each pixel in each frame of image with the average gray value corresponding to the frame image, and, when the gray value of the pixel is greater than or equal to the average gray value, marks the pixel is 1, and when the gray value of the pixel is less than the average gray value, the pixel is marked as 0, so that the 64 pixels in each frame of image are combined according to uniform rules, and a 64-bit hash value can be generated (by 1 and 0), the hash value can be used as the fingerprint of the frame image. In this way, the computing device 102 can compare the 64-bit hash values corresponding to the two frames of images, and when the number of bits that differ between the two hash values exceeds a preset value (eg, 5, etc.), the computing device 102 determines that the two frames The similarity of the images is relatively small, and when the number of bits that differ between the two hash values does not exceed the preset value, the computing device 102 determines that the similarity of the two frames of images is relatively high. In practical application, the computing device 102 may also determine the similarity between two frames of images in other ways, which is not limited in this embodiment.

S203：根据识别出的多个关键位置以及转场位置，从原始视频中切分得到多个视频片段。S203: According to the identified multiple key positions and transition positions, segment the original video to obtain multiple video segments.

本实施例中，计算设备102可以根据多个关键位置以及转场位置，对原始视频进行切分，得到多个视频片段。In this embodiment, the computing device 102 may segment the original video according to multiple key positions and transition positions to obtain multiple video segments.

在一种可能的实施方式中，计算设备102可以先根据该原始视频中的多个关键位置，确定原始视频中的多个候选视频片段对应的起始分割点以及终止分割点。其中，起始分割点，是指候选视频片段的起始点，该起始分割点处的视频图像即为该候选视频片段的第一帧图像。相应的，终止分割点，是指该候选视频片段的终止点，该终止分割点处的视频图像即为该候选视频片段的最后一帧图像。例如，当原始视频为第一类型的视频时，计算设备102可以将关键位置的前15(或者其它数值)秒处的播放位置确定为候选视频片段的起始分割点，将该关键位置的后1(或者其它数值)秒处的播放位置确定为该候选视频片段的终止分割点。In a possible implementation manner, the computing device 102 may first determine, according to multiple key positions in the original video, start and end division points corresponding to multiple candidate video segments in the original video. The starting segmentation point refers to the starting point of the candidate video segment, and the video image at the starting segmentation point is the first frame image of the candidate video segment. Correspondingly, the termination division point refers to the termination point of the candidate video segment, and the video image at the termination division point is the last frame image of the candidate video segment. For example, when the original video is a video of the first type, the computing device 102 may determine the playback position at the first 15 (or other numerical values) seconds of the key position as the starting split point of the candidate video segment, and the post-key position The playback position at 1 (or other numerical value) seconds is determined as the termination split point of the candidate video segment.

然后，计算设备102可以确定原始视频中与目标候选视频片段的起始分割点之间的距离不超过第一预设距离的多帧第一视频图像中是否包括转场位置，以及原始视频中与该目标候选视频片段的终止分割点之间的距离不超过的第二预设距离的多帧第二视频图像中是否包括转场位置，该目标候选视频片段为上述多个候选视频片段中的任一候选视频片段。其中，与起始分割点之间的距离，可以表征为视频图像在原始视频中的位置与该起止分割点之间包括的视频帧的数量(或者这些视频帧数量对应的播放时长等)，并且，视频帧的数量越大，表征距离越远，视频帧的数量越小，表征距离越近。相应的，第一预设距离，例如可以是预设数量的视频帧数(或者这些视频帧数量对应的播放时长等)，如连续450帧图像等。类似的，与终止分割点之间的距离，可以表征为视频图像在原始视频中的位置与该起止分割点之间包括的视频帧的数量，或者表征为这些视频帧数量对应的播放时长等。相应的，第二预设距离，可以是预设数量的视频帧数，或者这些视频帧数量对应的播放时长等。其中，第一预设距离的大小与第二预设距离的大小可以相同，也可以是不同。Then, the computing device 102 may determine whether a transition position is included in the multi-frame first video images whose distance from the starting segmentation point of the target candidate video segment does not exceed the first preset distance in the original video, and whether the transition position is included in the original video. Whether a transition position is included in the multi-frame second video images of which the distance between the termination points of the target candidate video segment does not exceed the second preset distance, the target candidate video segment is any of the above-mentioned candidate video segments a candidate video segment. Wherein, the distance from the starting split point can be characterized as the number of video frames included between the position of the video image in the original video and the starting and ending split points (or the playback duration corresponding to the number of these video frames, etc.), and , the larger the number of video frames, the farther the representation distance, the smaller the number of video frames, the closer the representation distance. Correspondingly, the first preset distance may be, for example, a preset number of video frames (or a playback duration corresponding to the number of video frames, etc.), such as 450 consecutive frames of images. Similarly, the distance from the end split point can be represented by the number of video frames included between the position of the video image in the original video and the start and end split point, or the playback duration corresponding to the number of video frames. Correspondingly, the second preset distance may be a preset number of video frames, or a playback duration corresponding to the number of video frames. The size of the first preset distance and the size of the second preset distance may be the same or different.

当多帧第一视频图像中包括转场位置，和/或，多帧第二视频图像中包括转场位置时，计算设备102可以根据多帧第一视频图像中的转场位置和/或多帧第二视频图像中的转场位置，从原始视频中切分得到目标候选视频片段。具体的，当多帧第一视频图像中包括转场位置时，计算设备102可以将目标候选视频片段的起始分割点更新为该多帧第一视频图像中的转场位置，并根据该转场位置以及目标候选视频片段的终止分割点，从原始视频中切分得到目标候选视频片段。当多帧第二视频图像中内包括转场位置时，计算设备102可以将目标候选视频片段的终止分割点更新为该多帧第二视频图像中的转场位置，并根据该转场位置以及目标候选视频片段的起始分割点，从原始视频中切分得到目标候选视频片段。当多帧第一视频图像中以及多帧第二视频图像中均包括转场位置时，计算设备102可以将目标候选视频片段的起始分割点更新为该多帧第一视频图像中的转场位置，将目标候选视频片段的终止分割点更新为该多帧第二视频图像中的转场位置，并根据这两个转场位置，从原始视频中切分得到目标候选视频片段。如此，计算设备102可以根据多个关键位置以及转场位置，切分得到多个视频片段。When the multiple frames of first video images include transition positions, and/or the multiple frames of second video images include transition positions, the computing device 102 Frame the transition position in the second video image, and segment the original video to obtain the target candidate video segment. Specifically, when the multi-frame first video image includes a transition position, the computing device 102 may update the starting segmentation point of the target candidate video segment to the transition position in the multi-frame first video image, and according to the transition position The field position and the end segmentation point of the target candidate video segment are segmented from the original video to obtain the target candidate video segment. When a transition position is included in the multiple frames of the second video image, the computing device 102 may update the end split point of the target candidate video segment to the transition position in the multiple frames of the second video image, and according to the transition position and The starting segmentation point of the target candidate video segment, which is segmented from the original video to obtain the target candidate video segment. When both the multi-frame first video images and the multi-frame second video images include transition positions, the computing device 102 may update the starting segmentation point of the target candidate video segment to the transition in the multi-frame first video images position, update the end split point of the target candidate video segment to the transition position in the multi-frame second video image, and segment the original video to obtain the target candidate video segment according to the two transition positions. In this way, the computing device 102 can obtain multiple video clips by segmenting according to multiple key positions and transition positions.

由于计算设备102可以根据转场位置对目标候选视频片段的起始分割点和/或终止分割点进行调整，因此，基于该转场位置所剪辑得到的视频片段中，通常可以包括相对完整的视频内容。如此，当用户观看该视频片段时，可以避免出现观感连贯性较低的问题。比如，假设原始视频中包括一段关于人物A以及人物B的对话，人物A问道：“你最近都没有去健身房锻炼么”，人物B回答道：“没有”，并且，在人物B说完之后出现镜头转场，则计算设备102可以根据识别的转场位置，将视频片段的终止分割点确定为人物B说完“没有”之后，这样，用户后续在观看该视频片段时，由于该视频片段包括人物A以及人物B的对话内容，因此，该视频片段的观感连贯性相对较高。Since the computing device 102 can adjust the start split point and/or the end split point of the target candidate video clip according to the transition position, the video clip edited based on the transition position can usually include a relatively complete video content. In this way, when the user watches the video clip, the problem of low consistency of look and feel can be avoided. For example, suppose the original video includes a conversation about Person A and Person B, Person A asks: "Have you been to the gym recently?" Person B replies: "No", and, after Person B finishes speaking When a scene transition occurs, the computing device 102 can determine the end split point of the video clip as after the character B has said "no" according to the identified transition position. Including the dialogue content of character A and character B, therefore, the visual continuity of this video clip is relatively high.

S204：基于多个视频片段，拼接得到目标视频，该目标视频的播放时长小于原始视频的播放时长。S204: Based on a plurality of video clips, a target video is obtained by splicing, and the playback duration of the target video is shorter than the playback duration of the original video.

在从原始视频中剪辑出包括多个视频片段后，计算设备102可以对该多个视频片段进行拼接，生成目标视频。其中，计算设备102可以按照各个视频片段在原始视频中的播放顺序进行顺序拼接，或者可以采用其它顺序进行拼接等，本实施例对此并不进行限定。After editing and including multiple video segments from the original video, the computing device 102 may splicing the multiple video segments to generate a target video. The computing device 102 may perform sequential splicing according to the playing order of each video clip in the original video, or may perform splicing in other sequences, etc., which is not limited in this embodiment.

实际应用时，剪辑生成的目标视频的播放时长具有一定的限制。比如，对于播放时长为2小时的原始视频，对其进行剪辑生成的目标视频的播放时长可以不超过10分钟。因此，通过视频剪辑所生成的目标视频的播放时长，通常小于原始视频的播放时长。In practical applications, the playback duration of the target video generated by clipping is limited. For example, for an original video with a playback duration of 2 hours, the playback duration of a target video generated by editing it may not exceed 10 minutes. Therefore, the playback duration of the target video generated by the video clip is usually shorter than the playback duration of the original video.

进一步的，若切分得到的多个视频片段对应的播放总时长大于所要生成的目标视频的最大播放时长时，计算设备102从多个视频片段中，挑选部分视频片段来生成目标视频。比如，计算设备102可以选取该多个视频片段中播放时长相对较长的部分视频片段来生成目标视频，或者可以随机选择部分视频片段来生成目标视频等，本实施例对此并不进行限定。Further, if the total playing duration corresponding to the multiple video clips obtained by segmentation is greater than the maximum playing duration of the target video to be generated, the computing device 102 selects some video clips from the multiple video clips to generate the target video. For example, the computing device 102 may select some video segments with a relatively long playback duration among the plurality of video segments to generate the target video, or may randomly select some video segments to generate the target video, etc., which is not limited in this embodiment.

本实施例中，由于原始视频中的转场位置，通常表征了一段连续的视频内容在该转场位置处终结，因此，根据转场位置生成目标视频时，通常可以使得目标视频中的各个剪辑片段在起始位置和/或终止位置处的视频内容较为完整。如此，用户在观看该目标视频时，通常会因为视频内容的完整而认为观感连贯性较高，从而可以提高用户观看剪辑出的目标视频的体验。In this embodiment, due to the transition position in the original video, it usually indicates that a piece of continuous video content ends at the transition position. Therefore, when the target video is generated according to the transition position, each clip in the target video can usually be The video content of the segment is relatively complete at the start position and/or end position. In this way, when the user watches the target video, the user usually thinks that the video content is complete and has a high look and feel consistency, so that the user's experience of watching the edited target video can be improved.

为便于理解与说明，下面以原始视频分别为脱口秀类综艺视频和观察类综艺视频为例进行举例说明。For ease of understanding and explanation, the following is an example of the original video being a talk show variety show video and an observation variety show video.

计算设备102在获取到该脱口秀类综艺视频后，可以通过AI模型识别该脱口秀视频中存在笑声和/或掌声的位置，或者通过声纹特征比对的方式确定该脱口秀视频中存在笑声和/或掌声的关键位置等。然后，针对每个关键位置，计算设备102可以进一步确定出该关键位置的前15秒处的播放位置(即起始分割点)以及后3秒处的播放位置(终止分割点)，并识别该关键位置至前15秒处的播放位置之间的多帧第一视频图像中是否包括转场位置，以及识别该关键位置至后3秒处的播放位置之间的多帧第二视频图像中是否包括转场位置。如果多帧第一视频图像中包括转场位置，则计算设备102可以将该视频片段的起始分割点调整为该转场位置，和/或，如果多帧第二视频图像中包括转场位置，则将该视频片段转场位置的终止分割点调整为该转场位置，从而基于调整后的起始分割点和/或终止分割点从原始视频中切分出该视频片段。如此，针对多个关键位置，计算设备102可以切分出来多个包括笑声和/或掌声的视频片段。最后，计算设备102可以对切分出来的多个视频片段进行拼接，生成脱口秀视频集锦，也即用户所期望的剪辑视频。After acquiring the talk show variety show video, the computing device 102 can identify the location where laughter and/or applause exists in the talk show video through the AI model, or determine that the talk show video exists in the talk show video by comparing voiceprint features. Key locations for laughter and/or applause, etc. Then, for each key position, the computing device 102 may further determine the playback position at the first 15 seconds of the key position (ie the start split point) and the playback position at the last 3 seconds (the end split point), and identify the Whether the transition position is included in the multi-frame first video images between the key position and the playback position in the first 15 seconds, and whether the multi-frame second video image between the key position and the playback position in the next 3 seconds is identified Including transition positions. If a transition position is included in the multiple frames of the first video image, the computing device 102 may adjust the starting split point of the video clip to the transition position, and/or if the multiple frames of the second video image include a transition position , the end split point of the transition position of the video clip is adjusted to the transition position, so that the video clip is cut out from the original video based on the adjusted start split point and/or end split point. As such, for multiple key locations, the computing device 102 may segment multiple video segments including laughter and/or applause. Finally, the computing device 102 may splicing the segmented video segments to generate a talk show video collection, that is, a clip video desired by the user.

计算设备102在获取到观察类综艺视频后，可以从该观察类综艺视频中截取预设时长(如30秒播放时长)的候选视频片段，该候选视频片段的起始分割点以及终止分割点可以通过OCR技术进行初步确定，具体可以是通过OCR技术，将某条字幕开始显示的位置作为候选视频片段的起始分割点，将另一条字幕结束显示的位置作为该候选视频片段的终止分割点，并且该起始分割点与终止分割点之间的多帧视频图像对应的播放时长接近该预设时长。然后，计算设备102可以识别观察类综艺视频中与该起始分割点之间的距离不超过第一预设距离(如90帧视频图像等)的多帧第一视频图像中是否包括转场位置，以及识别观察类综艺视频中与该终止分割点之间的距离不超过第二预设距离(如60帧视频图像等)的多帧第一视频图像中是否包括转场位置，并且，如果多帧第一视频图像中包括转场位置，则计算设备102可以将该视频片段的起始分割点调整为该转场位置，和/或，如果多帧第二视频图像中包括转场位置，则将该视频片段转场位置的终止分割点调整为该转场位置，从而基于调整后的起始分割点和/或终止分割点从原始视频中切分出视频片段。如此可以从观察类综艺视频中切分出多个视频片段。最后，计算设备102可以对切分出来的多个视频片段进行拼接，生成观察类综艺视频的剪辑视频。After acquiring the observation type variety show video, the computing device 102 can intercept a candidate video segment with a preset duration (such as a 30-second playback duration) from the observation type variety show video, and the start and end segmentation points of the candidate video segment can be. Preliminarily determined by OCR technology, specifically, through OCR technology, the position where a certain subtitle starts to be displayed is used as the starting segmentation point of the candidate video segment, and the position where another subtitle is displayed at the end is used as the termination segmentation point of the candidate video segment, And the playback duration corresponding to the multi-frame video images between the start split point and the end split point is close to the preset duration. Then, the computing device 102 can identify whether a transition position is included in the multiple frames of first video images whose distances from the start dividing point in the observed variety show video do not exceed a first preset distance (eg, 90 frames of video images, etc.). , and identify whether a transition position is included in the multi-frame first video images in which the distance between the viewing variety video and the termination point does not exceed the second preset distance (such as 60 frames of video images, etc.) If the frame of the first video image includes a transition position, the computing device 102 may adjust the starting split point of the video clip to the transition position, and/or, if multiple frames of the second video image include the transition position, then The end split point of the transition position of the video clip is adjusted to the transition position, so that the video clip is split from the original video based on the adjusted start split point and/or end split point. In this way, multiple video clips can be cut out from the observation type variety show video. Finally, the computing device 102 may splicing the divided video segments to generate a clipped video of the viewing variety video.

此外，本申请实施例还提供了一种视频剪辑装置。参阅图3，图3示出了本申请实施例中一种视频剪辑装置的结构示意图，该视频剪辑装置300包括：In addition, the embodiments of the present application also provide a video editing apparatus. Referring to FIG. 3, FIG. 3 shows a schematic structural diagram of a video editing apparatus in an embodiment of the present application. The video editing apparatus 300 includes:

获取模块301，用于获取待处理的原始视频；an acquisition module 301, configured to acquire the original video to be processed;

位置识别模块302，用于识别所述原始视频中的多个关键位置以及转场位置，所述关键位置用于指示所述原始视频中的视频片段；a position identification module 302, configured to identify multiple key positions and transition positions in the original video, where the key positions are used to indicate video clips in the original video;

切分模块303，用于根据所述多个关键位置以及所述转场位置，从所述原始视频中切分得到多个视频片段；A segmentation module 303, configured to segment the original video to obtain multiple video segments according to the multiple key positions and the transition positions;

拼接模块304，用于基于所述多个视频片段，拼接得到目标视频，所述目标视频的播放时长小于所述原始视频的播放时长。The splicing module 304 is used for splicing and obtaining a target video based on the plurality of video clips, where the playback duration of the target video is shorter than the playback duration of the original video.

在一种可能的实施方式中，所述位置识别模块302，包括：In a possible implementation manner, the location identification module 302 includes:

在一种可能的实施方式中，所述切分模块303，包括：In a possible implementation manner, the segmentation module 303 includes:

切分单元，用于当所述多帧第一视频图像中包括转场位置，和/或，所述多帧第二视频图像中包括转场位置时，根据所述多帧第一视频图像中的转场位置和/或所述多帧第二视频图像中的转场位置，从所述原始视频中切分得到所述目标候选视频片段。The segmentation unit is configured to, when the multi-frame first video images include transition positions, and/or, the multi-frame second video images include transition positions, according to the multi-frame first video images and/or the transition position in the multi-frame second video images, and obtain the target candidate video segment by segmenting the original video.

在一种可能的实施方式中，所述原始视频中的关键位置对应的音频内容为笑声和/或掌声，所述位置识别模块302，包括：In a possible implementation manner, the audio content corresponding to the key positions in the original video is laughter and/or applause, and the position identification module 302 includes:

或，or,

需要说明的是，上述装置各模块、单元之间的信息交互、执行过程等内容，由于与本申请实施例中方法实施例基于同一构思，其带来的技术效果与本申请实施例中方法实施例相同，具体内容可参见本申请实施例前述所示的方法实施例中的叙述，此处不再赘述。It should be noted that the information exchange, execution process and other contents between the modules and units of the above-mentioned apparatus are based on the same concept as the method embodiments in the embodiments of the present application, and the technical effects brought by them are the same as those of the method embodiments in the embodiments of the present application. The examples are the same, and the specific content can refer to the descriptions in the method embodiments shown in the foregoing embodiments of the present application, and details are not repeated here.

此外，本申请实施例还提供了一种计算设备。参阅图4，图4示出了本申请实施例中一种计算设备的硬件结构示意图，该计算设备400可以包括处理器401以及存储器402。In addition, the embodiments of the present application also provide a computing device. Referring to FIG. 4 , FIG. 4 shows a schematic diagram of a hardware structure of a computing device in an embodiment of the present application. The computing device 400 may include a processor 401 and a memory 402 .

其中，所述存储器402，用于存储计算机程序；Wherein, the memory 402 is used to store computer programs;

所述处理器401，用于根据所述计算机程序执行以下步骤：The processor 401 is configured to perform the following steps according to the computer program:

获取待处理的原始视频；Get the raw video to be processed;

在一种可能的实施方式中，所述处理器401，具体用于根据所述计算机程序执行以下步骤：In a possible implementation manner, the processor 401 is specifically configured to perform the following steps according to the computer program:

在一种可能的实施方式中，所述原始视频中的关键位置对应的音频内容为笑声和/或掌声，所述处理器401，具体用于根据所述计算机程序执行以下步骤：In a possible implementation manner, the audio content corresponding to the key position in the original video is laughter and/or applause, and the processor 401 is specifically configured to perform the following steps according to the computer program:

将所述原始视频输入至人工智能AI模型，得到所述AI模型输出的所述原始视频中的多个关键位置，所述AI模型预先通过带有笑声标记和/或掌声标记的样本视频完成训练；Inputting the original video into an artificial intelligence AI model to obtain multiple key positions in the original video output by the AI model, and the AI model is completed in advance through sample videos marked with laughter and/or applause train;

另外，本申请实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质用于存储计算机程序，所述计算机程序用于执行上述方法实施例中所述的方法。In addition, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the method described in the foregoing method embodiment.

通过以上的实施方式的描述可知，本领域的技术人员可以清楚地了解到上述实施例方法中的全部或部分步骤可借助软件加通用硬件平台的方式来实现。基于这样的理解，本申请的技术方案可以以软件产品的形式体现出来，该计算机软件产品可以存储在存储介质中，如只读存储器(英文：read-only memory，ROM)/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者诸如路由器等网络通信设备)执行本申请各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by means of software plus a general hardware platform. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product, and the computer software product can be stored in a storage medium, such as read-only memory (English: read-only memory, ROM)/RAM, magnetic disk, An optical disc, etc., includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the methods described in various embodiments or some parts of the embodiments of the present application.

本说明书中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于装置实施例而言，由于其基本相似于方法实施例，所以描述得比较简单，相关之处参见方法实施例的部分说明即可。以上所描述的装置实施例仅仅是示意性的，其中作为分离部件说明的模块可以是或者也可以不是物理上分开的，作为模块显示的部件可以是或者也可以不是物理模块，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。The various embodiments in this specification are described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for related parts. The device embodiments described above are only illustrative, wherein the modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules, that is, they may be located in one place , or distributed to multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

以上所述仅是本申请示例性的实施方式，并非用于限定本申请的保护范围。The above descriptions are only exemplary embodiments of the present application, and are not intended to limit the protection scope of the present application.

Claims

1. a video editing method, is characterized in that, described method comprises:

Get the raw video to be processed;

Identifying multiple key positions and transition positions in the original video, where the key positions are used to indicate video clips in the original video;

According to the plurality of key positions and the transition positions, a plurality of video clips are obtained by segmenting the original video;

Based on the plurality of video clips, a target video is obtained by splicing, and the playback duration of the target video is shorter than the playback duration of the original video.

2. The method according to claim 1, wherein the identifying the transition position in the original video comprises:

calculating the similarity between the adjacent first frame images and the second frame images in the original video;

When the similarity between the first frame image and the second frame image is less than a preset threshold, the position of the first frame image or the second frame image in the original video is determined as the the transition location.

3. The method according to claim 1, characterized in that, according to the multiple key positions and the transition positions, obtaining multiple video clips from the original video by dividing, comprising:

According to a plurality of key positions in the original video, determine the start division point and the end division point corresponding to the plurality of candidate video segments in the original video;

Determine whether a transition position is included in the multi-frame first video images whose distance from the original video and the starting segmentation point of the target candidate video segment does not exceed a first preset distance, and whether the original video is different from the Whether the distance between the termination points of the target candidate video segment does not exceed the second preset distance includes transition positions in the multi-frame second video images, and the target candidate video segment is any of the plurality of candidate video segments. a candidate video segment;

When the multi-frame first video images include transition positions, and/or, the multi-frame second video images include transition positions, according to the transition positions in the multi-frame first video images and/or Or at the transition positions in the second video images of the multiple frames, the target candidate video segment is obtained by segmenting the original video.

4. The method according to claim 1, wherein the original video is a first type of video, and key positions in the original video are identified by audio features in the original video.

5. The method according to claim 4, wherein the audio content corresponding to the key positions in the original video is laughter and/or applause, and the identifying a plurality of key positions in the original video includes :

Inputting the original video into an artificial intelligence AI model to obtain multiple key positions in the original video output by the AI model, and the AI model is pre-completed with sample videos marked with laughter and/or applause train;

Or, performing voiceprint feature matching between the audio data in the original video and the audio data corresponding to laughter and/or applause to obtain the multiple key positions matched by the voiceprint feature.

6 . The method according to claim 1 , wherein the original video is of the second type, and key positions in the original video are identified by image features in the original video. 7 .

7. The method according to claim 6, wherein the identifying a plurality of key positions in the original video comprises:

determining a plurality of initial key positions from the original video;

The multiple initial key positions are adjusted by using the optical character recognition technology to obtain the multiple key positions, so that each of the key positions is the position where the subtitles start to be displayed or the subtitles end to be displayed.

8. A video editing device, wherein the device comprises:

The acquisition module is used to acquire the raw video to be processed;

a position identification module for identifying multiple key positions and transition positions in the original video, where the key positions are used to indicate video clips in the original video;

A segmentation module, configured to obtain a plurality of video clips by segmentation from the original video according to the multiple key positions and the transition positions;

A splicing module, configured to splicing a target video based on the plurality of video clips, where the playback duration of the target video is shorter than the playback duration of the original video.

9. A computing device, wherein the device comprises a processor and a memory:

the memory is used to store computer programs;

The processor is adapted to perform the method of any one of claims 1-7 according to the computer program.

10. A computer-readable storage medium, wherein the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the method of any one of claims 1-7.