CN119011750A - Method for editing video, related device and computer program product - Google Patents

Method for editing video, related device and computer program product

Info

Publication number
CN119011750A
CN119011750A (application number CN202410993921.6A)
Authority
CN
China
Prior art keywords
video
content
video clips
result
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410993921.6A
Other languages
Chinese (zh)
Inventor
李亘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202410993921.6A priority Critical patent/CN119011750A/en
Publication of CN119011750A publication Critical patent/CN119011750A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G10L 17/00: Speaker identification or verification techniques
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/2343: Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/234336: Processing of video elementary streams involving reformatting operations by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440236: Processing of video elementary streams involving reformatting operations by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85: Assembly of content; Generation of multimedia applications
    • H04N 21/854: Content authoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application provides a video editing method, a related device and a computer program product. A group of video clips corresponding to a target object is extracted from a source video based on voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; content tags corresponding to the respective video clips are determined based on text transcription results of the voice content of the video clips; and at least some target video clips in the group of video clips are ordered according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object. In this way, the quality and efficiency of video editing can be improved.

Description

Method for editing video, related device and computer program product
Technical Field
The present application relates to the field of computer technology, and in particular, to a method and apparatus for editing video, an electronic device, a computer readable medium, and a computer program product.
Background
With the development of computer technology, society is becoming increasingly networked and digital. In this process, video has become an integral part of people's lives. Through mobile internet technology, users can watch videos provided by video platforms or by other users, either online or after downloading them through the video platforms.
In this context, a video may be clipped to highlight its content, so that users can quickly understand the video and obtain its content, which improves users' viewing efficiency and viewing experience. Against this background, how to complete video editing more efficiently and with higher quality, and how to provide clip videos efficiently and with greater value, is a question of great interest and urgency.
Disclosure of Invention
Aspects of the present application provide a method, apparatus, electronic device, computer-readable storage medium, and computer program product for editing video, which can quickly determine a clipping main line and complete the clip with the target object in the video as the clipping reference dimension. Moreover, the arrangement and play order of the video clips can be adjusted based on the content tags of the clipped video clips, so that the clip video presents and expresses the source video content more intuitively, prominently and in closer accordance with the user's needs, improving the user's viewing experience.
In one aspect of the present application, there is provided a method of editing video, comprising: extracting a group of video clips corresponding to a target object from a source video based on voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; determining content tags corresponding to the respective video clips based on text transcription results of the voice content of the video clips; and ordering at least some target video clips in the group of video clips according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object.
In another aspect of the present application, there is provided an apparatus for editing video, comprising: a video clip extraction unit configured to extract a group of video clips corresponding to a target object from a source video based on voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; a content tag determination unit configured to determine content tags corresponding to the respective video clips based on text transcription results of the voice content of the video clips; and a clip video generation unit configured to order at least some target video clips in the group of video clips according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object.
In another aspect of the present application, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of editing video as provided above.
In another aspect of the application, a computer readable storage medium having stored thereon computer program instructions executable by a processor to implement a method of editing video as provided above is provided.
In another aspect of the application, there is provided a computer program product comprising a computer program, the computer program having stored thereon computer program instructions which, when executed by a processor, implement the method of editing video as provided above.
In the scheme provided by the embodiment of the application, a group of video clips corresponding to a target object is extracted from a source video based on voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; content tags corresponding to the respective video clips are determined based on text transcription results of the voice content of the video clips; and at least some target video clips in the group of video clips are ordered according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object. In this way, the clipping main line can be quickly determined and the clip completed with the target object in the video as the clipping reference dimension. Moreover, the arrangement and play order of the video clips can be adjusted based on the content tags of the clipped video clips, so that the clip video presents and expresses the source video content more intuitively, prominently and in closer accordance with the user's needs, improving the user's viewing experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from these drawings by a person skilled in the art without inventive effort.
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic diagram of a video editing process according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a video editing process under an application scene according to another embodiment of the present application;
FIG. 3 is a schematic diagram of an apparatus for editing video according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device suitable for implementing the solution in the embodiment of the present application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In one exemplary configuration of the application, the terminal and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer program instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
As discussed above, how to complete video editing more efficiently and with higher quality, and how to provide clip videos efficiently and with greater value, is a question of great interest and urgency.
In some schemes, video clips rely on manual tagging, or on the results of complete semantic analysis of the video. For example, after the complete semantic content of the video is acquired, "individual events" included in the video are determined, and then editing is performed based on the occurrence positions of the individual events.
In such manners, manual marking not only requires a large amount of labor, but also faces deviations in individual aesthetics and standards, making it difficult to unify the editing criteria. In the manner that relies on complete semantic analysis of the video, a large amount of resources is often consumed by the complete semantic analysis, and the requirements on performance and resource configuration are too high.
In view of the above, the embodiment of the application provides a method for editing video, which extracts a group of video clips corresponding to a target object from a source video based on voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; determines content tags corresponding to the respective video clips based on text transcription results of the voice content of the video clips; and orders at least some target video clips in the group of video clips according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object. In this way, the clipping main line can be quickly determined and the clip completed with the target object in the video as the clipping reference dimension. Moreover, the arrangement and play order of the video clips can be adjusted based on the content tags of the clipped video clips, so that the clip video presents and expresses the source video content more intuitively, prominently and in closer accordance with the user's needs, improving the user's viewing experience.
In an actual scenario, the execution body of the method may be a user device, a device formed by integrating a user device and a network device through a network, or an application running on such a device, where the user device includes, but is not limited to, various terminal devices such as computers, mobile phones, tablet computers, smart watches and bracelets, and the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is composed of a large number of hosts or network servers based on cloud computing, which is a kind of distributed computing: one virtual computer composed of a group of loosely coupled computers.
When the execution subject is software, the execution subject may be installed in the above-listed electronic device, which may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein.
Fig. 1 shows a video editing process 100 according to an embodiment of the present application, where the process 100 includes at least the following processing steps:
S101, extracting a group of video clips corresponding to the target object from the source video based on the voiceprint information of the target object.
In an embodiment of the present application, after the source video is acquired (i.e., the video to be clipped, the video serving as the material of the clip video, which may typically be provided by the service platform side or uploaded by the editing user), each object included in the source video, such as a character object, may be identified. Then, for one target object among them (for example, the candidate objects included in the source video may be presented to the editing user, and the target object is then determined based on the user's selection), clipping is performed to generate the clip video corresponding to that target object.
It should be appreciated that in some scenarios, the execution subject may also execute the clipping process discussed later in this disclosure separately and respectively for multiple (or each) "object" based on different needs, and separately obtain the clipped video respectively corresponding to each object.
After determining the indicated object (i.e., the target object), the execution subject may acquire voiceprint information of the target object. Voiceprint information includes, but is not limited to, frequency, tempo, timbre, volume, pronunciation habits, etc., which combine to form unique sound features for each individual, i.e., "voiceprint information" (or "voiceprint").
The executing body then extracts a group of video clips corresponding to the target object from the source video based on the voiceprint information. Each video clip in the group of video clips corresponds to one instance of voice content of the target object in the source video, or, in other words, a single "speaking behavior" of the target object. For example, if the target object has a single speaking behavior in the video content located from play position X1 to play position X2 (X2 greater than X1) in the source video, the executing body may extract the video content from play position X1 to play position X2 as one video clip.
In some embodiments, to improve the extraction efficiency of the video clips, the executing body may choose to complete the "extraction" directly by marking timestamps at play positions. For example, for a video clip to be extracted, the marking may be completed by adding a timestamp at both the X1 play position and the X2 play position, indicating that this portion of video content is to serve as one video clip.
Typically, the video clips in the group of video clips may be determined sequentially according to the play order of the source video. For example, each time the executing body detects a speaking behavior of the target object while traversing the source video, it may correspondingly perform timestamp marking and recording once.
For example, while traversing the source video content, the executing body may segment and mark one video clip each time it detects the video content from the point where the target object starts speaking to the point where the current speech ends. For example, taking the detected start of the target object's speech as the starting point, the executing body may continuously monitor the speaking state of the target object, and when the duration for which the target object has stopped speaking reaches a preset threshold, determine the current position as the end point and segment one video clip. Then, when the next speech of the target object is detected, a new starting point is determined to segment the next video clip, and this process is repeated until the traversal of the source video is completed.
In this way, multiple video clips can be obtained continuously while avoiding the problems, such as duplicated video clips and frame-cutting errors, that may arise from a single one-shot identification.
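As a hedged illustration of the timestamp-based segmentation described above, the following Python sketch turns per-frame decisions of whether a frame matches the target object's voiceprint into timestamped clips, closing a clip once the target has been silent for longer than a preset threshold. The frame length, the silence threshold and the upstream voiceprint-matching step that produces the input flags are illustrative assumptions, not details fixed by the embodiment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    start: float  # play position where the target object starts speaking (seconds)
    end: float    # play position where the current speech ends (seconds)

def segment_by_voiceprint(match_flags: List[bool],
                          frame_len: float = 0.5,
                          silence_gap: float = 1.0) -> List[Clip]:
    """Turn per-frame voiceprint-match decisions into timestamped clips.

    match_flags[i] is True when frame i of the source audio matches the target
    object's voiceprint (the matching itself is assumed to be done by an
    upstream speaker-verification step). A clip is closed once the target has
    been silent for at least `silence_gap` seconds.
    """
    clips, start, silent = [], None, 0.0
    for i, speaking in enumerate(match_flags):
        t = i * frame_len
        if speaking:
            if start is None:          # speech begins: record the start timestamp
                start = t
            silent = 0.0
        elif start is not None:
            silent += frame_len
            if silent >= silence_gap:  # silence long enough: close the clip
                clips.append(Clip(start, t + frame_len - silent))
                start, silent = None, 0.0
    if start is not None:              # a clip is still open at the end of the video
        clips.append(Clip(start, len(match_flags) * frame_len))
    return clips

# Example: the target speaks in frames 2-5 and again in frames 9-11.
flags = [False, False, True, True, True, True, False, False, False, True, True, True]
print(segment_by_voiceprint(flags))   # [Clip(start=1.0, end=3.0), Clip(start=4.5, end=6.0)]
```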
In some embodiments, to enhance the extraction effect, the executing body may also choose to perform preprocessing operations, such as noise reduction, filtering and amplification, before extracting the video clips, so as to improve the quality of the source video.
S102, determining content tags corresponding to the respective video clips based on text transcription results of the voice content of the video clips.
In an embodiment of the present application, after obtaining the group of video clips in step S101 above, the executing body may determine, for the video clips, the content tags corresponding to each of them based on the text transcription results of the voice content included in each video clip.
In some embodiments, the text transcription result of the voice content of a video clip is obtained as follows: dividing the voice content of the video clip into a plurality of short-time frames; determining the phonemes corresponding to the short-time frames based on the mel-frequency cepstral coefficients of the short-time frames; and determining the text transcription result of the video clip based on the probability distribution of the phonemes and a predetermined contextual word sequence prediction library.
Specifically, after determining the video clip, the executing body splits the voice content (or voice signal) included in the video clip to obtain a plurality of short-time frames of a preset duration.
Then, the executing body extracts the acoustic features of the short-time frames, for example, the mel-frequency cepstral coefficients corresponding to the short-time frames. Using mel-frequency cepstral coefficients as acoustic features describes the audio characteristics perceivable by the human ear and improves the accuracy of the text transcription result of the voice content.
Next, the executing body may feed the extracted acoustic features (e.g., the mel-frequency cepstral coefficients) into a trained acoustic model (e.g., an acoustic model built on a recurrent neural network RNN, a long short-term memory network LSTM, or a Transformer) to map the acoustic features to phoneme or sub-phoneme states in the speech, and, through a decoder, combine the information of the acoustic model with the predetermined contextual word sequence prediction library to decode the phoneme or sub-phoneme state sequence into the most probable word sequence.
The acoustic model provides the probability distribution of the phoneme or sub-phoneme states, and the predetermined contextual word sequence prediction library can be used to predict the word sequence most likely to occur in a given context based on a large amount of text data (for example, the prediction process is implemented by a language model that uses the predetermined contextual word sequence prediction library). Finally, after the video clip is processed by the decoder, the text transcription result of the video clip is obtained.
Alternatively or additionally, after the word sequence of the text transcription result is obtained, result optimization may also be performed, for example by removing unwanted punctuation marks or converting numbers into text form, to improve the quality of the transcription result.
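A minimal sketch of the transcription front end described above, assuming the librosa package is available for framing and MFCC extraction. The trained acoustic model and the decoder backed by the contextual word sequence prediction library are left as explicit placeholders, since the embodiment does not fix a concrete implementation for them.

```python
import librosa
import numpy as np

def extract_mfcc(audio_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load a clip's audio, split it into ~25 ms short-time frames and return MFCCs."""
    y, sr = librosa.load(audio_path, sr=16000)            # mono audio at 16 kHz
    # n_fft=400 (~25 ms) and hop_length=160 (~10 ms) define the short-time frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=400, hop_length=160)
    return mfcc.T                                          # shape: (frames, n_mfcc)

def acoustic_model(features: np.ndarray) -> np.ndarray:
    """Placeholder for the trained RNN/LSTM/Transformer acoustic model that maps
    acoustic features to a per-frame probability distribution over phoneme states."""
    raise NotImplementedError("plug in a trained acoustic model here")

def decode(phoneme_probs: np.ndarray) -> str:
    """Placeholder for the decoder that combines the phoneme probabilities with the
    predetermined contextual word sequence prediction library (a language model)."""
    raise NotImplementedError("plug in a decoder / language model here")

def transcribe(audio_path: str) -> str:
    features = extract_mfcc(audio_path)
    return decode(acoustic_model(features))
```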
In some embodiments, the content tag includes at least one of: a content summary tag, a content emotion tag, a content intent tag.
The content summary tag may be determined based on a "content summary" of the text transcription result; for example, the text transcription result may be processed by a semantic recognition model or the like to obtain a "summary" or condensed version of the text transcription result. Accordingly, the executing body may take this "summary" or condensed content as the "content summary tag" of the video clip.
The content emotion tag may be determined based on emotion analysis of the text transcription result; for example, the executing body may likewise use a pre-trained model to determine the emotion polarity (e.g., happy, calm, angry, etc.) of the text transcription result based on analysis of the content semantics and/or auxiliary information such as modal particles and symbols. Accordingly, the executing body may take the "emotion polarity" as the "content emotion tag" of the video clip.
In some embodiments, to better analyze the "emotion polarity", the executing body may, in addition to the text recognition result, also refer to the voice content corresponding to the video clip (e.g., the tone and rhythm of the expression in the voice content), thereby analyzing the "emotion polarity" more accurately.
The content intent tag may generally be determined based on the dialogue content between the target object and other objects; for example, the executing body may "infer" the logical relationships between the speakers, their potential intents, and information not directly expressed, based on the interaction process between the parties in the dialogue. Accordingly, the executing body takes the result of this "inference" as the "content intent tag" of the video clip.
In this way, the executing body can use the content tags to provide information assistance, characterizing the information of the video clip in the respective analysis dimensions (e.g., at least one of content, emotion and intent) conveniently and efficiently.
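As an informal sketch only, one way to carry the three kinds of tags alongside a clip is a small record type; the field names and the example values below are illustrative assumptions rather than structures defined by the embodiment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContentTag:
    summary: str                   # content summary tag
    emotion: Optional[str] = None  # content emotion tag, e.g. "happy", "calm", "angry"
    intent: Optional[str] = None   # content intent tag inferred from the dialogue

# Hypothetical tag for one clip of the target object's speech.
tag = ContentTag(summary="introduces the new product line",
                 emotion="happy",
                 intent="persuade the audience to try the product")
```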
In some embodiments, where the video clips are collected sequentially based on the play order of the source video, the executing body may also, after obtaining the content tags, choose to compare the content tags of sequentially adjacent extracted video clips (i.e., two video clips that are "adjacent" in the play-time order of the source video) and merge adjacent clips, in order to avoid over-fragmentation of the video splitting caused by, for example, an overly fine splitting granularity. For example, a video clip A and the adjacent video clip B may together constitute the video content corresponding to one complete utterance of the target object, but be "undesirably" split into two video clips because the source video inserts some special effect between them. Accordingly, for ease of understanding, two video clips whose content-tag similarity satisfies a preset requirement may be described as "similar video clips".
In particular, the executing body may adopt a matching approach, comparing whether the respective tags of the two video clips satisfy the similarity requirement (e.g., whether the tags are identical), to determine whether the two should be merged. For example, if the content summary tags, content emotion tags and content intent tags are all identical, the executing body may choose to merge the two to reduce "fragmentation".
In some embodiments, in order to reduce the difficulty and burden of setting the tags, the preset requirement for similarity may be defined as the "semantic similarity" of the two tags meeting a preset criterion (for example, the similarity being greater than or equal to a predetermined similarity threshold). For example, for the content summary tag, the executing body may, after determining the semantics of the content summary tags corresponding to the two video clips, determine whether the two are substantially the same based on semantic comparison.
Similarly, where there are multiple different types of content tags, the executing body may map the matching or comparison result of each category (e.g., content summary tag, content emotion tag, content intent tag) to a score, and define the preset requirement as the sum of the scores being greater than or equal to a predetermined score threshold. That is, the executing body may determine that the similarity satisfies the preset requirement when the summed result is greater than or equal to the predetermined score threshold, and merge the two similar video clips.
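Reusing the Clip and ContentTag sketches above, the following hedged example shows the score-based merging just described: each tag category contributes one point when it matches, and playback-adjacent clips are merged while the summed score stays at or above the threshold. The per-category weights and the threshold value are assumptions, not values fixed by the embodiment.

```python
from typing import List, Tuple

def tag_match_score(a: ContentTag, b: ContentTag) -> int:
    """Map the per-category comparison results to a summed score."""
    return (int(a.summary == b.summary)
            + int(a.emotion == b.emotion)
            + int(a.intent == b.intent))

def merge_adjacent(clips: List[Clip], tags: List[ContentTag],
                   threshold: int = 2) -> Tuple[List[Clip], List[ContentTag]]:
    """Merge playback-adjacent clips whose tag similarity meets the preset requirement."""
    if not clips:
        return [], []
    merged_clips, merged_tags = [clips[0]], [tags[0]]
    for clip, tag in zip(clips[1:], tags[1:]):
        if tag_match_score(merged_tags[-1], tag) >= threshold:
            # extend the previous clip instead of keeping a fragmented pair
            merged_clips[-1] = Clip(merged_clips[-1].start, clip.end)
        else:
            merged_clips.append(clip)
            merged_tags.append(tag)
    return merged_clips, merged_tags
```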
S103, ordering at least some target video clips in the group of video clips according to the content tags to obtain an ordering result.
In an embodiment of the present application, after obtaining the content tags of the video clips in step S102 above, the executing body may select at least some video clips from the group of video clips as target video clips based on the content tags, and then compose the clip video by ordering the target video clips. In other words, after completing the ordering of the selected target video clips, the executing body may use the ordering result of the target video clips as the clip video for the target object.
It should be appreciated that, based on differences in scenarios and requirements, the executing body may take all of the video clips in the group as target video clips, or take only some of the video clips in the group as target video clips, to generate the clip video.
For example, the executing body may select the target video clips based on a preset filtering rule, for example composing the clip video from video clips whose emotion polarity is the target polarity, or composing the clip video from video clips whose content intent is "YY".
In general, the execution body may determine a play order based on the play order of video clips in the source video, and arrange the video clips based on the play order to compose and generate clip videos.
For example, where the group of video clips contains video clips E, F, G and H extracted sequentially from the source video according to the play order, the executing body may select video clips E, F and G as target video clips to obtain the clip video (for example, video clips E, F and G may be played sequentially in the clip video), as sketched below.
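A hedged sketch of such rule-based selection, again reusing the Clip and ContentTag sketches above: clips whose emotion tag matches a target polarity are kept and arranged in source play order. The filtering rule shown is only one possible example of a preset rule.

```python
from typing import List

def build_clip_video(clips: List[Clip], tags: List[ContentTag],
                     target_emotion: str) -> List[Clip]:
    """Select target video clips by a preset filtering rule and keep source play order."""
    selected = [(c, t) for c, t in zip(clips, tags) if t.emotion == target_emotion]
    selected.sort(key=lambda pair: pair[0].start)   # play order in the source video
    return [c for c, _ in selected]
```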
In summary, the method of editing video extracts a group of video clips corresponding to the target object from the source video based on the voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; determines content tags corresponding to the respective video clips based on text transcription results of the voice content of the video clips; and orders at least some target video clips in the group of video clips according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object. In this way, the clipping main line can be quickly determined and the clip completed with the target object in the video as the clipping reference dimension, and the arrangement and play order of the video clips can be adjusted based on the content tags of the clipped video clips, so that the clip video presents and expresses the source video content more intuitively, prominently and in closer accordance with the user's needs, improving the user's viewing experience.
In some embodiments, the executing body may also provide clip videos with different viewing effects by adjusting the manner in which the video clips are combined (or, in other words, the play order of the respective clips). For example, in some scenarios, for the example above, the executing body may choose to combine the video clips in the order F, G, E to obtain the clip video.
For ease of understanding, the arrangement or strategy referred to by the executing body when determining how to combine the video clips may be described as a "storyline". That is, the storyline may indicate, in order, which "content tags" each video clip to be inserted should have. For example, the storyline L may indicate that a video clip having the content summary tag "X" is inserted at the first position, and a video clip having the content summary tag "X" and the content emotion tag "Y" is inserted at the second position.
Accordingly, the executing body can meet different arrangement requirements based on different "storylines". For example, content "with a happy emotion polarity" is placed first, followed by content "with a calm emotion polarity", so that the dialogue content in which the target object is in a "happy state" is presented preferentially in the clip video; a sketch of such storyline-driven ordering follows.
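A minimal sketch of storyline-driven ordering under the assumptions above, reusing the ContentTag and Clip sketches: a storyline is modelled as an ordered list of predicates over ContentTag, and for each position the first unused clip whose tags satisfy that position's predicate is inserted. Representing a storyline as predicates is an illustrative choice, not one mandated by the embodiment.

```python
from typing import Callable, List

Storyline = List[Callable[[ContentTag], bool]]

def order_by_storyline(clips: List[Clip], tags: List[ContentTag],
                       storyline: Storyline) -> List[Clip]:
    """Place, at each storyline position, the first unused clip whose tags match."""
    ordered, used = [], set()
    for wanted in storyline:
        for i, (clip, tag) in enumerate(zip(clips, tags)):
            if i not in used and wanted(tag):
                ordered.append(clip)
                used.add(i)
                break
    return ordered

# e.g. put a clip with a "happy" emotion polarity first, then a "calm" one
storyline: Storyline = [lambda t: t.emotion == "happy",
                        lambda t: t.emotion == "calm"]
```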
In some embodiments, after determining the content tags corresponding to each video clip, the executing body may present the content tags to the editing user (for example, through a display component that interacts locally with the editing user, or through the display component of the terminal device used by the editing user after communicating with that terminal device via a pre-configured communication link). Accordingly, the user may personalize a "storyline" (referred to as the first storyline for ease of description) based on these content tags. For example, the editing user may select the content tags to be used and the ordering of those content tags according to his or her own wishes and needs, and return the content tags and their ordering to the executing body as the first storyline.
Then, in response to receiving the first storyline for the content tags, the executing body accordingly orders at least some target video clips in the group of video clips (i.e., the video clips bearing the content tags selected by the editing user) based on the reference ordering result of the content tags indicated by the first storyline (e.g., referring to the content tag indicated at each position in the first storyline and inserting a video clip with the corresponding content tag), generates the ordering result, and obtains the clip video (i.e., in the process of executing S103 above, the ordering result is generated and the clip video is obtained in this manner).
By the method, editing operation can be completed for editing users through selection and arrangement of content labels, and operation efficiency of the editing users is improved. And in such a way, the executing entity can generate the clip video by referring to the clip requirement of the editing user, so that the quality of the clip video is improved.
In some embodiments, the executing body may further reduce the effort required of the editing user by presetting storylines. For example, the executing body may select in advance some historical clip videos for reference by analyzing other historical clip videos whose clipping has been completed (for example, historical clip videos for reference may be determined by criteria such as the play popularity of the clip video being higher than a preset popularity threshold, or the degree of preference matching between the clip video and the editing user being greater than or equal to a preset matching threshold).
Then, based on an analysis of the content tags employed by these historical clip videos for reference and the reference ordering results of those content tags (e.g., by backtracking the clipping process of a historical clip video for reference, or by reading the content tags of the video clips it contains), the executing body "identifies" the storyline used when the historical clip video for reference was clipped (described as a second storyline for convenience of description).
The executing body may then store and maintain a set of at least one second storyline locally, and provide it to the editing user when clipping is performed. It should be appreciated that the second storylines in the set of second storylines correspond to different ordering results of the content tags, so as to avoid confusing the user or increasing the user's burden by providing "duplicate second storylines".
Further, when executing S103 above, the executing body may alternatively, in response to receiving a selection result for a target second storyline in the set of second storylines, accordingly order at least some target video clips in the group of video clips based on the reference ordering result of the content tags indicated by the target second storyline, to obtain the ordering result, i.e., the clip video. That is, after the editing user selects the target second storyline to be used, the executing body may generate the clip video according to the reference ordering result of the content tags indicated by the target second storyline. In this way, the editing user can complete the clip based on a second storyline that has been used historically, further reducing the user's effort.
In some scenarios, depending on the content tags and orderings involved in the second storylines, the executing body may also add identification information indicating what "content" the different second storylines are intended to highlight, to provide a reference: for example, prioritizing emotion, prioritizing the dramatic "climax" (which may be selected based on how close the content semantics of the video clips are to the subject semantics of the source video), or an emotionally more natural "climax part" (a gradual progression with no significant abrupt change in emotion polarity), and so on.
In some embodiments, for situations where the second storylines provided by the executing body may not meet the editing user's clipping needs, or where the executing body does not provide second storylines, the editing user may also provide a clip video "for reference" according to his or her own needs. For convenience of description, a clip video provided by the editing user for reference may be referred to as a "reference clip video".
For example, the editing user may use a clip video he or she previously clipped as a reference clip video to provide a "storyline", or may provide to the executing body, as the reference clip video, a (historical) clip video that he or she considers to "satisfy his or her own needs".
Accordingly, the execution body determines a corresponding story line (which may be referred to as a third story line for convenience of description) based on a recognition result of the arrangement order of content tags used in the reference clip video (similar to the manner of determining the second story line described above, and not repeated here) in response to receiving the reference clip video provided by the editing user.
Then, when executing S103, the executing body may alternatively order at least some target video clips in the group of video clips accordingly, based on the reference ordering result of the content tags indicated by the third storyline, to obtain the ordering result and generate the clip video. In this way, the user can instruct the executing body to perform the video clipping by means of the reference clip video, providing the desired "storyline" to the executing body at low cost and reducing the user's effort.
On the basis of any of the above embodiments, when executing S102 above, the executing body may alternatively or additionally choose to use a large language model to determine the content tags corresponding to the respective video clips based on the text transcription results of the voice content of the video clips. This improves the execution performance, so that the content tags are determined more accurately, efficiently and with higher quality.
A large language model (Large Language Model, LLM for short) is an artificial intelligence model that aims to understand and generate human language. Moreover, an LLM can perform processing operations based on what it understands and obtain corresponding processing results.
LLMs are characterized by their large scale and can often include a very large number of parameters, which help them learn complex patterns in language data. These models are typically based on deep learning architectures, such as Transformers, which help them achieve better processing performance on various NLP tasks.
As discussed above, in embodiments of the present application, the executing body may also choose to use an LLM to improve processing performance. For example, when determining the content tags, an LLM may be used to determine the content tags corresponding to the respective video clips. For example, the LLM may be trained in advance so that it has the ability to determine the content tag of each video clip based on the text transcription result of the voice content of the video clip. In this case, the executing body may, based on pre-configured guide words, guide tags or the like, instruct the LLM to determine the content tags corresponding to the respective video clips based on the text transcription results of the voice content of the video clips. For example, the guide word may be: "Please generate a short summary for each video clip, summarizing the core information, key views or decisions. The summary should be succinct, highlighting the most informative part of the clip", thereby generating the content summary tags using the LLM.
Similarly, the guide word may also be "Please analyze the emotion changes of a particular person in the following dialogue content, and summarize the dominant emotion in each dialogue with several keywords. Pay attention to the fluctuations of emotion, the transitions of mood, and the key events that may affect emotion", for generating the content emotion tags. As another example, the guide word may be "Analyze the dialogue in depth, understanding the logical relationships between the speakers, their potential intents, and information not directly expressed. Please point out the key turning points in the dialogue and the implicit meanings, and explain how they affect the course of the dialogue", for generating the content intent tags.
In some embodiments, the LLM may be given a default configuration that allows the "guide words" to be omitted. That is, for the purpose of having the LLM determine the content tags corresponding to the video clips based on the text transcription results of their voice content, the LLM may be configured by default to play the same role as the pre-configured guide words, guide tags and the like, and then determine the content tags directly on that basis. In this way, the default configuration enables the LLM to stably and directionally determine the content tags corresponding to the video clips based on the text transcription results of the voice content of the video clips.
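The sketch below shows one hedged way to drive an LLM with the guide words quoted above, assuming an OpenAI-compatible chat-completions endpoint is available; the model name and the exact prompt wording are illustrative assumptions, and any other LLM interface could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

SUMMARY_GUIDE = ("Please generate a short summary for each video clip, summarizing "
                 "the core information, key views or decisions. The summary should be "
                 "succinct, highlighting the most informative part of the clip.")

def summary_tag(transcript: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM for a content summary tag of one clip's text transcription result."""
    response = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[{"role": "system", "content": SUMMARY_GUIDE},
                  {"role": "user", "content": transcript}],
    )
    return response.choices[0].message.content.strip()
```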
For deepening understanding, the application also provides a specific implementation scheme in combination with a specific application scene. Referring to fig. 2, fig. 2 is a schematic diagram illustrating a video editing process 200 under an application scenario according to another embodiment of the application. In process 200, a clip video 260 is produced by an executing subject for a target object 215 in a source video 210. The process 200 is specifically as follows:
the execution body first extracts a set of video clips 220 corresponding to the target object 215 from the source video 210 based on the voiceprint information of the target object 215. Illustratively, video clips 221 through 22N are included in a set of video clips 220, where N is a positive integer.
Next, the executing body performs text transcription on the voice content in video clips 221 to 22N, and correspondingly obtains the text transcription result corresponding to each video clip. For example, as shown in FIG. 2, the text transcription results corresponding to video clips 221 to 22N are text transcription results 231 to 23N (for example, the text transcription result corresponding to video clip 221 is text transcription result 231). Moreover, the executing body processes text transcription results 231 to 23N using the LLM 240 to obtain a set of content tags 250. The set of content tags 250 includes content tags 251 to 25N corresponding to video clips 221 to 22N respectively (for example, the content tag corresponding to video clip 221 is content tag 251).
Finally, the executing entity uses at least some of the target video segments in the set of video segments 220, and content tags corresponding to those target video segments in the set of content tags 250 (i.e., content tags 251 through 25N), to arrive at a final ranking result (i.e., to arrive at the clip video 260).
The embodiment of the application further provides an apparatus for editing video, whose structure is shown as apparatus 300 in FIG. 3. The apparatus 300 includes: a video clip extraction unit 310 configured to extract a group of video clips corresponding to the target object from the source video based on voiceprint information of the target object, wherein each video clip in the group of video clips corresponds to one instance of voice content uttered by the target object in the source video; a content tag determination unit 320 configured to determine content tags corresponding to the respective video clips based on text transcription results of the voice content of the video clips; and a clip video generation unit 330 configured to order at least some target video clips in the group of video clips according to the content tags to obtain an ordering result, wherein the ordering result is the clip video for the target object.
In some embodiments, the content tag includes at least one of: a content summary tag, a content emotion tag, a content intent tag.
In some embodiments, the video clips are collected sequentially based on the play order of the source video, and the apparatus 300 further includes: a video clip merging unit configured to merge two extracted similar video clips that are adjacent in order based on the matching result of the content tags, wherein the similarity of the content tags corresponding to the two similar video clips satisfies a preset requirement.
In some embodiments, the apparatus 300 further comprises: a content tag presentation unit configured to present a content tag to an editing user; and clip video generation unit 330 is further configured to, in response to receiving the first storyline for the content tags, sort at least some of the target video clips in the set of video clips accordingly based on the reference sort result of the content tags indicated by the first storyline, resulting in a sort result.
In some embodiments, the apparatus 300 further comprises: a story line presentation unit configured to present a set of second story lines to the editing user, wherein the second story lines in the set of second story lines correspond to different ordering results of the content tags, respectively; and clip video generation unit 330 is further configured to, in response to receiving a selection result for a target second storyline of the set of second storylines, sort at least some of the target video clips of the set of video clips accordingly based on the reference sort result of the content tags indicated by the target second storyline, resulting in a sort result.
In some embodiments, the apparatus 300 further comprises: a story line determination unit configured to determine a third story line based on a result of identifying an arrangement order of content tags used in a reference clip video in response to receiving the reference clip video provided by the editing user; and clip video generation unit 330 is further configured to sort at least some of the target video clips in the set of video clips accordingly based on the reference sorting result of the content tags indicated by the third storyline, resulting in a sorting result.
In some embodiments, the text transcription result of the voice content of a video clip is obtained as follows: dividing the voice content of the video clip into a plurality of short-time frames; determining the phonemes corresponding to the short-time frames based on the mel-frequency cepstral coefficients of the short-time frames; and determining the text transcription result of the video clip based on the probability distribution of the phonemes and a predetermined contextual word sequence prediction library.
In some embodiments, the content tag determination unit 320 is further configured to determine content tags corresponding to the video clips, respectively, based on text transcription results of the voice content of the video clips using the large language model.
Based on the same inventive concept, there is also provided an electronic device, a readable storage medium, and a computer program product in the embodiments of the present application, and a corresponding method of the electronic device may be the method of editing video in the foregoing embodiments, and the principle of solving the problem is similar to that of the method. The electronic equipment provided by the embodiment of the application comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods and/or aspects of the various embodiments of the application described above.
The electronic device may be a user device, a device formed by integrating a user device and a network device through a network, or an application running on such a device, where the user device includes, but is not limited to, various terminal devices such as computers, mobile phones, tablet computers, smart watches and bracelets, and the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing; the network device may, for example, implement part of the processing functions when an alarm clock is set. Here, the cloud is composed of a large number of hosts or network servers based on cloud computing, which is a kind of distributed computing: one virtual computer composed of a group of loosely coupled computers.
Fig. 4 shows the structure of an electronic device suitable for implementing the method and/or technical solution in the embodiment of the present application. The electronic device 400 includes a central processing unit (CPU) 401, which can perform various suitable actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage portion 408 into a random access memory (RAM) 403. In the RAM 403, various programs and data required for system operation are also stored. The CPU 401, the ROM 402 and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input portion 406 including a keyboard, a mouse, a touch panel, a microphone, an infrared sensor, and the like; an output portion 407 including a display such as a cathode-ray tube (CRT), a liquid-crystal display (LCD), an LED display or an OLED display, and a speaker; a storage portion 408 comprising one or more computer-readable media such as a hard disk, an optical disk, a magnetic disk or a semiconductor memory; and a communication portion 409 including a network interface card such as a LAN (local area network) card or a modem. The communication portion 409 performs communication processing via a network such as the Internet.
In particular, the methods and/or embodiments of the present application may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 401.
Still further embodiments of the present application provide a computer readable storage medium and a computer program product having computer program instructions stored thereon that are executable by a processor to implement the methods and/or aspects of any one or more of the embodiments of the present application described above.
In particular, the present embodiments may employ any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional modules or units in the embodiments of the present application may be integrated into one processing module or unit, each module or unit may exist alone physically, or two or more modules or units may be integrated into one module or unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional module or unit.
When implemented in the form of software functional modules or units, the integrated modules or units described above may be stored in a computer-readable storage medium. The software functional modules or units are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application and are not limiting. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, and the like are used to denote names and do not indicate any particular order.

Claims (12)

1. A method of editing video, comprising:
extracting a group of video clips corresponding to a target object from a source video based on voiceprint information of the target object, wherein the video clips in the group of video clips correspond to primary voice content uttered by the target object in the source video;
determining content tags respectively corresponding to the video clips based on text transcription results of the voice content of the video clips; and
sorting at least some target video clips in the group of video clips according to the content tags to obtain a sorting result, wherein the sorting result is a clipped video for the target object.
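For ease of understanding only, and without limiting the claim, the overall flow recited above can be sketched as the following Python pipeline. The callables extract_by_voiceprint, transcribe, tag_content, and sort_by_tags are hypothetical placeholders for the voiceprint-based extraction, speech-to-text, tag-determination, and sorting steps; they are assumptions made for illustration, not part of the claimed subject matter.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class VideoClip:
    start: float                                   # start time in the source video (seconds)
    end: float                                     # end time in the source video (seconds)
    transcript: str = ""                           # text transcription of the clip's voice content
    tags: List[str] = field(default_factory=list)  # content tags derived from the transcript

def make_clipped_video(source_video: str,
                       target_voiceprint,
                       extract_by_voiceprint: Callable,
                       transcribe: Callable,
                       tag_content: Callable,
                       sort_by_tags: Callable) -> List[VideoClip]:
    # 1. Extract the clips in which the target object's voiceprint is detected.
    clips: List[VideoClip] = extract_by_voiceprint(source_video, target_voiceprint)

    # 2. Transcribe each clip's voice content and derive content tags from the text.
    for clip in clips:
        clip.transcript = transcribe(source_video, clip.start, clip.end)
        clip.tags = tag_content(clip.transcript)

    # 3. Sort (a subset of) the clips by their content tags; the sorted sequence
    #    is the clipped video for the target object.
    return sort_by_tags(clips)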
2. The method of claim 1, wherein the content tags comprise at least one of:
a content summary tag, a content emotion tag, or a content intention tag.
3. The method of claim 2, wherein the video clips are extracted sequentially based on a playback order of the source video, the method further comprising:
merging two sequentially adjacent similar video clips based on a matching result of the content tags, wherein the similarity between the content tags corresponding to the two similar video clips meets a preset requirement.
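A minimal sketch of the adjacent-clip merging described in claim 3, assuming Jaccard similarity over tag sets and a fixed threshold as one possible reading of the "preset requirement"; both choices are illustrative assumptions rather than limitations of the claim.

from typing import List, Set, Tuple

Clip = Tuple[float, float, Set[str]]  # (start, end, content tags), ordered by playback time

def jaccard(a: Set[str], b: Set[str]) -> float:
    # Jaccard similarity between two tag sets; defined as 0.0 when both are empty.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def merge_adjacent_similar(clips: List[Clip], threshold: float = 0.6) -> List[Clip]:
    # Merge sequentially adjacent clips whose content tags are similar enough.
    merged: List[Clip] = []
    for clip in clips:
        if merged and jaccard(merged[-1][2], clip[2]) >= threshold:
            prev = merged.pop()
            # Combine the time spans and union the tags of the two similar clips.
            merged.append((prev[0], clip[1], prev[2] | clip[2]))
        else:
            merged.append(clip)
    return merged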
4. The method of claim 1, further comprising:
presenting the content tags to an editing user; and
the step of sorting at least some target video clips in the group of video clips according to the content tags to obtain a sorting result comprises:
in response to receiving a first storyline for the content tags, sorting at least some target video clips in the group of video clips based on a reference sorting result of the content tags indicated by the first storyline, to obtain the sorting result.
5. The method of claim 1, further comprising:
presenting a set of second storylines to the editing user, wherein the second storylines in the set respectively correspond to different sorting results of the content tags; and
the step of sorting at least some target video clips in the group of video clips according to the content tags to obtain a sorting result comprises:
in response to receiving a selection result for a target second storyline in the set of second storylines, sorting at least some target video clips in the group of video clips based on a reference sorting result of the content tags indicated by the target second storyline, to obtain the sorting result.
6. The method of claim 1, further comprising:
in response to receiving a reference clipped video provided by an editing user, determining a third storyline based on a result of identifying the arrangement order of content tags used in the reference clipped video; and
the step of sorting at least some target video clips in the group of video clips according to the content tags to obtain a sorting result comprises:
sorting at least some target video clips in the group of video clips based on the reference sorting result of the content tags indicated by the third storyline, to obtain the sorting result.
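Claims 4 to 6 each sort the extracted clips according to a reference ordering of content tags given by a storyline, whether supplied directly, selected from a presented set, or identified from a reference clipped video. A minimal sketch of that shared sorting step is given below; the storyline values and the rule of ranking each clip by its earliest-occurring storyline tag are illustrative assumptions.

from typing import Dict, List

def sort_by_storyline(clips: List[Dict], storyline: List[str]) -> List[Dict]:
    # Order clips so they follow the tag order given by the storyline; clips whose
    # tags do not appear in the storyline are placed at the end.
    rank = {tag: i for i, tag in enumerate(storyline)}

    def clip_rank(clip: Dict) -> int:
        positions = [rank[t] for t in clip.get("tags", []) if t in rank]
        return min(positions) if positions else len(storyline)

    return sorted(clips, key=clip_rank)

# Usage with a hypothetical "setup -> conflict -> resolution" storyline:
clips = [
    {"id": 2, "tags": ["resolution"]},
    {"id": 0, "tags": ["setup", "greeting"]},
    {"id": 1, "tags": ["conflict"]},
]
print([c["id"] for c in sort_by_storyline(clips, ["setup", "conflict", "resolution"])])  # [0, 1, 2]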
7. The method of claim 1, wherein the text transcription result of the voice content of the video clip is obtained by:
dividing the voice content of the video clip into a plurality of short-time frames;
determining phonemes corresponding to the short-time frames based on Mel-frequency cepstral coefficients of the short-time frames; and
determining the text transcription result of the video clip based on the probability distribution of the phonemes and a predetermined contextual word-sequence prediction library.
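The transcription steps of claim 7 (framing, Mel-frequency cepstral coefficients, phoneme posteriors, word-sequence decoding) can be sketched roughly as follows. librosa is used here only as one common way to compute MFCCs, and acoustic_model and decode_with_lm are hypothetical stand-ins for the phoneme classifier and the contextual word-sequence prediction library; none of these choices is asserted to be the claimed implementation.

import librosa

def transcribe_clip(audio_path: str, acoustic_model, decode_with_lm,
                    frame_ms: float = 25.0, hop_ms: float = 10.0) -> str:
    y, sr = librosa.load(audio_path, sr=16000)        # mono audio resampled to 16 kHz
    n_fft = int(sr * frame_ms / 1000)                 # ~25 ms short-time frames
    hop = int(sr * hop_ms / 1000)                     # ~10 ms hop between frames

    # Mel-frequency cepstral coefficients, one column per short-time frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    frames = mfcc.T                                   # shape: (num_frames, 13)

    posteriors = acoustic_model(frames)               # hypothetical: per-frame phoneme probabilities
    return decode_with_lm(posteriors)                 # hypothetical: text via word-sequence prediction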
8. The method of any one of claims 1 to 7, wherein the determining content tags respectively corresponding to the video clips based on text transcription results of the voice content of the video clips comprises:
determining, by using a large language model, the content tags respectively corresponding to the video clips based on the text transcription results of the voice content of the video clips.
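Claim 8 delegates tag determination to a large language model. A minimal sketch of that step is shown below; call_llm stands in for whichever model API is actually used, and the prompt wording and the JSON tag schema (summary, emotion, intention, mirroring claim 2) are illustrative assumptions.

import json
from typing import Callable, Dict

PROMPT_TEMPLATE = (
    "You are labelling a video clip by its transcript.\n"
    "Return a JSON object with keys: summary_tag, emotion_tag, intention_tag.\n"
    "Transcript:\n{transcript}"
)

def tags_from_transcript(transcript: str, call_llm: Callable[[str], str]) -> Dict[str, str]:
    # call_llm is a hypothetical function that sends the prompt to a large language
    # model and returns its raw text response.
    response = call_llm(PROMPT_TEMPLATE.format(transcript=transcript))
    try:
        return json.loads(response)  # expected: {"summary_tag": ..., "emotion_tag": ..., "intention_tag": ...}
    except json.JSONDecodeError:
        # Fall back to empty tags if the model's output is not valid JSON.
        return {"summary_tag": "", "emotion_tag": "", "intention_tag": ""}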
9. An apparatus for editing video, comprising:
a video clip extraction unit configured to extract a group of video clips corresponding to a target object from a source video based on voiceprint information of the target object, wherein the video clips in the group of video clips correspond to primary voice content uttered by the target object in the source video;
a content tag determination unit configured to determine content tags respectively corresponding to the video clips based on text transcription results of the voice content of the video clips; and
a clipped video generation unit configured to sort at least some target video clips in the group of video clips according to the content tags to obtain a sorting result, wherein the sorting result is the clipped video for the target object.
10. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
11. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
CN202410993921.6A 2024-07-23 2024-07-23 Method for editing video, related device and computer program product Pending CN119011750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410993921.6A CN119011750A (en) 2024-07-23 2024-07-23 Method for editing video, related device and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410993921.6A CN119011750A (en) 2024-07-23 2024-07-23 Method for editing video, related device and computer program product

Publications (1)

Publication Number Publication Date
CN119011750A true CN119011750A (en) 2024-11-22

Family

ID=93471345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410993921.6A Pending CN119011750A (en) 2024-07-23 2024-07-23 Method for editing video, related device and computer program product

Country Status (1)

Country Link
CN (1) CN119011750A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120416603A (en) * 2025-04-30 2025-08-01 北京合聚汇通电子商务有限公司 An intelligent video editing method integrating facial features and vocal features

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
CN113852858A (en) * 2021-08-19 2021-12-28 阿里巴巴(中国)有限公司 Video processing method and electronic equipment
CN114339451A (en) * 2021-12-31 2022-04-12 上海爱奇艺新媒体科技有限公司 Video editing method and device, computing equipment and storage medium
CN115052201A (en) * 2022-05-17 2022-09-13 阿里巴巴(中国)有限公司 Video editing method and electronic equipment
CN116072121A (en) * 2023-02-23 2023-05-05 成都东方盛行电子有限责任公司 A method of dismantling news articles based on voiceprint detection
CN117176981A (en) * 2023-07-25 2023-12-05 特赞(上海)信息科技有限公司 Mixed cut video generation method and device, computer equipment and medium


Similar Documents

Publication Publication Date Title
US10891928B2 (en) Automatic song generation
EP3469592B1 (en) Emotional text-to-speech learning system
RU2571608C2 (en) Creating notes using voice stream
Forbes-Riley et al. Predicting emotion in spoken dialogue from multiple knowledge sources
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
WO2018200268A1 (en) Automatic song generation
US9412359B2 (en) System and method for cloud-based text-to-speech web services
EP3593346B1 (en) Graphical data selection and presentation of digital content
US20220121712A1 (en) Interactive representation of content for relevance detection and review
Dahmani et al. Natural Arabic language resources for emotion recognition in Algerian dialect
US12002460B2 (en) Information processing device, information processing system, and information processing method, and program
CN113761268B (en) Audio program content playback control method, device, equipment and storage medium
US20200151220A1 (en) Interactive representation of content for relevance detection and review
CN113516963B (en) Audio data generation method and device, server and intelligent sound box
CN118338072A (en) Video editing method, device, equipment, medium and product based on large model
CN119011750A (en) Method for editing video, related device and computer program product
US11176943B2 (en) Voice recognition device, voice recognition method, and computer program product
Luo et al. AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
CN120708619A (en) A meeting minutes summary method based on AI big model
KR20230120390A (en) Apparatus and method for recommending music based on text sentiment analysis
US11823671B1 (en) Architecture for context-augmented word embedding
WO2026023086A1 (en) Information processing system, information processing method, and program
CN119128133A (en) Abstract generation method, device, equipment and storage medium based on multimodal information
JP7166370B2 (en) Methods, systems, and computer readable recording media for improving speech recognition rates for audio recordings
CN117690413A (en) Audio processing method, apparatus, device, medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination