
CN114302174B - Video editing method, device, computing equipment and storage medium - Google Patents

Video editing method, device, computing equipment and storage medium

Info

Publication number
CN114302174B
Authority
CN
China
Prior art keywords
video
original video
original
frame
transition
Prior art date
Legal status
Active
Application number
CN202111679091.2A
Other languages
Chinese (zh)
Other versions
CN114302174A (en)
Inventor
张云栋
刘程
Current Assignee
Shanghai IQIYI New Media Technology Co Ltd
Original Assignee
Shanghai IQIYI New Media Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai IQIYI New Media Technology Co Ltd filed Critical Shanghai IQIYI New Media Technology Co Ltd
Priority to CN202111679091.2A priority Critical patent/CN114302174B/en
Publication of CN114302174A publication Critical patent/CN114302174A/en
Application granted granted Critical
Publication of CN114302174B publication Critical patent/CN114302174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Television Signal Processing For Recording (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The application discloses a video editing method, apparatus, computing device and storage medium. An original video to be processed is obtained, and a plurality of key positions and transition positions in the original video are identified, where the key positions are used to indicate video clips in the original video. The original video is then segmented into a plurality of video clips according to the plurality of key positions and the transition positions, and the clips are spliced into a target video whose playing duration is shorter than that of the original video. Because the target video is generated according to the transition positions, the video content of each clip in the target video is generally complete at its start and/or end. Therefore, when watching the target video, the user usually perceives a higher visual continuity owing to the completeness of the video content, which improves the user's experience of watching the clipped target video.

Description

Video editing method, device, computing equipment and storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video editing method, apparatus, computing device, and storage medium.
Background
In practical application scenarios, a video with a long playing duration is usually clipped to generate a clipped video with a relatively short playing duration that contains the core video content. For example, under the talk show variety channel of an internet video website, short "laugh point" video clips edited from the full episodes are usually released, so that viewers can quickly watch all of the funny moments.
Generating clipped videos by manual editing not only incurs high labor costs but is also generally inefficient. Therefore, clipped videos may instead be generated automatically by artificial intelligence (AI) based clipping. However, a video clip generated by AI often suffers from a discontinuous look and feel, for example a person in the clip is cut off before finishing speaking, which degrades the user's viewing experience of the clip.
Disclosure of Invention
The embodiments of the present application provide a video editing method, apparatus, computing device and storage medium, which aim to improve the visual continuity of automatically generated clipped videos and thereby improve the user's viewing experience of the clipped videos.
In a first aspect, an embodiment of the present application provides a video editing method, including:
Acquiring an original video to be processed;
identifying a plurality of key locations in the original video and a transition location, the key locations being used to indicate video clips in the original video;
According to the key positions and the transition positions, a plurality of video clips are obtained by segmentation from the original video;
and splicing the plurality of video clips to obtain a target video, wherein the playing time length of the target video is smaller than that of the original video.
In one possible implementation, the identifying a transition location in the original video includes:
calculating the similarity between the adjacent first frame image and the second frame image in the original video;
and when the similarity between the first frame image and the second frame image is smaller than a preset threshold value, determining the position of the first frame image or the second frame image in the original video as the transition position.
In one possible implementation manner, the slicing a plurality of video segments from the original video according to the plurality of key positions and the transition position includes:
Determining a start segmentation point and a stop segmentation point corresponding to a plurality of candidate video segments in the original video according to a plurality of key positions in the original video;
Determining whether a transition position is included in a multi-frame first video image in the original video whose distance from a start segmentation point of a target candidate video segment is not greater than a first preset distance, and whether a transition position is included in a multi-frame second video image in the original video whose distance from a stop segmentation point of the target candidate video segment is not greater than a second preset distance, wherein the target candidate video segment is any one of the plurality of candidate video segments;
When the multi-frame first video image comprises a transition position and/or the multi-frame second video image comprises a transition position, the target candidate video segment is obtained by segmentation from the original video according to the transition position in the multi-frame first video image and/or the transition position in the multi-frame second video image.
In one possible implementation, the original video is a first type of video, and the key locations in the original video are identified by audio features in the original video.
In one possible implementation, the audio content corresponding to the key locations in the original video is laughter and/or applause, and the identifying the plurality of key locations in the original video includes:
Inputting the original video into an artificial intelligence (AI) model to obtain a plurality of key positions in the original video output by the AI model, wherein the AI model is trained in advance with sample videos carrying laughter marks and/or applause marks;
or matching the voiceprint features of the audio data in the original video against the voiceprint features of audio data corresponding to laughter and/or applause, to obtain the plurality of key positions where the voiceprint features match.
In one possible implementation, the original video is a second type of video, and the key locations in the original video are identified by image features in the original video.
In one possible implementation, the identifying a plurality of key locations in the original video includes:
Determining a plurality of initial key positions from the original video;
And adjusting the plurality of initial key positions by utilizing an optical character recognition technology to obtain the plurality of key positions, so that each key position is a position for starting displaying the caption or ending displaying the caption.
In a second aspect, an embodiment of the present application further provides a video editing apparatus, where the apparatus includes:
the acquisition module is used for acquiring an original video to be processed;
A position identification module for identifying a plurality of key positions in the original video and a transition position, wherein the key positions are used for indicating video clips in the original video;
the segmentation module is used for segmenting the original video to obtain a plurality of video clips according to the plurality of key positions and the transition positions;
and the splicing module is used for splicing the plurality of video clips to obtain a target video, wherein the playing time length of the target video is smaller than that of the original video.
In one possible embodiment, the location identification module includes:
A calculating unit, configured to calculate a similarity between adjacent first frame images and second frame images in the original video;
And the first determining unit is used for determining the position of the first frame image or the second frame image in the original video as the transition position when the similarity between the first frame image and the second frame image is smaller than a preset threshold value.
In one possible embodiment, the segmentation module includes:
The second determining unit is used for determining starting segmentation points and ending segmentation points corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
A third determining unit, configured to determine whether a transition position is included in a multi-frame first video image in the original video, where a distance between the multi-frame first video image and a start division point of a target candidate video segment is not greater than a first preset distance, and whether a transition position is included in a multi-frame second video image in the original video, where a distance between the multi-frame second video image and a stop division point of the target candidate video segment is not greater than a second preset distance, where the target candidate video segment is any one of the plurality of candidate video segments;
and the segmentation unit is used for segmenting the target candidate video segment from the original video according to the transition position in the multi-frame first video image and/or the transition position in the multi-frame second video image when the multi-frame first video image comprises the transition position and/or the multi-frame second video image comprises the transition position.
In one possible implementation, the original video is a first type of video, and the key locations in the original video are identified by audio features in the original video.
In one possible implementation manner, the audio content corresponding to the key position in the original video is laughter and/or applause, and the position identifying module includes:
The first identification unit is used for inputting the original video into an artificial intelligence (AI) model to obtain a plurality of key positions in the original video output by the AI model, the AI model being trained in advance with sample videos carrying laughter marks and/or applause marks;
or,
the second identification unit is used for matching the voiceprint features of the audio data in the original video against the voiceprint features of audio data corresponding to laughter and/or applause, to obtain the plurality of key positions where the voiceprint features match.
In one possible implementation, the original video is a second type of video, and the key locations in the original video are identified by image features in the original video.
In one possible embodiment, the location identification module includes:
A fourth determining unit, configured to determine a plurality of initial key positions from the original video;
and the adjusting unit is used for adjusting the plurality of initial key positions by utilizing an optical character recognition technology to obtain the plurality of key positions, so that each key position is a position for starting displaying the caption or ending displaying the caption.
In a third aspect, embodiments of the present application also provide a computing device that may include a processor and a memory:
The memory is used for storing a computer program;
The processor is configured to execute the method according to the first aspect and any implementation manner of the first aspect according to the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium is configured to store a computer program, where the computer program is configured to perform the method according to any one of the foregoing first aspect and any implementation manner of the first aspect.
In the above implementation manner of the embodiment of the present application, an original video to be processed is obtained, and a plurality of key positions and transition positions in the original video are identified, where the key positions are used to indicate video segments in the original video, so that a plurality of video segments can be obtained by slicing from the original video according to the plurality of key positions and transition positions, and a target video is obtained by stitching based on the plurality of video segments, where a playing duration of the generated target video is less than a playing duration of the original video.
Because a transition position in the original video generally marks the end of a piece of continuous video content, when the target video is generated according to the transition positions, the video content of each clip in the target video is generally complete at its start and/or end. Therefore, when watching the target video, the user usually perceives a higher visual continuity owing to the completeness of the video content, which improves the user's experience of watching the clipped target video.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings for those of ordinary skill in the art.
FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;
FIG. 2 is a flowchart of a video editing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a video editing apparatus according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a hardware structure of a computing device according to an embodiment of the present application.
Detailed Description
Referring to fig. 1, a schematic diagram of an application scenario is provided in an embodiment of the present application. In the application scenario illustrated in fig. 1, a client 101 may have a communication connection with a computing device 102. The client 101 may receive a video provided by a user (e.g., a video clipper) and send it to the computing device 102. The computing device 102 is configured to perform AI clipping on one or more received videos, generate a clipped video, and present the clipped video to the user via the client 101.
The computing device 102 refers to a device with data processing capability, and may be, for example, a terminal, a server, etc. The client 101 may run on a physical device separate from the computing device 102; for example, when the computing device 102 is implemented by a server, the client 101 may run on a user terminal on the user side. Alternatively, the client 101 may also run on the computing device 102 itself.
In practical applications, a video clip generated by the computing device 102 based on a preset AI algorithm often has a problem of discontinuous look and feel, which degrades the viewing experience of the clip. For example, assume the original video contains a conversation in which person A asks "Haven't you been exercising lately?" and person B answers "No." If the AI algorithm starts cutting at the point where person A's speech ends, the resulting clip contains only person B's answer ("No"), and the user watching the clip feels confused because it is not clear what question person B is answering. As another example, person A may have spoken several sentences in a row, but the computing device 102 only intercepts a clip covering part of those sentences based on the AI algorithm, so that a significant part of person A's speech content is lost.
Based on the above, the embodiment of the present application provides a video editing method, which aims to improve the visual continuity of the generated clipped video and thereby improve the user's viewing experience of the clipped video. Specifically, the computing device 102 obtains an original video to be processed and identifies a plurality of key positions and transition positions in the original video, where the key positions are used to indicate video segments in the original video. The computing device 102 can then segment the original video according to the plurality of key positions and the transition positions to obtain a plurality of video segments, and splice the plurality of video segments to obtain a target video, where the playing duration of the generated target video is less than the playing duration of the original video.
Because a transition position in the original video generally marks the end of a piece of continuous video content, when the target video is generated according to the transition positions (i.e., the aforementioned clipped video is generated), the video content of each clip in the target video is generally complete at its start and/or end. Therefore, when watching the target video, the user usually perceives a higher visual continuity owing to the completeness of the video content, which improves the user's experience of watching the clipped target video.
It should be noted that, the video in this embodiment refers to a video having both an image and audio content, that is, a video file, which includes not only video images of consecutive frames but also audio data synchronized with the video images.
It can be appreciated that the architecture of the application scenario shown in fig. 1 is only an example provided by the embodiment of the present application, and the embodiment of the present application may be applied to other applicable scenarios in practical application, for example, the computing device 102 may automatically obtain one or more videos from the internet, and automatically generate clip videos corresponding to the respective videos through the implementation manner described above. In summary, the embodiments of the present application may be applied to any applicable scenario, and are not limited to the scenario examples described above.
In order that the above objects, features and advantages of the present application will be more readily understood, a more particular description of various non-limiting embodiments of the application will be rendered by reference to the appended drawings. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 2, fig. 2 shows a flowchart of a video editing method according to an embodiment of the present application, where the method may be applied to the application scenario shown in fig. 1, or may be applied to other applicable application scenarios, etc. For convenience of explanation and understanding, the following description is given by taking an application scenario shown in fig. 1 as an example. The method specifically comprises the following steps:
S201, acquiring an original video to be processed.
For convenience of distinction and description, the video to be clipped is referred to as an original video, and the video generated for clipping is referred to as a target video in this embodiment.
In one possible implementation, the original video may be provided to the computing device 102 by a user. Specifically, the client 101 may present a video import interface to the user, so that the user may import an original video into the client 101 by performing a corresponding operation on the video import interface. The client 101 may then transmit the original video provided by the user to the computing device 102 over a network connection with the computing device 102.
In yet another possible implementation, the original video may also be obtained by the computing device 102 from the Internet. For example, a user may send an instruction to generate a clipped video to the computing device 102 through the client 101, so that the computing device 102 downloads videos of a particular type from the internet, such as talk show videos, crosstalk videos, or observation-type variety show videos, and takes these videos as original videos for subsequent clipping.
It should be noted that the original video acquired by the computing device 102 may be a single video or multiple videos; for example, the computing device 102 may clip and generate one target video based on multiple original videos, which is not limited in this embodiment. For ease of understanding and explanation, this embodiment takes a single original video as an example; when there are multiple original videos, the implementation is similar, the difference being that the video clips spliced later originate from several different original videos.
S202, identifying a plurality of key positions in the original video and a transition position, wherein the key positions are used for indicating video fragments in the original video.
The key position is used for indicating the position of the video segment in the original video, and the starting segmentation point and the ending segmentation point of the clipped video segment can be determined based on the key position when the original video is clipped. And, for different types of original video, different categories of locations may be employed as key locations.
In one example, the computing device 102 may identify key locations through audio features in the original video. For example, when the original video is a first type of video, such as a talk show, crosstalk, or comedy video, the computing device 102 may determine the locations in the original video where the audio content is laughter and/or applause as key locations, because in practical application scenarios the video content accompanied by laughter and/or applause is generally more attractive to users.
In one implementation of identifying key locations, the computing device 102 may identify a plurality of key locations in the original video using the AI model. In particular implementations, the AI model may be preconfigured in the computing device 102 and trained in advance with the video samples having "laughter" and/or "applause" markers, such that the trained AI model may identify "laughter" and/or "applause" in the video. Thus, for a first type of raw video, the computing device 102 may input the raw video into a trained AI model, resulting in a plurality of key locations in the raw video output by the AI model.
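As a rough illustration of this inference step only (the patent does not disclose a concrete model architecture), the following sketch runs a hypothetical pretrained laughter/applause classifier over sliding audio windows; the model class, the window length, the hop size, and the score threshold are all assumptions made for illustration:

```python
# Hypothetical sketch: slide a pretrained laughter/applause classifier over the
# audio track to obtain key positions (in seconds). The model itself, the window
# sizes and the 0.8 threshold are assumptions, not details from the patent.
import torch
import torchaudio

WINDOW_SEC = 2.0   # length of each audio window fed to the model (assumed)
HOP_SEC = 0.5      # stride between windows (assumed)

def find_key_positions(audio_path: str, model: torch.nn.Module, threshold: float = 0.8):
    waveform, sr = torchaudio.load(audio_path)            # (channels, samples)
    mono = waveform.mean(dim=0)                            # mix down to mono
    win, hop = int(WINDOW_SEC * sr), int(HOP_SEC * sr)
    key_positions = []
    model.eval()
    with torch.no_grad():
        for start in range(0, mono.numel() - win, hop):
            chunk = mono[start:start + win].unsqueeze(0)   # (1, samples)
            prob = torch.sigmoid(model(chunk)).item()      # laughter/applause score
            if prob >= threshold:
                key_positions.append(start / sr)           # key position in seconds
    return key_positions
```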
In yet another implementation of identifying key locations, the computing device 102 may determine a plurality of key locations in the original video by comparing voiceprint features. Specifically, the computing device 102 may obtain audio data containing "laughter" and/or "applause", extract the voiceprint features of the "laughter" and/or "applause", and then compare these voiceprint features, segment by segment, with the voiceprint features corresponding to the audio data in the original video, and determine the locations of the audio segments whose voiceprint features match as key locations, thereby determining a plurality of key locations in the original video.
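A minimal sketch of this voiceprint-style matching, under the assumption that averaged MFCC features stand in for the voiceprint features and cosine similarity serves as the matching criterion (neither choice is specified by the patent; the segment length and threshold are likewise illustrative):

```python
# Sketch: compare MFCC "voiceprint" features of reference laughter/applause audio
# against sliding segments of the original video's audio track.
import numpy as np
import librosa

def mfcc_fingerprint(y: np.ndarray, sr: int) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # (20, frames)
    return mfcc.mean(axis=1)                               # average over time

def match_key_positions(video_audio: str, reference_laughter: str,
                        seg_sec: float = 1.0, threshold: float = 0.9):
    ref_y, ref_sr = librosa.load(reference_laughter, sr=None)
    ref_vec = mfcc_fingerprint(ref_y, ref_sr)
    y, sr = librosa.load(video_audio, sr=ref_sr)           # resample to match reference
    seg = int(seg_sec * sr)
    positions = []
    for start in range(0, len(y) - seg, seg):
        vec = mfcc_fingerprint(y[start:start + seg], sr)
        cos = np.dot(ref_vec, vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(vec) + 1e-9)
        if cos >= threshold:
            positions.append(start / sr)                   # segment start treated as a key position
    return positions
```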
It should be noted that the above are merely some examples of how the key positions may be determined; in practical application, the computing device 102 may also determine the key positions in the original video in other ways, which is not limited in this embodiment.
In yet another example, the computing device 102 may identify key locations through image features in the original video. For example, when the original video is a second type of video, such as an observation-type variety show video, the position at which a subtitle starts to be displayed and/or the position at which the subtitle stops being displayed may be determined as a key location. Of course, in other embodiments, the key location may be defined in other possible ways, which is not limited in this embodiment.
In one possible implementation, the computing device 102 may determine a plurality of initial key positions from the original video, and adjust the plurality of initial key positions by an optical character recognition (Optical Character Recognition, OCR) technique to obtain a plurality of key positions, such that each key position is a position where a subtitle begins to be displayed or a subtitle ends to be displayed. For example, the computing device 102 may randomly select two positions as a start position and an end position of a video according to a play duration (such as 30 seconds, etc.), so as to obtain two initial key positions corresponding to the video. The computing device 102 may then recognize the subtitle in the video image at the starting location using OCR technology and take the starting display location of the subtitle in the original video as a key location. Also, the computing device 102 may also recognize a subtitle in the video image at the termination position using OCR technology, and take the end display position of the subtitle in the original video as a key position, or the like. In this manner, computing device 102 may determine key locations for each of the plurality of video clips in the manner described above.
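One way such an OCR-based adjustment could be sketched is shown below, using OpenCV and Tesseract; scanning backwards from the initial key position, the scan step and range, and restricting OCR to the bottom of the frame are assumptions made for illustration (Tesseract's Chinese language data, `chi_sim`, is also assumed to be installed):

```python
# Sketch: move an initial key position (a frame index) back to the frame where the
# current subtitle first appears, by running OCR on sampled frames.
import cv2
import pytesseract

def read_subtitle(cap: cv2.VideoCapture, frame_idx: int) -> str:
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    if not ok:
        return ""
    h = frame.shape[0]
    bottom = frame[int(h * 0.8):, :]                       # subtitles usually sit near the bottom
    return pytesseract.image_to_string(bottom, lang="chi_sim").strip()

def adjust_to_subtitle_start(video_path: str, initial_key: int,
                             max_back: int = 150, step: int = 5) -> int:
    cap = cv2.VideoCapture(video_path)
    target = read_subtitle(cap, initial_key)               # subtitle shown at the initial key position
    adjusted = initial_key
    for idx in range(initial_key - step, max(initial_key - max_back, 0), -step):
        if target and read_subtitle(cap, idx) == target:
            adjusted = idx                                  # same subtitle still shown earlier
        else:
            break                                           # subtitle changed: start of display found
    cap.release()
    return adjusted
```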
The transition position refers to a position where the type of person shown in the original video switches, for example, a position where the view switches from a performer to the audience or guest seats, or to a scene that contains no person; it is typically a position where the type of person being filmed changes because the shooting camera switches. In practice, when the person type in the original video switches, it usually indicates that a piece of video content related to that person has temporarily ended. For example, when a topic or point in the person's speech ends, the camera usually cuts to the audience to capture the viewers' reaction (such as laughing, nodding, and so on) to the speech, so the computing device 102 can use the transition position to determine a boundary point of the clipped video segment when clipping the original video. Of course, in other possible embodiments, the transition position may also be a position where other information in the original video switches, such as a scene change, which is not limited in this embodiment.
In one possible implementation, the computing device 102 may determine the transition location by comparing the difference between two adjacent frames of images. In specific implementation, the computing device 102 may calculate the similarity between adjacent first and second frame images in the original video, and when the similarity between the first frame image and the second frame image is less than a preset threshold, the computing device 102 may determine the position of the first frame image or the second frame image in the original video as a transition position. For example, the first frame image is an image obtained by photographing a performer, and the second frame image is an image obtained by photographing the audience, and so on. The computing device 102 may determine all transition positions in the original video by traversal, e.g., by sequentially comparing the image similarity between every two consecutive adjacent video images. Alternatively, the computing device 102 may perform the similarity calculation only on the multiple frames of images near a key location, to determine whether a transition position exists near that key location.
For example, when calculating the similarity between two frames of images, the computing device 102 may first reduce each of the two frames to a size of 8 pixels by 8 pixels, so that each reduced frame has 64 pixels. This step removes the details of the image and keeps only basic information such as structure and brightness, reducing the subsequent amount of computation. The computing device 102 may then convert the two reduced frames to grayscale and calculate the average gray value of each frame (i.e., the average of the 64 gray values in that frame). Next, the computing device 102 compares the gray value of each pixel in a frame with the average gray value of that frame, marking the pixel as 1 when its gray value is greater than or equal to the average and as 0 when it is less than the average; the 64 pixel marks of each frame are then combined according to a unified rule to generate a 64-bit hash value (composed of 1s and 0s), which can be used as the fingerprint of that frame. In this way, the computing device 102 can compare the 64-bit hash values of the two frames: when the number of differing bits between the two hash values exceeds a preset value (e.g., 5, etc.), the computing device 102 determines that the two frames are dissimilar, and when the number of differing bits does not exceed the preset value, it determines that the two frames are similar. In practice, the computing device 102 may also determine the similarity between two frames by other methods, which is not limited in this embodiment.
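A compact sketch of the average-hash comparison described above, applied to consecutive frames: the 8x8 reduction, the 64-bit hash and the differing-bit cutoff of 5 follow the description, while reading the frames with OpenCV and treating every consecutive pair as adjacent are implementation assumptions.

```python
# Sketch: detect transition positions by comparing 64-bit average hashes of
# adjacent frames; a Hamming distance above MAX_DIFF_BITS marks a transition.
import cv2
import numpy as np

MAX_DIFF_BITS = 5   # preset value from the description above

def average_hash(frame: np.ndarray) -> np.ndarray:
    small = cv2.resize(frame, (8, 8))                      # keep structure/brightness, drop detail
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return (gray >= gray.mean()).flatten()                 # 64 boolean "bits"

def find_transitions(video_path: str) -> list[int]:
    cap = cv2.VideoCapture(video_path)
    transitions, prev_hash, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cur_hash = average_hash(frame)
        if prev_hash is not None and int(np.count_nonzero(prev_hash != cur_hash)) > MAX_DIFF_BITS:
            transitions.append(idx)                        # frame index of the detected transition
        prev_hash, idx = cur_hash, idx + 1
    cap.release()
    return transitions
```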
And S203, segmenting the original video to obtain a plurality of video fragments according to the identified key positions and the transition positions.
In this embodiment, the computing device 102 may segment the original video according to the plurality of key positions and the transition positions to obtain a plurality of video clips.
In one possible implementation, the computing device 102 may first determine a start segmentation point and an end segmentation point for a plurality of candidate video segments in the original video based on a plurality of key locations in the original video. The starting segmentation point refers to a starting point of a candidate video segment, and a video image at the starting segmentation point is a first frame image of the candidate video segment. Correspondingly, the termination division point refers to the termination point of the candidate video segment, and the video image at the termination division point is the last frame image of the candidate video segment. For example, when the original video is a first type of video, the computing device 102 may determine a play position at the first 15 (or other value) seconds of the key position as a start split point of the candidate video segment and a play position at the last 1 (or other value) seconds of the key position as a stop split point of the candidate video segment.
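A trivial sketch of deriving candidate split points from key positions, using the 15-second/1-second offsets given as example values above (the clamping to the video duration is an added assumption):

```python
# Sketch: turn key positions (in seconds) into candidate (start, stop) split points.
def candidate_segments(key_positions: list[float], video_duration: float,
                       before_sec: float = 15.0, after_sec: float = 1.0):
    segments = []
    for key in key_positions:
        start = max(key - before_sec, 0.0)                 # start split point before the key position
        stop = min(key + after_sec, video_duration)        # stop split point after the key position
        segments.append((start, stop))
    return segments
```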
The computing device 102 may then determine whether a transition position is included in the multi-frame first video images in the original video whose distance from the start segmentation point of a target candidate video segment (any one of the plurality of candidate video segments) is not greater than a first preset distance, and whether a transition position is included in the multi-frame second video images whose distance from the termination segmentation point of the target candidate video segment is not greater than a second preset distance. The distance between a video frame and the start segmentation point may be characterized by the number of video frames (or the playing duration corresponding to that number of frames, etc.) between the position of the video image in the original video and the start segmentation point: the larger the number of video frames, the farther the distance; the smaller the number, the closer the distance. Accordingly, the first preset distance may be, for example, a preset number of video frames (or the corresponding playing duration, etc.), such as 450 consecutive frames. Similarly, the distance from the termination segmentation point may be characterized by the number of video frames between the position of the video image in the original video and the termination segmentation point, or by the corresponding playing duration, etc. Correspondingly, the second preset distance may be a preset number of video frames, or the corresponding playing duration, etc. The first preset distance and the second preset distance may be the same or different.
When the multiple frames of first video images include transition positions and/or the multiple frames of second video images include transition positions, the computing device 102 may segment the original video to obtain target candidate video segments according to the transition positions in the multiple frames of first video images and/or the transition positions in the multiple frames of second video images. Specifically, when the multi-frame first video image includes a transition position, the computing device 102 may update the start segmentation point of the target candidate video segment to the transition position in the multi-frame first video image, and segment the target candidate video segment from the original video according to the transition position and the end segmentation point of the target candidate video segment. When the transition position is included in the multi-frame second video image, the computing device 102 may update the termination segmentation point of the target candidate video segment to the transition position in the multi-frame second video image, and segment the target candidate video segment from the original video according to the transition position and the start segmentation point of the target candidate video segment. When the transition positions are included in the multi-frame first video image and the multi-frame second video image, the computing device 102 may update the start segmentation point of the target candidate video segment to the transition position in the multi-frame first video image, update the end segmentation point of the target candidate video segment to the transition position in the multi-frame second video image, and segment the original video according to the two transition positions to obtain the target candidate video segment. In this manner, computing device 102 may segment multiple video clips according to multiple key locations and transition locations.
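This adjustment can be sketched as follows on frame indices; the helper that searches for a transition near a split point, and the example distances of 450 and 60 frames (taken from the examples in this description), are illustrative assumptions:

```python
# Sketch: snap a candidate segment's start/stop split points to nearby transition
# positions (frame indices), if any lie within the preset distances.
from bisect import bisect_left

def nearest_transition(transitions: list[int], point: int, max_dist: int):
    """Return the transition closest to `point` within `max_dist` frames, else None.
    `transitions` is assumed to be sorted in ascending order."""
    i = bisect_left(transitions, point)
    candidates = [t for t in transitions[max(i - 1, 0):i + 1] if abs(t - point) <= max_dist]
    return min(candidates, key=lambda t: abs(t - point)) if candidates else None

def adjust_segment(start: int, stop: int, transitions: list[int],
                   first_dist: int = 450, second_dist: int = 60):
    t_start = nearest_transition(transitions, start, first_dist)
    t_stop = nearest_transition(transitions, stop, second_dist)
    if t_start is not None:
        start = t_start        # update the start split point to the transition position
    if t_stop is not None:
        stop = t_stop          # update the stop split point to the transition position
    return start, stop
```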
Because the computing device 102 may adjust the start segmentation point and/or the termination segmentation point of the target candidate video segment based on the transition position, the video segments clipped based on the transition position typically contain relatively complete video content, so the problem of low visual consistency can be avoided when the user views the video clip. For example, assuming that the original video includes a conversation between person A and person B in which person A asks "Haven't you been exercising lately?" and person B answers "No," and that a shot transition occurs after person B finishes speaking, the computing device 102 can determine, based on the identified transition position, that the termination segmentation point of the video clip lies after person B finishes saying "No." Because the video clip then contains the complete conversation between person A and person B, the user perceives a relatively high consistency of look and feel when viewing it.
S204, based on the plurality of video clips, splicing to obtain a target video, wherein the playing time length of the target video is smaller than that of the original video.
After clipping out the video clips from the original video, the computing device 102 may splice the video clips to generate the target video. The computing device 102 may sequentially splice the video clips according to the playing order of the video clips in the original video, or may splice the video clips in other orders, which is not limited in this embodiment.
In practical application, the playing time of the target video generated by clipping has a certain limit. For example, for an original video whose play duration is 2 hours, the play duration of a target video for which clip generation is performed may not exceed 10 minutes. Therefore, the playing time of the target video generated by the video clip is generally smaller than that of the original video.
Further, if the total playing duration corresponding to the multiple video clips obtained by segmentation is greater than the maximum playing duration of the target video to be generated, the computing device 102 selects a part of video clips from the multiple video clips to generate the target video. For example, the computing device 102 may select a portion of the plurality of video segments having a relatively long playing duration to generate the target video, or may randomly select a portion of the video segments to generate the target video, which is not limited in this embodiment.
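As an illustrative sketch of this selection-and-splicing step (the moviepy 1.x API, working in seconds, and the longest-first selection rule are assumptions; the description only requires that the total duration stay within the limit):

```python
# Sketch: keep the longest clips until the duration budget is reached, then splice
# them back in their original playback order and export the target video.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def build_target_video(original_path: str, segments: list[tuple[float, float]],
                       max_total_sec: float, out_path: str = "target.mp4"):
    source = VideoFileClip(original_path)
    clips = [(s, e, source.subclip(s, e)) for s, e in segments]   # (start, end, clip)
    chosen, total = [], 0.0
    for s, e, clip in sorted(clips, key=lambda c: c[2].duration, reverse=True):
        if total + clip.duration <= max_total_sec:
            chosen.append((s, e, clip))
            total += clip.duration
    chosen.sort(key=lambda c: c[0])                               # restore playback order
    target = concatenate_videoclips([c[2] for c in chosen])
    target.write_videofile(out_path)
```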
In this embodiment, because a transition position in the original video generally marks the end of a piece of continuous video content, when the target video is generated according to the transition positions, the video content of each clip in the target video is generally complete at its start and/or end. Therefore, when watching the target video, the user usually perceives a higher visual continuity owing to the completeness of the video content, which improves the user's experience of watching the clipped target video.
For ease of understanding and explanation, the following takes the original video as a talk show video and as an observation-type variety show video, respectively, as examples.
After obtaining a talk show video, the computing device 102 may identify, through the AI model, the locations in the video where laughter and/or applause occurs, or determine these key locations by voiceprint feature comparison, and so on. Then, for each key location, the computing device 102 may further determine the play position 15 seconds before the key location (as a start segmentation point) and the play position 3 seconds after it (as a termination segmentation point), and identify whether a transition position is included in the multi-frame first video image between the key location and the play position 15 seconds before it, and whether a transition position is included in the multi-frame second video image between the key location and the play position 3 seconds after it. If the multi-frame first video image includes a transition position, the computing device 102 may adjust the start segmentation point of the video segment to that transition position, and/or, if the multi-frame second video image includes a transition position, adjust the termination segmentation point of the video segment to that transition position, and then segment the video segment from the original video based on the adjusted start segmentation point and/or termination segmentation point. In this way, the computing device 102 may segment multiple video segments including laughter and/or applause for the multiple key locations. Finally, the computing device 102 may splice the segmented video segments to generate a talk show highlight video, i.e., the clipped video desired by the user.
After obtaining an observation-type variety show video, the computing device 102 may intercept candidate video segments of a preset duration (for example, a playing duration of 30 seconds) from the video, where the start segmentation point and termination segmentation point of each candidate segment may be determined preliminarily using OCR technology; specifically, the position where one subtitle starts to be displayed may be used as the start segmentation point of the candidate segment and the position where another subtitle stops being displayed may be used as its termination segmentation point, with the playing duration of the multi-frame video images between the two points close to the preset duration. Then, the computing device 102 may identify whether a transition position is included in the multi-frame first video image in the variety show video whose distance from the start segmentation point is not greater than a first preset distance (e.g., 90 frames of video images, etc.), and whether a transition position is included in the multi-frame second video image whose distance from the termination segmentation point is not greater than a second preset distance (e.g., 60 frames of video images, etc.). If the multi-frame first video image includes a transition position, the computing device 102 may adjust the start segmentation point of the video segment to that transition position, and/or, if the multi-frame second video image includes a transition position, adjust the termination segmentation point of the video segment to that transition position, and then segment the video segment from the original video based on the adjusted start segmentation point and/or termination segmentation point. In this way, multiple video segments can be cut from the observation-type variety show video. Finally, the computing device 102 may splice these video segments to generate the clipped video of the observation-type variety show video.
In addition, the embodiment of the application also provides a video editing device. Referring to fig. 3, fig. 3 is a schematic structural diagram of a video editing apparatus according to an embodiment of the present application, and the video editing apparatus 300 includes:
An acquisition module 301, configured to acquire an original video to be processed;
a location identification module 302, configured to identify a plurality of key locations in the original video and a transition location, where the key locations are used to indicate video clips in the original video;
the segmentation module 303 is configured to segment the original video to obtain a plurality of video segments according to the plurality of key positions and the transition position;
and the splicing module 304 is configured to splice the plurality of video clips to obtain a target video, where a playing duration of the target video is less than a playing duration of the original video.
In one possible implementation, the location identification module 302 includes:
A calculating unit, configured to calculate a similarity between adjacent first frame images and second frame images in the original video;
And the first determining unit is used for determining the position of the first frame image or the second frame image in the original video as the transition position when the similarity between the first frame image and the second frame image is smaller than a preset threshold value.
In one possible implementation, the splitting module 303 includes:
The second determining unit is used for determining starting segmentation points and ending segmentation points corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
A third determining unit, configured to determine whether a transition position is included in a multi-frame first video image in the original video, where a distance between the multi-frame first video image and a start division point of a target candidate video segment is not greater than a first preset distance, and whether a transition position is included in a multi-frame second video image in the original video, where a distance between the multi-frame second video image and a stop division point of the target candidate video segment is not greater than a second preset distance, where the target candidate video segment is any one of the plurality of candidate video segments;
and the segmentation unit is used for segmenting the target candidate video segment from the original video according to the transition position in the multi-frame first video image and/or the transition position in the multi-frame second video image when the multi-frame first video image comprises the transition position and/or the multi-frame second video image comprises the transition position.
In one possible implementation, the original video is a first type of video, and the key locations in the original video are identified by audio features in the original video.
In one possible implementation, the audio content corresponding to the key location in the original video is laughter and/or applause, and the location identification module 302 includes:
The first identification unit is used for inputting the original video into an artificial intelligence (AI) model to obtain a plurality of key positions in the original video output by the AI model, the AI model being trained in advance with sample videos carrying laughter marks and/or applause marks;
or,
the second identification unit is used for matching the voiceprint features of the audio data in the original video against the voiceprint features of audio data corresponding to laughter and/or applause, to obtain the plurality of key positions where the voiceprint features match.
In one possible implementation, the original video is a second type of video, and the key locations in the original video are identified by image features in the original video.
In one possible implementation, the location identification module 302 includes:
A fourth determining unit, configured to determine a plurality of initial key positions from the original video;
and the adjusting unit is used for adjusting the plurality of initial key positions by utilizing an optical character recognition technology to obtain the plurality of key positions, so that each key position is a position for starting displaying the caption or ending displaying the caption.
It should be noted that, because the content of information interaction and execution process between each module and unit of the above-mentioned device is based on the same concept as the method embodiment in the embodiment of the present application, the technical effects brought by the content are the same as the method embodiment in the embodiment of the present application, and the specific content can be referred to the description in the foregoing method embodiment shown in the embodiment of the present application, which is not repeated here.
In addition, the embodiment of the application also provides a computing device. Referring to fig. 4, fig. 4 illustrates a schematic hardware architecture of a computing device 400 in an embodiment of the application, where the computing device 400 may include a processor 401 and a memory 402.
Wherein the memory 402 is configured to store a computer program;
the processor 401 is configured to execute the following steps according to the computer program:
Acquiring an original video to be processed;
identifying a plurality of key locations in the original video and a transition location, the key locations being used to indicate video clips in the original video;
According to the key positions and the transition positions, a plurality of video clips are obtained by segmentation from the original video;
and splicing the plurality of video clips to obtain a target video, wherein the playing time length of the target video is smaller than that of the original video.
In a possible implementation, the processor 401 is specifically configured to perform the following steps according to the computer program:
calculating the similarity between the adjacent first frame image and the second frame image in the original video;
and when the similarity between the first frame image and the second frame image is smaller than a preset threshold value, determining the position of the first frame image or the second frame image in the original video as the transition position.
In a possible implementation, the processor 401 is specifically configured to perform the following steps according to the computer program:
Determining a start segmentation point and a stop segmentation point corresponding to a plurality of candidate video segments in the original video according to a plurality of key positions in the original video;
Determining whether a transition position is included in a multi-frame first video image in the original video whose distance from a start segmentation point of a target candidate video segment is not greater than a first preset distance, and whether a transition position is included in a multi-frame second video image in the original video whose distance from a stop segmentation point of the target candidate video segment is not greater than a second preset distance, wherein the target candidate video segment is any one of the plurality of candidate video segments;
When the multi-frame first video image comprises a transition position and/or the multi-frame second video image comprises a transition position, the target candidate video segment is obtained by segmentation from the original video according to the transition position in the multi-frame first video image and/or the transition position in the multi-frame second video image.
In one possible implementation, the original video is a first type of video, and the key locations in the original video are identified by audio features in the original video.
In a possible implementation manner, the audio content corresponding to the key position in the original video is laughter and/or applause, and the processor 401 is specifically configured to perform the following steps according to the computer program:
Inputting the original video into an artificial intelligence (AI) model to obtain a plurality of key positions in the original video output by the AI model, wherein the AI model is trained in advance with sample videos carrying laughter marks and/or applause marks;
or matching the voiceprint features of the audio data in the original video against the voiceprint features of audio data corresponding to laughter and/or applause, to obtain the plurality of key positions where the voiceprint features match.
In one possible implementation, the original video is a second type of video, and the key locations in the original video are identified by image features in the original video.
In a possible implementation, the processor 401 is specifically configured to perform the following steps according to the computer program:
Determining a plurality of initial key positions from the original video;
And adjusting the plurality of initial key positions by utilizing an optical character recognition technology to obtain the plurality of key positions, so that each key position is a position for starting displaying the caption or ending displaying the caption.
In addition, the embodiment of the application also provides a computer readable storage medium for storing a computer program for executing the method described in the embodiment of the method.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus general hardware platforms. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the method according to the embodiments or some parts of the embodiments of the present application.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing description of the exemplary embodiments of the application is merely illustrative of the application and is not intended to limit the scope of the application.

Claims (9)

1. A method of video editing, the method comprising:
Acquiring an original video to be processed;
Identifying a plurality of key positions and transition positions in the original video, wherein the key positions are used for indicating a start segmentation point and a stop segmentation point of a video segment in the original video, and the transition positions are positions where the character types in the original video are switched;
segmenting the original video into a plurality of video segments according to the plurality of key positions and the transition positions; and
splicing the plurality of video segments to obtain a target video, wherein a playing duration of the target video is shorter than a playing duration of the original video;
wherein segmenting the original video into the plurality of video segments according to the plurality of key positions and the transition positions comprises:
determining, according to the plurality of key positions in the original video, a start segmentation point and a stop segmentation point corresponding to each of a plurality of candidate video segments in the original video;
determining whether a transition position is included in a plurality of first video frames in the original video whose distance from the start segmentation point of a target candidate video segment is not more than a first preset distance, and whether a transition position is included in a plurality of second video frames in the original video whose distance from the stop segmentation point of the target candidate video segment is not more than a second preset distance, wherein the target candidate video segment is any one of the candidate video segments; and
when the plurality of first video frames include a transition position and/or the plurality of second video frames include a transition position, segmenting the target candidate video segment from the original video according to the transition position in the plurality of first video frames and/or the transition position in the plurality of second video frames.
2. The method of claim 1, wherein identifying a transition position in the original video comprises:
calculating a similarity between a first frame image and a second frame image that are adjacent in the original video; and
when the similarity between the first frame image and the second frame image is less than a preset threshold, determining a position of the first frame image or the second frame image in the original video as the transition position.
3. The method of claim 1, wherein the original video is a first type of video, and key positions in the original video are identified based on audio features in the original video.
4. The method according to claim 3, wherein audio content corresponding to the key positions in the original video is laughter and/or applause, and identifying the plurality of key positions in the original video comprises:
inputting the original video into an artificial intelligence (AI) model to obtain the plurality of key positions in the original video output by the AI model, wherein the AI model is trained in advance on sample videos carrying laughter labels and/or applause labels;
or matching voiceprint features of audio data in the original video against audio data corresponding to laughter and/or applause to obtain the key positions at which the voiceprint features match.
5. The method of claim 1, wherein the original video is a second type of video, and key positions in the original video are identified based on image features in the original video.
6. The method of claim 5, wherein identifying the plurality of key positions in the original video comprises:
Determining a plurality of initial key positions from the original video;
and adjusting the plurality of initial key positions by using an optical character recognition technology to obtain the plurality of key positions, so that each key position is a position at which display of a subtitle starts or ends.
7. A video editing apparatus, the apparatus comprising:
the acquisition module is used for acquiring an original video to be processed;
the position identification module is used for identifying a plurality of key positions and transition positions in the original video, wherein the key positions are used for indicating a start segmentation point and a stop segmentation point of a video segment in the original video, and the transition positions are positions where the character types in the original video are switched;
the segmentation module is used for segmenting the original video into a plurality of video segments according to the plurality of key positions and the transition positions; and
the splicing module is used for splicing the plurality of video segments to obtain a target video, wherein a playing duration of the target video is shorter than a playing duration of the original video;
the segmentation module is specifically configured to:
determining, according to the plurality of key positions in the original video, a start segmentation point and a stop segmentation point corresponding to each of a plurality of candidate video segments in the original video;
determining whether a transition position is included in a plurality of first video frames in the original video whose distance from the start segmentation point of a target candidate video segment is not more than a first preset distance, and whether a transition position is included in a plurality of second video frames in the original video whose distance from the stop segmentation point of the target candidate video segment is not more than a second preset distance, wherein the target candidate video segment is any one of the candidate video segments; and
when the plurality of first video frames include a transition position and/or the plurality of second video frames include a transition position, segmenting the target candidate video segment from the original video according to the transition position in the plurality of first video frames and/or the transition position in the plurality of second video frames.
8. A computing device, comprising a processor and a memory, wherein:
The memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1-6 according to the computer program.
9. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for executing the method of any one of claims 1-6.
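For readers who want a concrete picture of the transition handling recited in claims 1 and 2 above, the following Python sketch is offered purely as an illustration. The histogram-correlation similarity measure, the numeric thresholds, the frame-distance window, and the function names are assumptions of this sketch, not details fixed by the claims or the description.

import cv2

def frame_similarity(frame_a, frame_b):
    # Similarity between two frames, approximated here by the correlation of their
    # grayscale histograms (values near 1.0 mean essentially the same picture).
    def hist(f):
        g = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        h = cv2.calcHist([g], [0], None, [64], [0, 256])
        return cv2.normalize(h, h).flatten()
    return cv2.compareHist(hist(frame_a), hist(frame_b), cv2.HISTCMP_CORREL)

def find_transitions(video_path, threshold=0.6):
    # Frame indices whose similarity to the previous frame drops below the threshold
    # are treated as transition positions, in the spirit of claim 2.
    cap = cv2.VideoCapture(video_path)
    transitions, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if prev is not None and frame_similarity(prev, frame) < threshold:
            transitions.append(idx)
        prev, idx = frame, idx + 1
    cap.release()
    return transitions

def snap_split_point(split_frame, transitions, max_distance=25):
    # In the spirit of claim 1: if a transition lies within the preset distance of a
    # candidate start or stop segmentation point, move the split point onto that
    # transition; otherwise leave the candidate split point unchanged.
    nearby = [t for t in transitions if abs(t - split_frame) <= max_distance]
    return min(nearby, key=lambda t: abs(t - split_frame)) if nearby else split_frame

A production implementation would more likely decode frames in a streaming pipeline and use a dedicated shot-boundary detector, but the snapping step above conveys why aligning clip boundaries with scene changes lets the spliced segments start and end on visually complete content.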
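Similarly, the audio-based identification of claim 4 can be prototyped in many ways; the sketch below uses mean-MFCC vectors compared by cosine similarity as a crude stand-in for voiceprint matching. librosa, the 16 kHz sample rate, the one-second window, the 0.9 threshold, and the function names are all assumptions of this illustration; a real system would more likely rely on a trained laughter and applause detector, as the AI-model branch of the claim suggests.

import numpy as np
import librosa

def clip_embedding(samples, sr):
    # A very small stand-in for a "voiceprint feature": the mean MFCC vector of a clip.
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def find_audio_key_positions(audio_path, reference_paths, window_s=1.0, threshold=0.9):
    # Slide a window over the programme audio and mark the windows whose feature vector
    # is close (cosine similarity) to any laughter or applause reference clip.
    y, sr = librosa.load(audio_path, sr=16000)
    refs = [clip_embedding(librosa.load(p, sr=16000)[0], 16000) for p in reference_paths]
    hop = int(window_s * sr)
    key_positions = []
    for start in range(0, len(y) - hop, hop):
        emb = clip_embedding(y[start:start + hop], sr)
        for ref in refs:
            sim = float(np.dot(emb, ref) / (np.linalg.norm(emb) * np.linalg.norm(ref) + 1e-8))
            if sim >= threshold:
                key_positions.append(start / sr)  # key position as a time offset in seconds
                break
    return key_positions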
CN202111679091.2A 2021-12-31 2021-12-31 Video editing method, device, computing equipment and storage medium Active CN114302174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111679091.2A CN114302174B (en) 2021-12-31 2021-12-31 Video editing method, device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111679091.2A CN114302174B (en) 2021-12-31 2021-12-31 Video editing method, device, computing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114302174A (en) 2022-04-08
CN114302174B (en) 2025-02-11

Family

ID=80975654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111679091.2A Active CN114302174B (en) 2021-12-31 2021-12-31 Video editing method, device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114302174B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115334235B (en) * 2022-07-01 2024-06-04 西安诺瓦星云科技股份有限公司 Video processing method, device, terminal equipment and storage medium
CN115633218A (en) * 2022-10-20 2023-01-20 深圳市菲菲教育发展有限公司 Video editing method, storage medium and device
CN115439482B (en) * 2022-11-09 2023-04-07 荣耀终端有限公司 Transition detection method and related device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101516995B1 (en) * 2013-08-22 2015-05-15 주식회사 엘지유플러스 Context-based VOD Search System And Method of VOD Search Using the Same
US9620168B1 (en) * 2015-12-21 2017-04-11 Amazon Technologies, Inc. Cataloging video and creating video summaries
CN110493637B (en) * 2018-05-14 2022-11-18 阿里巴巴(中国)有限公司 Video splitting method and device
CN110401873A (en) * 2019-06-17 2019-11-01 北京奇艺世纪科技有限公司 Video clipping method, device, electronic equipment and computer-readable medium
CN110675371A (en) * 2019-09-05 2020-01-10 北京达佳互联信息技术有限公司 Scene switching detection method and device, electronic equipment and storage medium
CN112016427A (en) * 2020-08-21 2020-12-01 广州欢网科技有限责任公司 A kind of video strip method and device
CN113709575B (en) * 2021-04-07 2024-04-16 腾讯科技(深圳)有限公司 Video editing processing method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110519655A (en) * 2018-05-21 2019-11-29 优酷网络技术(北京)有限公司 Video clipping method and device
CN113709561A (en) * 2021-04-14 2021-11-26 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114302174A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN114302174B (en) Video editing method, device, computing equipment and storage medium
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
US11308993B2 (en) Short video synthesis method and apparatus, and device and storage medium
EP3739888A1 (en) Live stream video highlight generation method and apparatus, server, and storage medium
US11985364B2 (en) Video editing method, terminal and readable storage medium
CN111988658B (en) Video generation method and device
CN114339451B (en) Video editing method, device, computing equipment and storage medium
US11025964B2 (en) Method, apparatus, server, and storage medium for generating live broadcast video of highlight collection
US20170065889A1 (en) Identifying And Extracting Video Game Highlights Based On Audio Analysis
CN113220940B (en) Video classification method, device, electronic equipment and storage medium
CN113613065A (en) Video editing method and device, electronic equipment and storage medium
WO2022134698A1 (en) Video processing method and device
CN114143575A (en) Video editing method and device, computing equipment and storage medium
CN112733654B (en) Method and device for splitting video
CN114286171B (en) Video processing method, device, equipment and storage medium
CN105812920A (en) Media information processing method and media information processing device
CN113301386A (en) Video processing method, device, server and storage medium
CN114339423B (en) Short video generation method, device, computing equipment and computer readable storage medium
CN115239551A (en) Video enhancement method and device
CN115119014A (en) Video processing method, and training method and device of frame insertion quantity model
CN113012723B (en) Multimedia file playing method and device and electronic equipment
CN114500879A (en) Video data processing method, device, equipment and storage medium
CN117319765A (en) Video processing method, device, computing equipment and computer storage medium
CN115225962B (en) Video generation method, system, terminal equipment and medium
CN115665508A (en) Method, device, electronic device and storage medium for video abstract generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant