
CN119155484A - Intelligent video editing method based on large language model - Google Patents


Info

Publication number
CN119155484A
Authority
CN
China
Prior art keywords
matching
intelligent
shot
result
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411054282.3A
Other languages
Chinese (zh)
Other versions
CN119155484B (en)
Inventor
王彦彬
李永葆
朱宇
朱庆余
郑铎
刘焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dayang Technology Development Inc
Original Assignee
Beijing Dayang Technology Development Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dayang Technology Development Inc filed Critical Beijing Dayang Technology Development Inc
Priority to CN202411054282.3A priority Critical patent/CN119155484B/en
Publication of CN119155484A publication Critical patent/CN119155484A/en
Application granted granted Critical
Publication of CN119155484B publication Critical patent/CN119155484B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract


The present invention discloses an intelligent video editing method based on a large language model. Working from an organization's own resource library of materials, it combines AI technologies such as large language models, cross-modal engines, speech recognition and synthesis, and audio-visual language models with video and audio production. Text manuscripts from professional media content production are processed by the large language model, while video and audio materials undergo comprehensive intelligent analysis through cross-modal and speech-recognition engines. Semantic matching then performs intelligent shot matching against the cross-modal index library and the contemporaneous-sound index library respectively, so that editors can quickly adjust and revise the intelligent editing results by hand. The invention can be used for intelligent production of new-media short videos, event-report video news, secondary creation of TV programs, film and drama trailers and highlight reels, and similar program types, offering media organizations and professional content producers a new mode of video production that meets the demand for mass video content under the push-based distribution of the Internet era.

Description

Intelligent video editing method based on large language model
Technical Field
The invention relates to the technical field of video editing, in particular to an intelligent video editing method based on a large language model.
Background
With the continuous development of computer vision, speech recognition, and related technologies, the application of cross-modal large models has matured. Such models can process data of different modalities (text, images, speech, and so on), fuse and interrelate multimodal information, and open richer possibilities for artificial-intelligence applications. Large language models have evolved, through growth in model scale, diversification of application scenarios, and technical innovation, into cross-modal large models. These trends and achievements not only reflect the enormous progress made in the field of artificial intelligence, but also indicate that future large-model technology will demonstrate unique value and capability in many more fields.
In the production and creation of video and audio content, the information an image carries is often not limited to the visual: it may also involve data of other modalities such as text and speech. Completing intelligent video and audio production through cross-modal data fusion has therefore become an important research and improvement direction of the invention. As media convergence deepens, every media organization faces changes in content-distribution channels and the massive demand for video and audio content those changes create. Traditional media broadcast at fixed times through channels such as broadcast television, whereas new-media and Internet content is published anytime and anywhere, unconstrained by traditional channels and schedules. Audiences' viewing channels and habits have also changed markedly: obtaining news through the Internet has become the most common mode, and because most viewing happens in fragments of spare time, demand for short videos has risen sharply.
Against the backdrop of the rapid development of Internet new media, new demands are placed on the output and production efficiency of video and audio content. Conventional production methods struggle to meet the current requirements of media convergence, and every professional media organization urgently needs a fast, efficient production process that still guarantees content quality in the converged-media environment.
Disclosure of Invention
The invention provides an intelligent video editing method based on a large language model. Building on AI technologies such as large language models and cross-modal analysis, it automatically generates professional media content from text manuscripts and, drawing on the characteristics of audio-visual language, applies a series of processes such as duplicate-shot judgment and length optimization to the intelligently matched shots, meeting professional media's requirements for both efficiency and quality in video production.
The invention is realized as follows: the intelligent video editing method based on a large language model comprises the following steps:
Step 1: perform cross-modal analysis of the materials. On ingest, AI engines such as the cross-modal model and intelligent speech automatically perform multi-dimensional comprehensive intelligent analysis of the materials;
Step 2: select the materials or material groups required for creation from the organization's resource library;
Step 3: import, rewrite, and classify the text manuscript. The video text manuscript is imported and rewritten into the required manuscript with the large language model, and contemporaneous-sound and text types are classified and labeled;
Step 4: automatically apply different intelligent matching models for shot matching according to the classification of the manuscript text;
Step 5: adjust the intelligent shot-matching results to generate candidate shot groups. The matching result of each sentence/segment of text is stored as a shot group and sorted by similarity; a maximum number of shots per group is defined, and the shot with the highest similarity in each group is passed to the next step as the preferred result;
Step 6: generate a sequence and adjust it according to the audio-visual language model, including intelligent scene merging based on the temporal order of adjacent shots in the raw material, and analysis and processing of the matched shot results with the audio-visual language model;
Step 7: generate voice-over and subtitles, and add background music;
Step 8: complete the intelligent edit and perform a manual check so the result meets the review requirements for final release.
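As a rough illustration of the candidate grouping in step 5, the Python sketch below stores the matches for one sentence/segment as a similarity-sorted group capped at a maximum size, with the top match as the preferred shot. The dictionary field names (`shot`, `similarity`) are illustrative assumptions, not terms from the invention.

```python
def build_shot_group(matches, max_shots=10):
    """Step 5 sketch: keep every match for one sentence/segment as a candidate
    shot group, sorted by similarity (descending) and capped at max_shots.
    The first element of the group is the preferred shot passed to step 6."""
    group = sorted(matches, key=lambda m: m["similarity"], reverse=True)[:max_shots]
    preferred = group[0] if group else None
    return group, preferred
```

Editors can then page through the remaining candidates in the group when the preferred shot needs manual replacement.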
Further, the comprehensive intelligent analysis in the step 1 includes:
1.1, performing transition-frame detection on the video and splitting the continuous video material into several scene segments according to the detection results;
1.2, extracting key frames for each video scene;
1.3, performing cross-modal detection and analysis of the key frames;
1.4, performing cross-modal analysis on the key frames, generating vectors, and storing them in the index library;
1.5, performing audio contemporaneous-sound analysis, generating contemporaneous-sound indexes, and storing them in the index library.
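The transition-frame splitting of step 1.1 can be sketched as follows. The frame-difference representation and fixed threshold are simplifying assumptions (a real engine would use a learned or histogram-based detector), so treat this as an illustration of the splitting logic only.

```python
def split_scenes(frame_diffs, cut_threshold):
    """Step 1.1 sketch: declare a transition (cut) wherever the difference
    between adjacent frames exceeds cut_threshold, and split the clip into
    scene segments. frame_diffs[i] is the difference between frame i and
    frame i+1, so a clip of N frames has N-1 entries; each segment is a
    (start, end) frame range with end exclusive."""
    cuts = [i + 1 for i, d in enumerate(frame_diffs) if d > cut_threshold]
    bounds = [0] + cuts + [len(frame_diffs) + 1]
    return list(zip(bounds, bounds[1:]))
```

Each returned segment then feeds the key-frame extraction of step 1.2.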
Further, in step 4, for text content labeled as contemporaneous sound, the system performs similarity matching against the contemporaneous-sound index based on semantic understanding of the manuscript, including cases where the manuscript text and the speech-recognition transcript are not identical; this ensures intelligent matching between the written text of the manuscript and the spoken language of interviews and conversations in the material. For the text type, the system matches the text against video and audio content in the vector dimension from the cross-modal index library based on semantic understanding, forms similarity data from the comparison results, and performs intelligent shot matching by similarity.
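A minimal sketch of this vector-based semantic matching, assuming sentence and speech-segment embeddings are already produced by the large language model and cross-modal engine; the index structure (a list of dicts with a `vec` field) is a hypothetical stand-in for the contemporaneous-sound index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_contemporaneous(query_vec, sound_index):
    """Rank indexed speech segments by semantic similarity to a manuscript
    sentence's embedding. Because only vectors are compared, matching
    survives transcripts whose wording differs from the manuscript."""
    return sorted(sound_index, key=lambda seg: cosine(query_vec, seg["vec"]),
                  reverse=True)
```

This is why written manuscript text can still find its spoken counterpart: the comparison happens in embedding space, not on the literal transcript.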
Further, the intelligent scene-merging method in step 6, based on the temporal order of adjacent shots in the original material, is as follows:
each shot-matching result contains the original-material ID (ClipID), the IN point, and the OUT point;
let the matched-shot results for consecutive sentences/segments of contemporaneous-sound text be C0, C1, C2, ..., with original-material IDs ClipID1, ClipID2, ClipID3, IN points IN1, IN2, IN3, and OUT points OUT1, OUT2, OUT3 respectively.
First, the material information of the first pair of shots, C1 and C0, is compared, checking whether their original-material IDs are the same;
if ClipID2 differs from ClipID1, the two matching results come from different materials, no scene merging is needed, and the next pair of materials, C2 and C1, is compared;
if ClipID2 equals ClipID1, the continuity of the two shots is also compared: the material IN point IN2 of shot C1 is compared with the material OUT point OUT1 of shot C0;
if IN2 − OUT1 < t, where t is a system-predefined value, the second matched shot is temporally continuous with the first, and scene merging is performed;
if IN2 − OUT1 ≥ t, the second matched shot is not temporally continuous with the first, and no scene merging is performed;
and so on, until the last contemporaneous-sound matched shot.
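The merging rule above can be expressed compactly in Python; the dict keys (`clip_id`, `in_point`, `out_point`) are illustrative stand-ins for ClipID, IN, and OUT, and points are treated as frame numbers.

```python
def merge_scenes(shots, t=5):
    """Step 6 merging sketch: consecutive matched shots are merged when they
    come from the same source clip (equal ClipID) and are temporally
    continuous, i.e. the next shot's IN minus the previous shot's OUT is
    below the system-predefined value t."""
    if not shots:
        return []
    merged = [dict(shots[0])]
    for shot in shots[1:]:
        prev = merged[-1]
        if shot["clip_id"] == prev["clip_id"] and \
                shot["in_point"] - prev["out_point"] < t:
            prev["out_point"] = max(prev["out_point"], shot["out_point"])  # extend scene
        else:
            merged.append(dict(shot))  # different clip or gap >= t: keep separate
    return merged
```

Merging adjacent matches from the same clip in this way is what prevents a continuous interview from being chopped into visibly jumping fragments.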
Further, the method in step 6 for analyzing and processing the matched shot results with the audio-visual language model includes: judging the intelligently matched shot result for each sentence/segment of text; processing its shot length; analyzing audio-visual language elements such as foreground and background shots and camera technique, and matching them against the montage patterns in the audio-visual language model; trimming the video shot length according to the audio length of the dubbing or contemporaneous sound; and, once every sentence/segment of the manuscript has been shot-matched, arranging and splicing the matched shots in the order of the manuscript or shot script.
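The length-trimming step can be sketched as follows, under the simplifying assumption that trimming keeps the head of the shot; the patent does not specify which end is cut, so that choice is illustrative.

```python
def trim_to_audio(in_point, out_point, audio_len):
    """Step 6 trim sketch: cut a matched shot so its duration equals the
    dubbing/contemporaneous-sound duration, keeping the head of the shot.
    A shot already shorter than the audio is returned unchanged (a real
    system might instead fall back to the next candidate in the shot group)."""
    if out_point - in_point <= audio_len:
        return in_point, out_point
    return in_point, in_point + audio_len
```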
Further, in step 7, dubbing can be generated for the voice-over of the editing result by the speech recognition and synthesis engines, subtitles can be generated automatically from the text and the contemporaneous sound, the music provided by the system can be classified by emotion, and background music can optionally be added automatically to the generated intelligent editing result as required.
Further, the maximum number of shots in a shot group is less than or equal to 10.
Further, the cross-modal analysis of key frames in step 1.4 extracts the 1st frame of every 10 frames of video content as a key frame, performs vector analysis on the extracted key frames, and computes the difference between the vectors of each pair of consecutive key frames;
if the vector difference between every two consecutive key frames is smaller than a preset value, the scene is considered not to need further splitting, and the vector of each analyzed key frame is stored in the index library;
if the difference between two consecutive key frames is greater than or equal to the preset value, the 6th frame of the segment between the two key frames is added as an intermediate key frame, and its analysis-result vector is indexed.
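A sketch of this key-frame refinement, with each frame's cross-modal vector reduced to a single float so the difference test stays visible; `abs()` stands in for a real vector distance.

```python
def select_keyframes(frame_vals, threshold, stride=10):
    """Step 1.4 sketch: sample the 1st frame of every `stride` frames as a
    key frame; whenever the difference between two consecutive key frames
    reaches `threshold`, also index the 6th frame of that segment (0-based
    index a + 5) as an extra key frame."""
    keys = list(range(0, len(frame_vals), stride))
    extras = [a + 5 for a, b in zip(keys, keys[1:])
              if abs(frame_vals[b] - frame_vals[a]) >= threshold]
    return sorted(keys + extras)
```

A static scene thus yields only the sampled frames, while a fast-changing one gains midpoint key frames without the frame count exploding.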
The beneficial effects of the invention are as follows:
(1) The method is designed around the characteristics of professional media organizations, which hold large volumes of their own video and audio material and maintain their own organizational resource libraries. Performing intelligent editing on materials from the organization's resource library guarantees the authenticity, reliability, and legality of the generated content, while avoiding the copyright disputes that referencing Internet materials may cause.
(2) An industry-first intelligent editing mode is adopted: professional media manuscripts are classified, according to the characteristics of traditional video content production, into categories such as text and contemporaneous sound, and different AI models are used for intelligent matching per category, improving the accuracy of intelligent editing.
(3) Contemporaneous-sound matching is performed semantically through the large language model, ensuring the match between written and spoken language. Unlike traditional word-to-speech matching, the invention designs a dedicated semantic matching mode for contemporaneous sound: the large language model performs semantic understanding of the text content and then matches the resulting semantic vector against the audio vector of the contemporaneous sound, guaranteeing the degree of match from text to speech. Professional media manuscripts often use rather formal written language, while spoken expression is unavoidable in interviews and everyday conversation; semantic matching solves the problem of matching written words to spoken speech, can intelligently judge the continuity of adjacent sentences, and minimizes picture jumps in the text-matching results.
(4) An intelligent merging algorithm applied to the contemporaneous-sound matching results performs intelligent scene merging on the matched shots, effectively avoiding problems such as picture jumps and discontinuity caused by matching sound sentence by sentence.
(5) Intelligent video and audio shot matching is performed on the basis of the large language model and the cross-modal engine, fused with the audio-visual language model; the transitions between shots are handled intelligently, repeated shots within the same program are avoided, and secondary processing can be applied according to shot duration, shot scale, scene, and other information, following audiences' viewing habits, to form the final intelligent editing result.
(6) While the editing result is generated intelligently by AI, a group of candidate shots closely matching each sentence/segment of text is generated automatically, so that editors can quickly adjust and revise the intelligent editing result by hand.
(7) The core large language model and cross-modal model adopted by the intelligent editing system support local private deployment, so the original material content never leaves the premises during intelligent video production and creation, ensuring data security.
The method can be widely applied to intelligent production of new-media short videos, event-report video news, secondary creation of TV programs, film and drama trailers, highlight reels, and other program types; through AI technology it offers every media organization and professional content producer a brand-new mode of video production that meets the demand for mass video content under the push-based distribution of the Internet era. It saves editors the time spent browsing materials, selecting the required shots from them, and pulling quotes from interview footage for the piece; by means of AI dubbing it removes the professional dubbing step and greatly improves the production efficiency of event-report content. For secondary creation of finished programs, the invention can analyze them intelligently, select the points of interest suited to an Internet platform, extract and repurpose them, and generate new short-video manuscripts or scripts, intelligently producing short-video versions around new points of interest to meet the new demands of creating and pushing content for different audience groups.
Drawings
FIG. 1 is a flow chart of the method steps of the present invention;
FIG. 2 is a flow chart of the steps of the cross-modal analysis method for images of the present invention;
Fig. 3 is a schematic diagram of video frame grouping in accordance with the present invention.
Detailed Description
Term interpretation:
Cross-modality (cross-modal) refers to a process of extracting information from different data modalities and fusing it interactively. This includes extracting features from various forms of data, such as text, audio, images, and video, and using these features for information retrieval, understanding, or generation. Cross-modal interactive fusion aims at more effective data analysis and understanding through joint feature extraction and cross-modal association. For example, by identifying and exploiting the inherent links between data of different modalities, such as the relationship between image content and its corresponding textual description, asymmetric data can be processed, i.e., data of one modality may be richer or more detailed than that of another.
Based on AI technologies such as large language models and cross-modal analysis, the invention automatically generates professional media content from text manuscripts and, combining the characteristics of audio-visual language, applies processes such as duplicate-shot judgment and length optimization to the intelligently matched shots, meeting professional media's requirements for efficiency and quality in video production.
An intelligent video editing method based on a large language model, as shown in Figs. 1 and 2, comprises the following steps:
Step 1: perform cross-modal analysis of the materials. On ingest, AI engines such as the cross-modal model and intelligent speech automatically perform multi-dimensional comprehensive intelligent analysis of the materials.
In order to improve the efficiency of cross-modal intelligent analysis of materials and ensure its accuracy, this embodiment improves the cross-modal analysis method for images; the comprehensive intelligent analysis steps, shown in Fig. 2, include:
performing transition-frame detection on the video and splitting the continuous video material into several scene segments according to the detection results;
extracting key frames for each video scene;
performing cross-modal detection and analysis of the key frames;
performing cross-modal analysis on the key frames, generating vectors, and storing them in the index library;
performing audio contemporaneous-sound analysis, generating contemporaneous-sound indexes, and storing them in the index library;
completing the material analysis.
The video and audio materials are first transcoded to generate a low-bitrate proxy, which serves as the basis for cross-modal analysis and ensures the efficiency of intelligent material analysis. Transition-frame detection is performed on the video, the continuous material is split into several scene segments according to the detection results, and key frames are extracted for each scene; key frames are also extracted from the picture frames other than the first frame found by transition-frame detection, so that the decomposition results miss no key information. Cross-modal analysis is then performed on the key frames, and the resulting vectors are stored in the index library.
The invention designs a method that detects key-frame content effectively while keeping the number of key frames from growing unnecessarily. First, the 1st frame of every 10 frames of video content is extracted as a key frame, vector analysis is performed on the extracted key frames, and the difference between the vectors of each pair of consecutive key frames is computed. If the vector difference between every two consecutive key frames is smaller than a preset value, the scene is considered not to need further splitting, and the vector of each analyzed key frame is stored in the index library. If the difference between two consecutive key frames is greater than or equal to the preset value, an intermediate frame between them, the 6th frame of the segment, is added as a key frame. Video key frames are first identified intelligently by an AI algorithm; the video frames between two intelligently identified key frames are grouped in tens, cross-modal vector analysis is performed on the first frame of each group, and the analysis-result vectors are indexed.
This method analyzes the key content of each picture in a continuously changing scene, avoiding the loss of key information from continuous pictures that could result from analyzing only the key frames found by transition-frame detection.
Step 2: select the materials or material groups required for creation from the organization's resource library. When intelligent video-editing creation begins, the related materials are selected from the library; one or more materials may be selected for intelligent editing of the video and audio content. Selecting material from the organization's resource library guarantees the authenticity and reliability of its source and the legality of its copyright. Because intelligent editing is performed against the organization's local or cloud resource library, the method effectively avoids the copyright disputes and the authenticity and reliability checks that come with capturing materials from the Internet.
Step 3: import, rewrite, and classify the text manuscript. The video text manuscript is imported directly and rewritten with the large language model. A general-purpose text manuscript is not necessarily suited to expression in the language of video shots; with the large language model, semantic understanding of the original manuscript allows it to be rewritten into a video shot script, generating content such as picture descriptions and voice-over narration. In the technical scheme of the invention, the text content of the shot script can then be classified and labeled.
For [contemporaneous sound], the method of the present invention provides two labeling modes: continuous and single-sentence. Continuous mode suits a whole interview or dialogue; single-sentence mode suits precisely selecting a complete sentence or phrase from an interview or dialogue. Text not labeled [contemporaneous sound] is processed by default under the [text] classification.
Step 4: automatically apply different intelligent matching models for shot matching according to the classification of the manuscript text:
Different intelligent shot-matching methods are applied to the portions labeled [contemporaneous sound] and [text] in the previous step. For text content labeled [contemporaneous sound], the system performs similarity matching against the contemporaneous-sound index based on semantic understanding of the manuscript; even when the manuscript text and the speech-recognition transcript are not identical, matching proceeds on the result of semantic understanding, ensuring intelligent matching between the written text of the manuscript and the spoken language of interviews and conversations in the material.
For the lens result of the synchronous sound mode matching, when the lens segment is finally used, the method of the invention uses the original picture and sound corresponding to the lens to enhance the field feel of the final result.
For the section [ text ], the system performs semantic understanding based on characters, matches the characters with video and audio contents in vector dimension from a cross-modal index library, forms similarity data according to a matching comparison result, and performs intelligent matching of shots according to the similarity. For the result of text pattern matching, when the shot segment is finally used, only the picture part corresponding to the shot is used, and the sound part is replaced by the audio content of subsequent voice synthesis.
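The text-mode matching described above can be sketched as a vector-similarity lookup. This is a minimal illustration, not the patented implementation: the embeddings below are stand-in vectors, whereas the described system would obtain them from a cross-modal encoder over key frames and manuscript text.

```python
# Sketch: match a script sentence against a cross-modal index by cosine
# similarity in the vector dimension. Vectors here are toy stand-ins.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_shots(script_vec, index):
    """index: list of (shot_id, shot_vec). Returns (shot_id, similarity)
    pairs sorted best-first, forming the similarity data for step 4."""
    scored = [(shot_id, cosine(script_vec, vec)) for shot_id, vec in index]
    return sorted(scored, key=lambda p: p[1], reverse=True)

# Toy index: shot "S2" is semantically closest to the script sentence.
index = [("S1", [0.1, 0.9, 0.0]), ("S2", [0.8, 0.2, 0.1]), ("S3", [0.4, 0.4, 0.4])]
ranked = match_shots([0.9, 0.1, 0.0], index)
print(ranked[0][0])  # best-matching shot id
```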
Step 5, adjusting the intelligent shot-matching results to generate matching candidate shot groups. During the intelligent shot matching of the previous step, whether in [contemporaneous sound] mode or [text] mode, the matching results for each sentence/segment of text are stored as a shot group and sorted by similarity. A maximum number of shots per group is defined (a setting of 10 is usually recommended), and the shot with the highest matching similarity in each group is provided to the next step as the preferred result.
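The candidate-group step above can be sketched as a sort-and-truncate over the per-sentence matches; the group-size cap of 10 and the "number 01" preferred shot follow the text, while the data is illustrative:

```python
# Sketch: build a matching candidate shot group for one sentence/segment.
def build_candidate_group(matches, max_shots=10):
    """matches: list of (shot_id, similarity). Returns the group sorted
    best-first and capped at max_shots; element 0 is shot "01"."""
    return sorted(matches, key=lambda m: m[1], reverse=True)[:max_shots]

matches = [("A", 0.71), ("B", 0.93), ("C", 0.55), ("D", 0.88)]
group = build_candidate_group(matches, max_shots=3)
preferred = group[0]          # the shot with the highest similarity
print([s for s, _ in group])
```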
In general, the invention provides the shot with the highest matching similarity in each group (i.e. the shot numbered 01) as the preferred result for the next stage of processing.
Step 6, generating a sequence and adjusting it according to the audio-visual language model. This includes intelligent scene merging, based on the temporal order of adjacent shots within the raw material in the shot-matching results, and analysis and processing of the matched shot results using the audio-visual language model.
In the invention, beyond intelligent shot matching and recommendation for each sentence/segment of text, the single-sentence matching results of the previous step are further corrected and adjusted according to the content of the whole manuscript or the storyboard script.
When intelligent shot matching is performed in [contemporaneous sound] mode, text-speech matching based on semantic understanding is used, and post-processing of the shot-matching results is performed for contemporaneous-sound manuscripts labeled as continuous. For multiple consecutive sentences/segments of text, scenes are intelligently merged according to the temporal order of adjacent shots within the original material in the shot-matching results.
The specific method is as follows:
each shot-matching result contains information such as the original material id ClipID, the IN point IN, and the OUT point OUT;
the matched shot results for consecutive sentences/segments of contemporaneous-sound text are C0, C1, C2, ...; the corresponding original material ids are ClipID1, ClipID2, ClipID3, ...; the corresponding original-material IN points are IN1, IN2, IN3, ...; and the corresponding OUT points are OUT1, OUT2, OUT3, ....
First, the material information of the first pair of shots, C1 and C0, is compared:
compare whether the original-material IDs corresponding to the two shots are the same;
if ClipID2 differs from ClipID1, the two shot-matching results come from different materials, no scene merging is needed, and the next pair of materials, C2 and C1, is compared;
if ClipID2 is the same as ClipID1, the continuity of the two shots is further compared: the material IN point IN2 of shot C1 is compared with the material OUT point OUT1 of shot C0;
if IN2 - OUT1 < t, where t is a system-predefined value, the second matching shot result is temporally continuous with the first, and scene merging is performed;
if IN2 - OUT1 ≥ t, where t is a system-predefined value, the second matching shot result is not temporally continuous with the first, and no scene merging is performed;
and so on, until the last contemporaneous-sound matching shot result.
When a [contemporaneous sound] label in the manuscript ends, the comparison ends automatically; no scene-merging analysis is performed across different [contemporaneous sound] paragraphs.
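A minimal sketch of the merging rule described above, under the assumption that IN/OUT points are frame numbers within the source clip and t is the predefined gap threshold:

```python
# Sketch: merge consecutive contemporaneous-sound matches that come from the
# same source clip (same ClipID) and are temporally continuous (IN - OUT < t).
def merge_scenes(shots, t=5):
    """shots: list of dicts with 'clip_id', 'in', 'out', in script order.
    Returns the merged shot list."""
    if not shots:
        return []
    merged = [dict(shots[0])]
    for shot in shots[1:]:
        prev = merged[-1]
        same_clip = shot["clip_id"] == prev["clip_id"]
        contiguous = shot["in"] - prev["out"] < t
        if same_clip and contiguous:
            prev["out"] = max(prev["out"], shot["out"])  # extend the scene
        else:
            merged.append(dict(shot))
    return merged

shots = [
    {"clip_id": "M1", "in": 0,   "out": 100},
    {"clip_id": "M1", "in": 103, "out": 180},  # gap of 3 < t: merged
    {"clip_id": "M2", "in": 50,  "out": 90},   # different clip: kept separate
]
print(merge_scenes(shots))
```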
When intelligent shot matching is performed in [text] mode, the intelligent analysis system can calculate and recommend shot-matching similarity for each sentence/segment of text; however, video and audio editing does not treat sentences, fragments, or shots in isolation, and they must be considered as a whole in context.
When generating the video clip sequence, the invention innovatively introduces an audio-visual language model to analyze and process the matched shot results. First, the intelligent matching result for each sentence/segment of text is checked: starting from the second sentence/segment, the source material of the first shot in the current matching candidate shot group is checked for repetition against the preceding matching results. If there is repetition, the next candidate shot in the group is used, until a shot that does not repeat the preceding results is found. If every candidate in the group corresponding to the sentence/segment repeats a preceding shot, the matching shot for that sentence/segment is left blank, to be filled in manually with a replacement shot later.
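The repetition check described above might look like the following sketch, where shots are identified by their source-material id and a sentence whose entire candidate group repeats the preamble is left as None for manual fill-in:

```python
# Sketch: walk the sentences in order and pick the first candidate shot whose
# source material has not already been used by a preceding sentence.
def pick_non_repeating(candidate_groups):
    """candidate_groups: one list of shot ids per sentence, best-first.
    Returns one pick per sentence, or None when all candidates repeat."""
    used, picks = set(), []
    for group in candidate_groups:
        choice = next((shot for shot in group if shot not in used), None)
        if choice is not None:
            used.add(choice)
        picks.append(choice)
    return picks

groups = [["S1", "S2"], ["S1", "S3"], ["S3", "S1"]]
print(pick_non_repeating(groups))
```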
Second, the shot lengths of the intelligent matching results are processed: a shot whose duration is too short is extended forward or backward until it meets the preset minimum shot length.
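The length fix above can be sketched as extending a short shot within its source clip; the backward-first order and frame-based units are assumptions for illustration:

```python
# Sketch: extend a too-short shot inside its clip until it reaches min_len.
def extend_shot(in_pt, out_pt, clip_start, clip_end, min_len):
    length = out_pt - in_pt
    if length >= min_len:
        return in_pt, out_pt
    deficit = min_len - length
    # extend backward first, then forward with whatever deficit remains
    back = min(deficit, in_pt - clip_start)
    in_pt -= back
    out_pt = min(out_pt + (deficit - back), clip_end)
    return in_pt, out_pt

print(extend_shot(10, 20, 0, 500, 50))  # too short: extended to 50 frames
```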
Third, the audio-visual language of the shots (foreground and background composition, camera technique, and so on) is analyzed and matched against the montage sentence patterns in the audio-visual language model. If the analysis results of adjacent shots match any of the montage patterns, they are retained; otherwise the current shot is replaced from the matching candidate shot group until it conforms to an audio-visual language pattern.
Finally, the lengths of the video shots are fine-tuned according to the audio length of the dubbing or contemporaneous sound.
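The final trim can be sketched as nudging each shot's OUT point so its duration equals the narration audio; treating durations as frames and clamping to the clip end are assumptions:

```python
# Sketch: fit a video shot's duration to its dubbing/contemporaneous audio.
def fit_to_audio(in_pt, out_pt, audio_dur, clip_end):
    target_out = in_pt + audio_dur
    return in_pt, min(target_out, clip_end)

print(fit_to_audio(100, 160, 48, 1000))  # shot shortened to the audio length
```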
After shot matching has been performed for every sentence/segment of text in the manuscript, the shot-matching results are arranged and spliced in the order of the manuscript or storyboard script.
Step 7, generating dubbing and subtitles, and adding a musical score. In the invention, during intelligent editing, dubbing can be generated for the voiceover portion of the editing result through speech-recognition and speech-synthesis engines; subtitles can be generated automatically for the text and the contemporaneous sound; the music provided by the system can be classified by emotion; and a score with the corresponding emotion can be selected and added automatically according to the generated intelligent editing result.
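The score-selection step might be sketched as a lookup over emotion-tagged library tracks; the track names and tags below are hypothetical, not from the patent:

```python
# Sketch: pick a library track whose pre-classified emotion tag matches the
# emotion detected for the edited result.
library = [
    {"track": "dawn_theme", "emotion": "uplifting"},
    {"track": "city_rush",  "emotion": "tense"},
    {"track": "open_field", "emotion": "calm"},
]

def pick_score(target_emotion, tracks):
    """Returns the first matching track name, or None if no tag matches."""
    return next((t["track"] for t in tracks if t["emotion"] == target_emotion), None)

print(pick_score("calm", library))
```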
Step 8, completing the intelligent edit and performing manual proofreading to meet the review requirements for final release. After the above series of intelligent processing, the intelligent-editing timeline result of the method is generated; after the system has generated the timeline result, it can be further proofread manually to meet the review requirements for final release.
Finally, it should be noted that the above merely illustrates, and does not limit, the technical solution of the present invention. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or equivalently substituted (for example, changes in category names or quantities, or in the sequence of steps) without departing from the spirit and scope of the technical solution of the present invention.

Claims (8)

1. An intelligent video editing method based on a large language model, characterized in that the method comprises the steps of:
Step 1, performing cross-modal analysis on the materials: when materials are put into storage, AI engines such as cross-modal models and intelligent speech automatically perform multi-dimensional comprehensive intelligent analysis on them;
step 2, selecting materials or material groups required by creation from a resource library;
Step 3, importing a video manuscript, rewriting it into the required manuscript using a large language model, and classifying and labeling the contemporaneous-sound and text types;
Step 4, automatically applying different intelligent matching models for shot matching according to the different classifications of the text manuscript;
Step 5, adjusting the intelligent shot-matching results to generate matching candidate shot groups: the matching results for each sentence/segment of text are stored as a shot group and sorted by similarity, a maximum number of shots per group is defined, and the shot with the highest matching similarity in each group is provided to the next step as the preferred result;
Step 6, generating a sequence and adjusting it according to the audio-visual language model, including intelligent scene merging, based on the temporal order of adjacent shots within the raw material in the shot-matching results, and analysis and processing of the matched shot results using the audio-visual language model;
Step 7, generating dubbing and subtitles, and adding a musical score;
Step 8, completing the intelligent edit and performing manual proofreading to meet the review requirements for final release.
2. The intelligent video editing method based on large language model according to claim 1, wherein the integrated intelligent analysis in step 1 comprises:
1.1, transcoding the video and audio materials to generate a low-bitrate proxy serving as the basis of cross-modal analysis;
1.2, performing transition-frame detection on the video and splitting continuous video material into a plurality of scene fragments according to the detection results;
1.3, extracting key frames for each video scene;
1.4, performing cross-modal analysis on the key frames, generating vectors, and storing them in the index library;
1.5, performing contemporaneous-sound analysis on the audio, generating a contemporaneous-sound index, and storing it in the index library.
3. The intelligent video editing method based on the large language model according to claim 1, characterized in that in step 4, for text content labeled as contemporaneous sound, the system performs similarity matching against the contemporaneous-sound index based on semantic understanding of the text manuscript, so that matching succeeds even when the text manuscript and the speech-recognition transcript are not identical, ensuring intelligent matching between the written text in the manuscript and the spoken language of material interviews and dialogues; for the text type, the system performs semantic understanding of the text, matches it against the video and audio content of the cross-modal index library in the vector dimension, forms similarity data from the comparison results, and performs intelligent shot matching according to the similarity.
4. The intelligent video editing method based on the large language model according to claim 1, characterized in that in step 6, the intelligent scene merging according to the temporal order of adjacent shots within the original material in the shot-matching results is performed as follows:
each shot-matching result contains information such as the original material id ClipID, the IN point IN, and the OUT point OUT;
the matched shot results for consecutive sentences/segments of contemporaneous-sound text are C0, C1, C2, ...; the corresponding original material ids are ClipID1, ClipID2, ClipID3, ...; the corresponding original-material IN points are IN1, IN2, IN3, ...; and the corresponding OUT points are OUT1, OUT2, OUT3, ...;
first, the material information of the first pair of shots, C1 and C0, is compared to determine whether the original-material IDs corresponding to the two shots are the same;
if ClipID2 differs from ClipID1, the two shot-matching results come from different materials, no scene merging is needed, and the next pair of materials, C2 and C1, is compared;
if ClipID2 is the same as ClipID1, the continuity of the two shots is further compared: the material IN point IN2 of shot C1 is compared with the material OUT point OUT1 of shot C0;
if IN2 - OUT1 < t, where t is a system-predefined value, the second matching shot result is temporally continuous with the first, and scene merging is performed;
if IN2 - OUT1 ≥ t, where t is a system-predefined value, the second matching shot result is not temporally continuous with the first, and no scene merging is performed;
and so on, until the last contemporaneous-sound matching shot result.
5. The intelligent video editing method based on the large language model according to claim 1, characterized in that in step 6, the analysis and processing of the matched shot results using the audio-visual language model comprises: checking the intelligent matching shot result for each sentence/segment of text; processing the shot lengths of the intelligent matching results; analyzing the audio-visual language of the shots, such as foreground and background composition and camera technique, and matching it against the montage sentence patterns in the audio-visual language model; fine-tuning the lengths of the video shots according to the audio length of the dubbing or contemporaneous sound; and, after shot matching has been performed for every sentence/segment of text in the manuscript, arranging and splicing the shot-matching results in the order of the manuscript or storyboard script.
6. The intelligent video editing method based on the large language model according to claim 1, characterized in that in step 7, dubbing is generated for the voiceover of the editing result through speech-recognition and speech-synthesis engines, subtitles are automatically generated for the text and the contemporaneous sound, the music provided by the system can be classified by emotion, and a score is selected and added to the intelligent editing result as required.
7. The intelligent video editing method based on the large language model according to claim 1, wherein the maximum number of shots in a shot group is 10 or less.
8. The intelligent video editing method based on the large language model according to claim 2, characterized in that the cross-modal analysis of key frames in step 1.4 extracts the 1st frame of every 10 frames of video content as a key frame, performs vector analysis on the extracted key frames, and computes the difference between the vectors of each pair of consecutive key frames;
if the vector difference Δ between each pair of consecutive key frames is smaller than a preset value, the scene is considered not to require further splitting, and the vector of each analyzed key frame is stored in the index library;
if the difference between a pair of consecutive key frames is greater than or equal to the preset value, the 6th frame of the segment (the middle frame between the two key frames) is added as a key frame, and the analysis result vector is indexed.
CN202411054282.3A 2024-08-02 2024-08-02 An intelligent video editing method based on large language model Active CN119155484B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411054282.3A CN119155484B (en) 2024-08-02 2024-08-02 An intelligent video editing method based on large language model


Publications (2)

Publication Number Publication Date
CN119155484A true CN119155484A (en) 2024-12-17
CN119155484B CN119155484B (en) 2025-09-02

Family

ID=93806688


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030160944A1 (en) * 2002-02-28 2003-08-28 Jonathan Foote Method for automatically producing music videos
CN113825012A (en) * 2021-06-04 2021-12-21 腾讯科技(深圳)有限公司 Video data processing method and computer device
WO2023040520A1 (en) * 2021-09-17 2023-03-23 腾讯科技(深圳)有限公司 Method and apparatus for performing music matching of video, and computer device and storage medium
CN116847123A (en) * 2023-08-01 2023-10-03 南拳互娱(武汉)文化传媒有限公司 Video later editing and video synthesis optimization method
WO2023184636A1 (en) * 2022-03-29 2023-10-05 平安科技(深圳)有限公司 Automatic video editing method and system, and terminal and storage medium
CN117082304A (en) * 2023-08-16 2023-11-17 北京达佳互联信息技术有限公司 Video generation method, device, computer equipment and storage medium
CN117336559A (en) * 2023-08-09 2024-01-02 东南大学 A live broadcast intelligent editing method based on large language model
CN117435769A (en) * 2023-10-24 2024-01-23 百度时代网络技术(北京)有限公司 Video generation and arrangement model acquisition method, device, equipment and storage medium
CN117880443A (en) * 2023-12-26 2024-04-12 成都索贝数码科技股份有限公司 Script-based multi-mode feature matching video editing method and system
CN118400575A (en) * 2024-06-24 2024-07-26 湖南快乐阳光互动娱乐传媒有限公司 Video processing method and related device




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant