Intelligent video editing method based on large language model
Technical Field
The invention relates to the technical field of video editing, in particular to an intelligent video editing method based on a large language model.
Background
With the continuous development of computer vision, speech recognition and other technologies, cross-modal large models have matured in practical application. Such a model can process data of different modalities (such as text, images and speech), realize fusion and interaction of multi-modal information, and provide richer possibilities for artificial intelligence applications. Large language models have evolved, through growth in model scale, diversification of application scenarios and advanced technical innovation, into cross-modal large models; these trends and achievements not only reflect the great progress made in the artificial intelligence field, but also indicate that future large-model technology will show unique value and capability in more fields.
In the production and creation of video and audio content, the information contained in an image is often not limited to visual information, but may also involve data of other modalities, such as text and speech. Completing intelligent production of video and audio content through cross-modal data fusion has therefore become an important research and improvement direction of the invention. With the deepening of media convergence, every media organization faces changes in content distribution channels and the resulting massive demand for video and audio content. Traditional media organizations publish at fixed times through channels such as broadcast television, whereas new-media and internet content is published anytime and anywhere, unconstrained by traditional broadcast channels and broadcast times. Audiences' viewing channels and viewing habits have also changed significantly: obtaining news through the internet has become the dominant mode, and because most viewers watch during fragmented time, demand for short videos has also increased markedly.
Against the background of the rapid development of internet new media, new requirements are placed on the output and production efficiency of video and audio content. The conventional production mode struggles to meet the current demands of converged-media development, and professional media institutions urgently need a fast, efficient production process that guarantees content quality in the converged-media environment.
Disclosure of Invention
The invention provides an intelligent video editing method based on a large language model. Based on AI technologies such as the large language model and cross-modal analysis, it realizes automatic generation of professional media content from text manuscripts and, combining the characteristics of audio-visual language, performs a series of processes such as duplicate-shot checking and length optimization on the intelligently matched shots, meeting the requirements of professional media for efficiency and quality in video production.
The invention is realized as follows: the intelligent video editing method based on the large language model comprises the following steps:
Step 1, performing cross-modal analysis on the materials, namely automatically performing multi-dimensional comprehensive intelligent analysis on the materials with AI engines such as cross-modal models and intelligent speech when the materials are put into the repository;
Step 2, selecting the materials or material groups required for creation from the organization resource library;
Step 3, importing the text manuscript, rewriting and classifying it, namely importing the video text manuscript, using a large language model to rewrite it into the required text manuscript, and classifying and labeling contemporaneous-sound and text types;
Step 4, automatically using different intelligent matching models to carry out shot matching according to the different classifications of the text manuscript;
Step 5, adjusting the intelligent shot matching result to generate matching candidate shot groups, namely storing the matching result of each sentence/segment of text as a shot group, sorting by similarity, defining the maximum number of shots in a shot group, and providing the shot with the highest matching similarity in each group as the preferred result for the next step;
Step 6, generating a sequence and adjusting it according to the audio-visual language model, which includes intelligent scene merging according to the time order of preceding and following shots in the original material within the shot matching result, and analysis and processing of the matched shot results using the audio-visual language model;
Step 7, generating dubbing and subtitles, and adding a music score;
Step 8, completing the intelligent edit and performing manual proofreading to meet the final publishing review requirements.
Further, the comprehensive intelligent analysis in step 1 includes:
1.1, performing transition-frame detection on the video, and splitting the continuous video material into a plurality of scene segments according to the detection result;
1.2, extracting key frames for each video scene;
1.3, performing cross-modal detection and analysis on the key frames;
1.4, performing cross-modal analysis on the key frames, generating vectors and storing them in an index library;
1.5, performing audio contemporaneous-sound analysis, generating contemporaneous-sound indexes and storing them in the index library.
Further, in step 4, for text content marked as contemporaneous sound, the system performs similarity matching against the contemporaneous-sound indexes based on semantic understanding of the text manuscript; this covers the case where the manuscript text and the speech-recognition text are not identical, so as to ensure intelligent matching between the written text in the manuscript and the spoken language in material interviews and conversations. For the text type, the system matches the text with video and audio content in the vector dimension from the cross-modal index library based on semantic understanding of the text, forms similarity data according to the matching comparison result, and performs intelligent shot matching according to the similarity.
Further, the intelligent scene merging in step 6, performed according to the time order of preceding and following shots in the original material within the shot matching result, is as follows:
each shot matching result contains information such as the original material id ClipID, an IN point IN and an OUT point OUT;
the matching shot results of consecutive sentences/segments of contemporaneous-sound text are denoted C0, C1, C2, ..., with corresponding original material ids ClipID1, ClipID2, ClipID3, ..., corresponding original material in-points IN1, IN2, IN3, ..., and corresponding original material out-points OUT1, OUT2, OUT3, ...;
first, the material information of the first pair of shots C1 and C0 is compared, checking whether the original material IDs corresponding to the two shots are the same;
if ClipID2 differs from ClipID1, the two shot matching results come from different materials, no scene merging is needed, and the next pair of materials C2 and C1 is compared;
if ClipID2 is the same as ClipID1, the continuity of the two shots is further compared, i.e. the material in-point IN2 of shot C1 is compared with the material out-point OUT1 of shot C0;
if IN2 − OUT1 < t, where t is a system-predefined value, the second matching shot result and the first matching shot result are continuous in time, and scene merging is performed;
if IN2 − OUT1 ≥ t, the second matching shot result and the first matching shot result are not continuous in time, and scene merging is not performed;
and so on, until the last contemporaneous-sound matching shot result is processed.
Further, the method of analyzing and processing the matched shot results using the audio-visual language model in step 6 includes: judging the intelligently matched shot result of each sentence/segment of text; processing the shot length of the intelligently matched shot result; analyzing audio-visual language attributes such as foreground and background shots and shooting technique, and matching them against the montage sentence patterns in the audio-visual language model; fine-tuning the length of the video shot according to the audio length of the dubbing or contemporaneous sound; and, after shot matching has been performed for each sentence/segment of text in the manuscript, arranging and splicing the shot matching results according to the order of the manuscript or the shot script.
Further, in step 7, dubbing can be generated for the narration of the editing result by a speech-recognition and speech-synthesis engine, subtitles can be automatically generated for the text and contemporaneous sound, the music provided by the system can be classified by emotion, and music can be automatically added to the generated intelligent editing result as required.
Further, the maximum number of shots is less than or equal to 10.
Further, the cross-modal analysis of key frames in step 1.4 is performed by extracting the 1st frame of every 10 frames of video content as a key frame, performing vector analysis on the extracted key frames, and calculating the difference between the vectors of each pair of preceding and following key frames;
if the vector difference between two consecutive key frames is smaller than a preset value, the scene is considered to need no further splitting, and the vector of each analyzed key frame is stored in the index library;
if the difference between two consecutive key frames is greater than or equal to the preset value, the 6th frame of the intermediate segment between the two key frames is added as a key frame, and the analysis result vector is indexed.
The beneficial effects of the invention are as follows:
(1) The method is designed according to the characteristics of professional media institutions, which hold a large quantity of their own video and audio material resources and establish their own organization resource libraries. Performing intelligent editing based on the organization resource library materials ensures the authenticity, reliability and legality of the generated content, while avoiding the copyright disputes that may be caused by referencing internet materials.
(2) An intelligent editing mode pioneered in the industry is adopted: professional media manuscripts are divided into different classifications such as text and contemporaneous sound according to the characteristics of traditional video content production, and different AI models are used for intelligent matching according to these classifications, improving the accuracy of intelligent editing.
(3) Semantic matching of contemporaneous sound is performed through the large language model, ensuring the degree of matching between written language and spoken language. The invention designs a special semantic matching mode for contemporaneous sound that differs from the traditional text-to-speech matching mode: the large language model performs semantic understanding of the text content, and the text is then matched against the contemporaneous-sound audio vectors according to the semantic-understanding vector, ensuring the matching dimension from text to speech. For a professional media organization, manuscripts often use relatively formal written language, while spoken expressions are unavoidable in interviews and everyday conversations; the semantic matching mode solves the problem of matching written text to spoken speech, the continuity of preceding and following sentences can be judged intelligently, and picture jumps are reduced to the greatest extent in the text-matching results.
(4) By applying an intelligent merging algorithm to the contemporaneous-sound matching results, intelligent scene merging is performed on the contemporaneous-sound intelligently matched shot results, which effectively avoids problems such as picture jumps and discontinuity caused by matching sound to each piece of text independently.
(5) Intelligent video and audio shot matching is performed on the basis of the large language model and the cross-modal engine, the audio-visual language model is fused in, the connection between preceding and following shots is handled intelligently, repeated shots within the same program are avoided, and secondary processing can be performed according to information such as shot duration, shot scale and scene, in line with audience viewing habits, to form the final intelligent editing result.
(6) While the editing result is intelligently generated by AI, a series of matching candidate shot groups with high matching degree to each sentence/segment of text are automatically generated, so that editors can quickly adjust and modify the intelligent editing result manually.
(7) The core large language model and cross-modal model adopted by the intelligent editing system support local, private deployment, so that the original material content does not flow outwards during intelligent video production and creation, ensuring data security.
The method can be widely applied to intelligent production of programs such as new-media short videos, event-report video news, secondary creation of television programs, and film and drama highlights and compilations. Through the application of AI intelligent technology, it provides a brand-new video production mode for media organizations and professional content producers, meeting the production requirements of massive video content under the video push system of the internet era. The method saves editors the time of browsing materials, selecting the required shots from them, and extracting quotes from interview material and recorded footage, and with AI dubbing it removes the need for professional dubbing, greatly improving the production efficiency of event-report content. For secondary creation of finished programs, the invention can intelligently analyze a finished program, select the points of interest in it that suit an internet platform, extract and repurpose them, and generate a new short-video manuscript or script. Short-video versions based on the new points of interest are then generated intelligently, meeting new requirements for creating and pushing content to different audience groups.
Drawings
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a flow chart of the steps of the cross-modal analysis method for images in the present invention;
FIG. 3 is a schematic diagram of video frame grouping in the present invention.
Detailed Description
Term interpretation:
Cross-modal: a process that involves extracting information from different data modalities and performing interactive fusion. This includes extracting features from various forms of data, such as text, audio, images and video, and using these features for information retrieval, understanding or generation. Cross-modal interactive fusion aims to realize more effective data analysis and understanding through joint feature extraction and cross-modal association. For example, by identifying and utilizing the inherent links between data of different modalities, such as the relationship between image content and the corresponding textual description, asymmetric data can be processed, i.e., data of one modality may be richer or more detailed than data of another modality.
Based on AI technologies such as the large language model and cross-modal analysis, the invention realizes automatic generation of professional media content from text manuscripts and, combining the characteristics of audio-visual language, performs a series of processes such as duplicate-shot checking and length optimization on the intelligently matched shots, meeting the requirements of professional media for efficiency and quality in video production.
A large language model-based intelligent video editing method, as shown in figs. 1 and 2, comprises the following steps:
Step 1, performing cross-modal analysis on the materials, namely automatically performing multi-dimensional comprehensive intelligent analysis on the materials with AI engines such as cross-modal models and intelligent speech when the materials are put into the repository.
In order to improve the efficiency of cross-modal intelligent analysis of materials and ensure the accuracy of the analysis, this embodiment improves the cross-modal analysis method for images; the comprehensive intelligent analysis steps, as shown in fig. 2, include:
performing transition-frame detection on the video, and splitting the continuous video material into a plurality of scene segments according to the detection result;
extracting key frames for each video scene;
performing cross-modal detection and analysis on the key frames;
performing cross-modal analysis on the key frames, generating vectors and storing them in an index library;
performing audio contemporaneous-sound analysis, generating contemporaneous-sound indexes and storing them in the index library;
completing the material analysis.
Specifically, the video and audio materials are transcoded to generate a low-bit-rate proxy that serves as the basis for cross-modal analysis, ensuring the efficiency of intelligent cross-modal material analysis. Transition-frame detection is performed on the video, and the continuous video material is split into a plurality of scene segments according to the detection result; key frames are extracted for each video scene, and key frames are also extracted from picture frames other than the first frame found by transition-frame detection, to ensure that the decomposition result does not miss key information. Cross-modal analysis is performed on the key frames, and the resulting vectors are stored in the index library.
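By way of illustration only, the following Python sketch shows one crude way the scene split described above could be organized. The mean-absolute-difference test, the threshold value and the function names are hypothetical placeholders that merely stand in for the cross-modal AI engine actually used for transition-frame detection.

```python
import numpy as np

def detect_transition_frames(frames, threshold=30.0):
    """Return indices of frames that start a new scene.

    `frames` is a list of grayscale frames as 2-D numpy arrays. The mean
    absolute pixel difference used here is only a crude placeholder for the
    transition-frame detection performed by the real AI engine."""
    cuts = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32) - frames[i - 1].astype(np.float32)).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

def split_into_scenes(frames):
    """Split a continuous material into (start_frame, end_frame) scene segments."""
    cuts = detect_transition_frames(frames)
    bounds = cuts + [len(frames)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(cuts))]

# toy usage: 100 dark frames followed by 100 bright frames -> two scene segments
frames = [np.zeros((72, 128), np.uint8)] * 100 + [np.full((72, 128), 200, np.uint8)] * 100
print(split_into_scenes(frames))   # [(0, 100), (100, 200)]
```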
The invention designs a method that can effectively detect the content of key frames while keeping the number of key frames from growing as far as possible. First, the 1st frame of every 10 frames of video content is extracted as a key frame, vector analysis is performed on the extracted key frames, and the difference between the vectors of each pair of preceding and following key frames is calculated. If the vector difference between two consecutive key frames is smaller than a preset value, the scene is considered to need no further splitting, and the vector of each analyzed key frame is stored in the index library. If the difference between two consecutive key frames is greater than or equal to the preset value, an intermediate frame between the two key frames, namely the 6th frame of the segment, is added as a key frame. In this way the video key frames are first identified intelligently based on the AI algorithm; the video frames between two intelligently identified key frames are then grouped into groups of 10 frames, cross-modal vector analysis is performed on the first frame of each group, and the analysis result vectors of those first frames are indexed.
By this method, the key content of every picture in a continuously changing scene can be analyzed, avoiding the loss of key information from continuous pictures that may be caused by analyzing only the key frames obtained from transition-frame detection.
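The key-frame sampling rule above can be illustrated with the following minimal Python sketch. The embed_frame function and the preset difference value are hypothetical stand-ins: in the invention the vectors come from the cross-modal model, not from raw pixels.

```python
import numpy as np

def embed_frame(frame):
    """Placeholder for the cross-modal vector analysis of a single frame:
    here simply a normalised, flattened copy of the frame."""
    v = frame.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def sample_key_frames(scene_frames, preset_diff=0.5):
    """Sample the 1st frame of every 10 frames as a key frame; where two
    consecutive key-frame vectors differ by at least `preset_diff`, also add
    the 6th frame of the 10-frame segment between them as an extra key frame.
    Returns (frame_index, vector) pairs to be stored in the index library."""
    key_indices = list(range(0, len(scene_frames), 10))       # 1st frame of every 10
    vectors = {i: embed_frame(scene_frames[i]) for i in key_indices}

    extra = []
    for a, b in zip(key_indices, key_indices[1:]):
        if np.linalg.norm(vectors[a] - vectors[b]) >= preset_diff:
            mid = a + 5                                        # the 6th frame of the segment
            extra.append(mid)
            vectors[mid] = embed_frame(scene_frames[mid])

    all_indices = sorted(key_indices + extra)
    return [(i, vectors[i]) for i in all_indices]
```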
Step 2, selecting the materials or material groups required for creation from the organization resource library. When intelligent video editing creation starts, the related materials required for creation are selected from the resource library; one or more related materials may be selected for intelligent editing of the video and audio content. Selecting materials from the organization resource library ensures the authenticity and reliability of the material source and the legality of the copyright. In the method, intelligent editing creation is performed based on the organization resource library and a local or cloud resource library, which effectively avoids problems such as copyright disputes caused by capturing materials from the internet and the need to verify the authenticity and reliability of such materials.
Step 3, importing the text manuscript, rewriting and classifying it, namely directly importing the video text manuscript and rewriting it with a large language model. A general text manuscript is not necessarily suitable for expression in video shot language; with the large language model, the general text manuscript can be rewritten into a video shot script through semantic understanding of the original manuscript, generating content such as picture descriptions and narration. In the technical scheme of the invention, the text content in the shot script can also be classified and labeled.
For [contemporaneous sound], the method of the invention provides two different labeling modes, continuous and single-sentence. The continuous mode is suitable for a complete interview or dialogue, and the single-sentence mode is suitable for precisely selecting a whole sentence or phrase from an interview or dialogue. Text portions not labeled [contemporaneous sound] are processed by the system according to the [text] classification by default.
Step 4, automatically using different intelligent matching models to carry out shot matching according to the different classifications of the text manuscript:
The text portions marked as [contemporaneous sound] and [text] in the previous step are matched with different intelligent shot matching methods. For text content marked as [contemporaneous sound], the system performs similarity matching against the contemporaneous-sound indexes based on semantic understanding of the text manuscript; even if the manuscript text is not identical to the speech-recognition text, matching can still be performed according to the result of semantic understanding, ensuring intelligent matching between the written text in the manuscript and the spoken language in material interviews and conversations.
For shot results matched in the contemporaneous-sound mode, when the shot segment is finally used, the method of the invention uses the original picture and sound corresponding to the shot, to enhance the on-the-scene feel of the final result.
For the [text] portions, the system performs semantic understanding based on the text, matches it with video and audio content in the vector dimension from the cross-modal index library, forms similarity data according to the matching comparison result, and performs intelligent shot matching according to the similarity. For shot results matched in the text mode, when the shot segment is finally used, only the picture portion corresponding to the shot is used, and the sound portion is replaced by subsequently synthesized speech.
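A minimal Python sketch of this vector-dimension matching is given below, assuming text and shot embeddings have already been produced by the same cross-modal model; cosine similarity is used here only as one plausible similarity measure, and all names and vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_text_to_shots(text_vector, indexed_shots):
    """Rank indexed shots against a sentence/segment vector from the text manuscript.

    `indexed_shots` is a list of (shot_info, vector) pairs taken from the
    cross-modal index library. Returns similarity data sorted best-first."""
    scored = [(shot_info, cosine_similarity(text_vector, vec))
              for shot_info, vec in indexed_shots]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# toy usage with 3-dimensional vectors standing in for real cross-modal embeddings
index = [("clip_A frames 0-120", np.array([0.9, 0.1, 0.0])),
         ("clip_B frames 40-200", np.array([0.2, 0.8, 0.1]))]
print(match_text_to_shots(np.array([1.0, 0.0, 0.0]), index)[0][0])   # clip_A frames 0-120
```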
Step 5, adjusting the intelligent shot matching result to generate matching candidate shot groups. During the intelligent shot matching of the previous step, whether in [contemporaneous sound] mode or [text] mode, the matching result of each sentence/segment of text is stored as a shot group and sorted by similarity; the maximum number of shots in a shot group is defined, the shot with the highest matching similarity in each group is provided to the next step as the preferred result, and the maximum number of shots in a shot group is usually recommended to be set to 10.
In general, the invention provides the shot with the highest matching similarity in each group (i.e. the shot numbered 01) as the preferred result for the next processing step.
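As a sketch only, the matching candidate shot group of step 5 could be organized as follows; the field names are hypothetical, and the cap of 10 shots reflects the recommended maximum above.

```python
from dataclasses import dataclass

@dataclass
class CandidateShot:
    clip_id: str
    in_point: float     # seconds into the original material
    out_point: float
    similarity: float

class ShotGroup:
    """Matching candidate shot group for one sentence/segment of text.

    Shots are kept sorted by similarity and capped at a maximum size
    (10 by default); the preferred result is the highest-similarity shot."""
    def __init__(self, text, max_shots=10):
        self.text = text
        self.max_shots = max_shots
        self.shots = []

    def add(self, shot):
        self.shots.append(shot)
        self.shots.sort(key=lambda s: s.similarity, reverse=True)
        del self.shots[self.max_shots:]      # keep only the best `max_shots`

    def preferred(self):
        return self.shots[0] if self.shots else None
```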
Step 6, generating a sequence and adjusting it according to the audio-visual language model, which includes intelligent scene merging according to the time order of preceding and following shots in the original material within the shot matching result, and analysis and processing of the matched shot results using the audio-visual language model.
In the invention, besides intelligent shot matching and recommendation for each sentence/segment of text, the single-sentence shot matching results of the previous step are further corrected and adjusted according to the content of the whole manuscript or the shot script.
When intelligent shot matching is performed in the [contemporaneous sound] mode, text-to-speech matching based on semantic understanding is used, and post-processing of the shot matching results is performed for contemporaneous-sound manuscripts labeled as continuous. The intelligent scene merging method is as follows: for several consecutive sentences/segments of text, scenes are merged intelligently according to the time order of preceding and following shots in the original material within the shot matching results.
The specific method comprises the following steps:
each shot matching result contains information such as the original material id ClipID, an IN point IN and an OUT point OUT;
the matching shot results of consecutive sentences/segments of contemporaneous-sound text are denoted C0, C1, C2, ..., with corresponding original material ids ClipID1, ClipID2, ClipID3, ..., corresponding original material in-points IN1, IN2, IN3, ..., and corresponding original material out-points OUT1, OUT2, OUT3, ...;
first, the material information of the first pair of shots C1 and C0 is compared;
it is checked whether the original material IDs corresponding to the two shots are the same;
if ClipID2 differs from ClipID1, the two shot matching results come from different materials, no scene merging is needed, and the next pair of materials C2 and C1 is compared;
if ClipID2 is the same as ClipID1, the continuity of the two shots is further compared, i.e. the material in-point IN2 of shot C1 is compared with the material out-point OUT1 of shot C0;
if IN2 − OUT1 < t, where t is a system-predefined value, the second matching shot result and the first matching shot result are continuous in time, and scene merging is performed;
if IN2 − OUT1 ≥ t, the second matching shot result and the first matching shot result are not continuous in time, and scene merging is not performed;
and so on, until the last contemporaneous-sound matching shot result is processed.
When the [contemporaneous sound] label in the manuscript ends, the comparison finishes automatically; scene merging analysis is not performed across different [contemporaneous sound] paragraphs.
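The continuity comparison described above can be sketched in Python as follows; the time unit (seconds), the field names and the example threshold t are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class MatchedShot:
    clip_id: str
    in_point: float    # IN point in the original material, in seconds
    out_point: float   # OUT point in the original material, in seconds

def merge_contemporaneous_shots(shots, t=1.0):
    """Merge consecutive contemporaneous-sound shot results C0, C1, C2, ...
    when they come from the same original material and are continuous in time
    (IN of the later shot minus OUT of the earlier shot is below the
    system-predefined value t). Returns the merged shot list."""
    if not shots:
        return []
    merged = [shots[0]]
    for current in shots[1:]:
        previous = merged[-1]
        same_clip = current.clip_id == previous.clip_id
        continuous = same_clip and (current.in_point - previous.out_point) < t
        if continuous:
            # extend the previous scene instead of starting a new one
            previous.out_point = max(previous.out_point, current.out_point)
        else:
            merged.append(current)
    return merged

# toy usage: the first two results are continuous within the same material
shots = [MatchedShot("clip_7", 10.0, 14.0),
         MatchedShot("clip_7", 14.2, 18.0),
         MatchedShot("clip_9", 3.0, 6.5)]
print(len(merge_contemporaneous_shots(shots)))   # 2
```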
When intelligent shot matching is performed in the [text] mode, the intelligent analysis system can calculate and recommend shot matches by similarity for each sentence/segment of text, but video and audio content editing cannot treat individual sentences, segments or shots in isolation; it must be considered as a whole according to the context.
When the video clip sequence is generated, the invention innovatively introduces an audio-visual language model to analyze and process the matched shot results. First, the intelligently matched shot result of each sentence/segment of text is judged: starting from the second sentence/segment, the first shot segment of the current matching candidate shot group is checked for duplicate material against the matching results of the preceding text; if a repetition exists, the next candidate shot of the matching candidate shot group is used, until there is no repetition with the preceding results. If all shots in the matching candidate shot group corresponding to a sentence/segment of text repeat preceding shots, the matching shot of that sentence/segment is left blank, to be filled in manually with a replacement shot later.
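A minimal sketch of this duplicate check is given below, under the assumption that "repetition" means reusing an overlapping segment of the same original material; the data layout and names are hypothetical.

```python
def overlaps(a, b):
    """True if two (clip_id, in_point, out_point) segments reuse the same material."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def choose_non_repeating_shot(candidate_group, used_segments):
    """Pick the first candidate in a matching candidate shot group that does not
    repeat any material segment already used by the preceding text; return None
    (i.e. leave the slot blank for manual replacement) if every candidate repeats.

    `candidate_group` is assumed to be sorted by similarity, best first, and each
    candidate is a (clip_id, in_point, out_point) tuple."""
    for candidate in candidate_group:
        if not any(overlaps(candidate, used) for used in used_segments):
            used_segments.append(candidate)
            return candidate
    return None

# toy usage: the best candidate repeats material already used, so the next one is chosen
used = [("clip_3", 0.0, 8.0)]
group = [("clip_3", 2.0, 6.0), ("clip_5", 10.0, 14.0)]
print(choose_non_repeating_shot(group, used))   # ('clip_5', 10.0, 14.0)
```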
Second, the shot length of the intelligently matched shot result is processed: a shot whose duration is too short is extended forwards or backwards until its length meets the preset minimum shot length.
Third, audio-visual language attributes such as foreground and background shots and shooting technique are analyzed and matched against the montage sentence patterns in the audio-visual language model. If the analysis results of the preceding and following shots match any of the montage sentence patterns, they are kept; otherwise, the current shot is replaced from the matching candidate shot group until it conforms to an audio-visual language sentence pattern.
Finally, the length of the video shot is fine-tuned according to the audio length of the dubbing or contemporaneous sound.
After shot matching has been performed for each sentence/segment of text in the manuscript, the shot matching results are arranged and spliced according to the order of the manuscript or the shot script.
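By way of illustration, the final trimming and splicing could look like the following Python sketch; it covers only shortening a shot to the audio length and laying the shots on a timeline in manuscript order, and omits the extension of too-short shots described above. All names and numbers are hypothetical.

```python
def trim_to_audio(shot, audio_duration):
    """Fine-tune a (clip_id, in_point, out_point) shot so its duration does not
    exceed the audio length of the dubbing or contemporaneous sound for that sentence."""
    clip_id, in_point, out_point = shot
    return (clip_id, in_point, in_point + min(audio_duration, out_point - in_point))

def splice_timeline(matched_shots, audio_durations):
    """Arrange the per-sentence shot results on a timeline in manuscript order.
    Returns (timeline_in, timeline_out, clip_id, in_point, out_point) entries."""
    timeline, cursor = [], 0.0
    for shot, audio_len in zip(matched_shots, audio_durations):
        clip_id, in_point, out_point = trim_to_audio(shot, audio_len)
        duration = out_point - in_point
        timeline.append((cursor, cursor + duration, clip_id, in_point, out_point))
        cursor += duration
    return timeline

# toy usage: two sentences, dubbing lengths 4.0 s and 2.5 s
shots = [("clip_1", 10.0, 20.0), ("clip_2", 5.0, 7.0)]
print(splice_timeline(shots, [4.0, 2.5]))
```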
Step 7, generating dubbing and subtitles, and adding music. In the invention, during intelligent editing, dubbing can be generated for the narration portion of the editing result through a speech-recognition and speech-synthesis engine, subtitles can be automatically generated for the text and contemporaneous sound, the music provided by the system can be classified by emotion, and music of the corresponding emotion can be selected and automatically added according to the generated intelligent editing result.
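As a simple illustration of the emotion-based music selection, the following Python sketch filters a hypothetical system music library by emotion label; the library structure and labels are assumptions, not the invention's actual implementation.

```python
import random

def pick_music(music_library, target_emotion):
    """Select one track from the system music library whose emotion label matches
    the emotion desired for the generated editing result; return None if no track
    of that emotion exists."""
    candidates = [track for track in music_library if track["emotion"] == target_emotion]
    return random.choice(candidates) if candidates else None

library = [{"title": "Morning Fields", "emotion": "uplifting"},
           {"title": "Quiet Rain",     "emotion": "calm"}]
print(pick_music(library, "calm"))   # {'title': 'Quiet Rain', 'emotion': 'calm'}
```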
Step 8, completing the intelligent edit and performing manual proofreading to meet the final publishing review requirements. After the above series of intelligent processing, the intelligent editing timeline result of the method is generated; after the system intelligently generates the timeline result, further proofreading can be performed manually to meet the final publishing review requirements.
Finally, it should be noted that the above only illustrates the technical solution of the present invention without limiting it; although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the present invention (such as names of categories, changes of numbers, the order of steps, etc.) may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the present invention.