CN119299802B - Video automatic generation system based on Internet - Google Patents
- Publication number
- CN119299802B (application CN202411827795.3A)
- Authority
- CN
- China
- Prior art keywords
- video
- picture content
- key frame
- semantic
- enhancement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an Internet-based video automatic generation system that uses artificial-intelligence-based data analysis and processing. Key frames are sampled from a first video segment, picture content identification and semantic distinguishable enhancement are then performed on the sampled key frames of the first video segment images, and the type label of the first video segment is obtained from the dynamic semantic representation of picture content across the semantic distinguishable enhancement features. In this way, the system helps capture the core theme and important elements of the video rather than merely the most frequently occurring elements, and it also captures the temporal evolution of the video content, so that the generated label not only reflects static picture content but also embodies the dynamic characteristics of the video.
Description
Technical Field
The application relates to the field of intelligent generation, and more particularly, to an automatic video generation system based on the Internet.
Background
With the rapid development of Internet technology, video content has become an important form of information dissemination and entertainment consumption. The growing user demand for personalized, high-quality video content has driven the development of automatic video generation techniques. Patent CN118741247A proposes an automatic video generation method that includes selecting a local video clip and recording its total duration and script information, generating a first tag by frame-by-frame image recognition, matching and ranking templates according to the tag, splicing a high-priority template with the video clip, and adjusting the length to generate the final video.
In that patent, the video clip tag is determined by counting, from the image recognition results, the number of frames in which each element appears and selecting the element appearing in the most frames as the tag of the clip. However, determining a tag based only on the most frequently occurring element may fail to capture the core content or theme of the video clip. For example, if an object appears frequently in the background of a video while the actual primary content is a person's actions, the tag may be biased toward the background object rather than the key content. Furthermore, because video content typically changes dynamically, the occurrence count of a single element may be insufficient to represent the primary content of the entire segment; in videos with frequent scene transitions or rich motion in particular, this approach may fail to capture the core content of the video.
Accordingly, an optimized video auto-generation system is desired.
Disclosure of Invention
The present application has been made to solve the above technical problems. The embodiment of the application provides an Internet-based video automatic generation system that uses artificial-intelligence-based data analysis and processing: key frames are sampled from the first video segment, picture content identification and semantic distinguishable enhancement are performed on the sampled key frames of the first video segment images, and the type label of the first video segment is obtained from the dynamic semantic representation of picture content across the semantic distinguishable enhancement features. In this way, the system helps capture the core theme and important elements of the video rather than merely the most frequently occurring elements, and it captures the temporal evolution of the video content, so that the generated label not only reflects static picture content but also embodies the dynamic characteristics of the video.
According to one aspect of the present application, there is provided an Internet-based video automatic generation system, comprising: a video clip extraction module for extracting a first video clip from a local video database; a type tag analysis module for analyzing the first video clip to obtain a type tag of the first video clip; a template determination module for extracting template tags of a plurality of templates from a video material library and, based on the matching degree between the type tag of the first video clip and the template tag of each template, prioritizing the plurality of template tags to obtain a plurality of ordered templates arranged by priority; and a slice generation module for extracting the ordered template with the highest priority from the plurality of ordered templates and splicing and clipping the highest-priority ordered template with the first video clip to obtain a finished slice. The type tag analysis module comprises: a video clip sampling unit for sampling key frames of the first video clip to obtain a time queue of first video clip image key frames; a picture content identification unit for identifying the picture content of each key frame in the time queue to obtain a time queue of picture content key frame semantic coding features; a content enhancement unit for performing sequence condition distinguishable enhancement on the time queue of picture content key frame semantic coding features to obtain a time queue of picture content semantic distinguishable enhancement features; and a type tag generation unit for performing picture content dynamic coding on the time queue of picture content semantic distinguishable enhancement features to obtain a picture content dynamic semantic coding representation, and obtaining the type tag of the first video clip based on the picture content dynamic semantic coding representation.
Compared with the prior art, the Internet-based video automatic generation system provided by the application uses artificial-intelligence-based data analysis and processing to sample key frames of the first video segment, and then performs picture content identification and semantic distinguishable enhancement on the sampled key frames of the first video segment images, so that the type label of the first video segment is obtained from the dynamic semantic representation of picture content across the semantic distinguishable enhancement features. In this way, the system helps capture the core theme and important elements of the video rather than merely the most frequently occurring elements, and it captures the temporal evolution of the video content, so that the generated label not only reflects static picture content but also embodies the dynamic characteristics of the video.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the detailed description of embodiments of the present application given with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application and, together with the embodiments of the application, serve to explain it, and do not constitute a limitation on the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a block diagram of an internet-based video auto-generation system according to an embodiment of the present application.
Fig. 2 is a data flow diagram of an internet-based video auto-generation system according to an embodiment of the present application.
Fig. 3 is a block diagram of a type tag analysis module in an internet-based video auto-generation system according to an embodiment of the present application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in the specification and in the claims, the terms "a," "an," and/or "the" do not denote the singular and may include the plural unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements that are explicitly identified are included; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
In the technical scheme of the application, an Internet-based video automatic generation system is provided. Fig. 1 is a block diagram of an Internet-based video auto-generation system according to an embodiment of the present application. Fig. 2 is a data flow diagram of an Internet-based video auto-generation system according to an embodiment of the present application. As shown in fig. 1 and 2, the Internet-based video automatic generation system 300 according to an embodiment of the present application includes: a video clip extraction module 310 configured to extract a first video clip from a local video database; a type tag analysis module 320 configured to analyze the first video clip to obtain a type tag of the first video clip; a template determination module 330 configured to extract template tags of a plurality of templates from a video material library and prioritize the plurality of template tags based on the matching degree between the type tag of the first video clip and the template tag of each template to obtain a plurality of ordered templates ranked by priority; and a slice generation module 340 configured to extract the ordered template with the highest priority from the plurality of ordered templates, and splice and clip the highest-priority ordered template with the first video clip to obtain a finished slice.
In particular, the video clip extraction module 310 is configured to extract a first video clip from a local video database. The process of extracting the first video clip from the local video repository is a key step in an internet-based video auto-generation system that ensures that subsequent processing (such as analysis, tag generation and final composition of the clips) can be performed based on high quality and relevant material.
When a user needs to generate a new video, the user firstly browses the local video database through an intuitive and functional user interface, and manually selects video clips according to the needs of the user. This interface supports a variety of filtering criteria, such as by folder category, date range, video duration, etc., so that the user can quickly find the desired resource. For more advanced application scenarios, the system may also make intelligent recommendations using machine learning algorithms. The mechanism can automatically select the video clip which is most suitable for the current task according to the preference history, the project requirement and the context information of the user. Intelligent recommendation typically involves deep understanding of video content, including identifying features such as scenes, characters, actions, etc. in the video to improve the relevance and accuracy of the recommendation.
Once the user selects a video clip or the system completes the automatic recommendation, the next step is to read and parse the metadata of the selected video in detail. The system immediately reads all available metadata of the selected video, which covers the basic properties and technical parameters of the video, such as file format (MP4, AVI), coding standard (H.264, VP9), resolution (1080p, 4K), frame rate (FPS), and track information. In addition, if the video carries additional information such as shooting time and place, the system reads that as well. If a script file exists, the system parses it to obtain more guiding information about the video structure and content. Script files typically contain details such as the time points of background music, subtitles, and special-effect insertion locations, which are critical to ensuring the quality and consistency of the generated video. The parsed script information is integrated into the metadata of the video for use by the subsequent processing modules.
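As a concrete illustration of this metadata-reading step, the sketch below uses ffprobe to pull container, codec, resolution, duration and frame-rate information. The helper name and the particular set of returned fields are illustrative assumptions, not part of the application.

```python
import json
import subprocess

def read_video_metadata(path: str) -> dict:
    """Read container- and stream-level metadata for a selected clip using ffprobe."""
    # ffprobe emits format- and stream-level metadata as JSON.
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    info = json.loads(result.stdout)
    video_stream = next(s for s in info["streams"] if s["codec_type"] == "video")
    return {
        "container": info["format"]["format_name"],        # e.g. "mov,mp4,m4a,3gp,..."
        "codec": video_stream["codec_name"],                # e.g. "h264"
        "width": int(video_stream["width"]),
        "height": int(video_stream["height"]),
        "duration_s": float(info["format"]["duration"]),
        "avg_frame_rate": video_stream["avg_frame_rate"],   # e.g. "30/1"
    }
```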
To ensure that all video clips can be processed uniformly and meet the requirements of the target platform, the necessary pre-processing of the video follows. Considering that video from different sources may have different formats and coding schemes, the system will convert them into a common format, such as an h.264 encoded MP4 file. This step ensures that all video clips can be processed under the same framework, simplifying the subsequent workflow. Unnecessary parts of the beginning and the end of the video are removed, and the video is clipped according to a specific editing rule. This step also includes adjusting the video size, cropping the frame scale, etc. to better adapt to the requirements of the template or target platform. Meanwhile, the system can carry out basic evaluation on video quality, detect and correct the problems of jitter, blurring and the like, and apply enhancement technology to improve the image quality when necessary. After the preprocessing is finished, the system can carry out quality inspection on each video segment, and the aspects of definition, stability and the like of the video segment are ensured to reach the expected standard. Only the strictly verified high quality video segments will go to the next stage of processing.
Calculating the total duration of the video clip and achieving script synchronization are key steps to ensure the fluency and consistency of the final work. The system will accumulate the duration of all selected video clips to get the total duration of the entire project. This value is important for subsequent splicing, editing and ensuring that the final work meets the desired length. Accurate duration calculation is beneficial to planning the overall structure of the video, and overlong or too short conditions are avoided. If there is a script file, the system will attempt to synchronize the video clip with the time node in the script. This means that the key time in the video is to be exactly matched to the action, dialog or music change described in the script. In this way, the system ensures that the final output work is not only visually coherent, but also achieves optimal results in terms of audio and other aspects.
In order to maintain the efficiency of the process flow and the efficient management of resources, the system also needs to carefully schedule the storage paths of the video clips. To avoid affecting the original video file, the system creates a temporary copy of the selected video clip in the workspace. These copies are used in the next processing steps and are automatically deleted after the task is completed. Thus, the safety of the source file is protected, and the subsequent processing is convenient. The system will record the exact storage location of each video clip in its local repository, including the location of any temporary copies. This helps the subsequent modules to quickly access the required resources and maintains the efficiency of the process flow. Accurate storage path management is particularly important for multi-module collaboration, which ensures smooth and unimpeded information transfer between modules.
Finally, in order for the video clip to successfully enter the next stage, the type tag analysis module, the system needs to prepare and send out the corresponding data packet. All necessary information is packaged into a structured data object containing the video clip itself, metadata, pre-processing results and any associated script information. Such a data packet design makes the information transfer more orderly and efficient.
In particular, the type tag analysis module 320 is configured to analyze the first video segment to obtain a type tag of the first video segment. In one specific example of the present application, as shown in fig. 3, the type tag analysis module 320 includes: a video clip sampling unit 321 configured to sample key frames of the first video clip to obtain a time queue of first video clip image key frames; a picture content identification unit 322 configured to identify the picture content of each first video clip image key frame in the time queue to obtain a time queue of picture content key frame semantic coding features; a content enhancement unit 323 configured to perform sequence condition distinguishable enhancement on the time queue of picture content key frame semantic coding features to obtain a time queue of picture content semantic distinguishable enhancement features; and a type tag generation unit 324 configured to perform picture content dynamic coding on the time queue of picture content semantic distinguishable enhancement features to obtain a picture content dynamic semantic coding representation, and to obtain the type tag of the first video clip based on the picture content dynamic semantic coding representation.
Specifically, the video clip sampling unit 321 is configured to sample the key frames of the first video clip to obtain a time queue of first video clip image key frames. The first video segment contains key information and change points that play an important role in video classification. Therefore, in the technical scheme of the application, key frame sampling is performed on the first video segment to obtain the time queue of first video clip image key frames. Key frame sampling is an efficient method that allows the system to focus on the parts that are critical to understanding the overall content of the video while reducing unnecessary data processing, helping to quickly grasp the main plot or topic of the video and making the overall video analysis process more efficient and targeted.
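A minimal sketch of how such key frame sampling might be realized is given below, combining a fixed sampling stride with a colour-histogram change detector. The stride, histogram size and Bhattacharyya threshold are illustrative assumptions rather than values specified by the application.

```python
import cv2

def sample_key_frames(video_path: str, stride: int = 30, diff_threshold: float = 0.4):
    """Return a time-ordered list of (frame_index, frame) pairs for key frames."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            # Colour histogram as a cheap summary of the picture content.
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            # Keep the frame if the content changed noticeably since the last key frame.
            if prev_hist is None or cv2.compareHist(
                    prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
                key_frames.append((idx, frame))
                prev_hist = hist
        idx += 1
    cap.release()
    return key_frames  # the "time queue" of first video clip image key frames
```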
Specifically, the picture content identification unit 322 is configured to identify the picture content of each first video clip image key frame in the time queue of first video clip image key frames to obtain the time queue of picture content key frame semantic coding features. Each key frame of the first video clip image contains semantic information conveyed by its picture content, such as the main characters, activities, and places in the video, and convolutional neural network models are well suited to capturing and mining such key information in images. Therefore, in the technical scheme of the application, each first video clip image key frame in the time queue is input into a picture content identifier based on a convolutional neural network model to obtain a time queue of picture content key frame semantic coding feature vectors, so that high-level visual features can be effectively extracted from each key frame. These features typically contain abstract representations of complex information such as objects, scenes, and textures in the image.
Notably, a convolutional neural network (Convolutional Neural Network, CNN) is a type of deep learning model specifically designed to process data with a grid structure (e.g., images and video). CNNs perform excellently in computer vision and are widely applied to tasks such as image classification, object detection, and semantic segmentation. Unlike traditional fully connected neural networks, a CNN automatically extracts local features in the data by introducing convolutional layers, and reduces the number of parameters and the amount of computation by using pooling layers, thereby improving the efficiency and generalization capability of the model. The structure of a CNN consists of several parts. First, the input layer receives raw data such as an image; for color images, the input is typically a three-dimensional tensor expressed as width x height x number of channels (e.g., 3 channels for an RGB image). Next come the convolutional layers, each of which contains a number of convolution kernels that slide over the input data and perform convolution operations to generate feature maps. A convolution kernel is typically a small square (e.g., 3 x 3 or 5 x 5), and each kernel is responsible for capturing a particular type of local feature. After a convolution operation, an activation function (e.g., ReLU) is usually applied to perform a nonlinear transformation and enhance the expressive power of the model. To keep the input and output feature maps the same size, additional zero-valued pixels may be added around the input data (referred to as "zero padding" or simply "padding"). The pixel distance the convolution kernel moves each time is called the stride (Stride), and a larger stride reduces the size of the output feature map. A convolutional layer is typically followed by a pooling layer, which further compresses the data dimension through a downsampling operation while preserving key features. Common pooling methods include max pooling (Max Pooling) and average pooling (Average Pooling): max pooling takes the maximum value in each cell as the representative value of that region, while average pooling takes the mean. The pooling layer reduces the spatial size of the feature map and the number of parameters while retaining the most important information. The feature map produced by multiple layers of convolution and pooling is flattened and fed into fully connected layers, which flatten the preceding feature maps into a one-dimensional vector connected to a number of neurons. Each neuron is connected to all neurons of the previous layer, and the weights are updated by the back-propagation algorithm. The fully connected layers are used for the final classification or other decision task, and the output layer is designed according to the specific task: for classification tasks it gives a probability distribution over the classes, and for regression tasks it outputs numerical values directly.
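To make the CNN-based picture content identifier concrete, the following sketch uses a pretrained ResNet-18 from torchvision (classification head removed) as the per-key-frame feature extractor. The choice of backbone, the 224 x 224 input size and the 512-dimensional output are assumptions, since the application only requires a convolutional neural network model; the torchvision weights API shown assumes torchvision 0.13 or later.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained backbone with the classification head removed, so the output of the
# global pooling layer serves as the key-frame semantic coding feature vector.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_key_frames(key_frames):
    """Map each sampled key frame (BGR ndarray) to a semantic coding feature vector."""
    feats = []
    for _, frame in key_frames:
        rgb = frame[:, :, ::-1].copy()        # BGR -> RGB
        x = preprocess(rgb).unsqueeze(0)      # 1 x 3 x 224 x 224
        feats.append(backbone(x).squeeze(0))  # 512-dim feature vector
    return torch.stack(feats)                 # time queue: N x 512
```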
Specifically, the content enhancement unit 323 is configured to perform sequence condition distinguishable enhancement on the time queue of picture content key frame semantic coding features to obtain a time queue of picture content semantic distinguishable enhancement features. Each picture content key frame semantic coding feature in the time queue carries deep, important information structure to a different degree. Therefore, to further improve the expressiveness of each key frame semantic coding feature vector and make the differences between the key frame semantic coding features of different picture content more pronounced, thereby better capturing the nuances between different scenes or elements in the video segment, the application introduces a sequence condition distinguishable enhancement mechanism that strengthens the time queue of picture content key frame semantic coding features into a time queue of picture content semantic distinguishable enhancement features. This improves the quality of the individual key frame features, enhances the consistency and expressive capacity of the whole sequence, and provides more accurate and richer data support for the subsequent generation of type labels.
The specific steps of performing sequence condition distinguishable enhancement on the time queue of picture content key frame semantic coding features are as follows. First, field mapping is applied to each picture content key frame semantic coding feature vector in the time queue of picture content key frame semantic coding feature vectors to obtain a time queue of incoming picture content key frame semantic coding feature vectors. That is, a field mapping operation converts each picture content key frame semantic coding feature vector into a form better suited to the current model or task, so that the features better reflect the internal relations among samples and the key information is easier for the algorithm to capture.
More specifically, each picture content key frame semantic coding feature vector in the time queue is subjected to field mapping according to a field mapping formula, where v_i denotes the i-th picture content key frame semantic coding feature vector in the time queue of picture content key frame semantic coding feature vectors, W_1 and W_2 denote the first weight matrix and the second weight matrix, respectively, and the result is the incoming picture content key frame semantic coding feature vector corresponding to v_i.
In the embodiment of the application, the specific calculation process comprises the step of calculating the field depth factor of each of the key frame semantic coding feature vectors in the time queue of the key frame semantic coding feature vectors of the incoming picture content to obtain the time queue of the key frame semantic field depth factor of the picture content, and particularly, the field depth factor can be understood as an index for measuring the relative importance degree or complexity of the feature point in the area to which the feature point belongs. After the time queue of the picture content key frame semantic field depth factor is obtained by calculation, the picture content key frame semantic field essential feature vector of the time queue of the entering picture content key frame semantic coding feature vector is further calculated based on the time queue of the picture content key frame semantic field depth factor. That is, a feature distribution of the core attribute of the time queue comprehensively reflecting the key frame semantic coding feature vector of the incoming picture content is extracted by using a field effect normalization mechanism, so as to obtain the key frame semantic field essential feature vector of the picture content.
The specific process of calculating the field depth factor of each of the key frame semantic coding feature vectors of the key frame of the incoming picture content in the time queue of the key frame semantic coding feature vectors of the incoming picture content comprises the step of processing each of the key frame semantic coding feature vectors of the incoming picture content in the time queue of the key frame semantic coding feature vectors of the incoming picture content by using a softmax function to obtain the time queue of the field depth factor of the key frame semantic of the picture content.
More specifically, the essence characteristics of the time queue of incoming picture content key frame semantic coding feature vectors are calculated according to an essence-characteristic calculation formula to obtain the picture content key frame semantic field essential feature vector, where the squared vector length of each incoming feature vector is computed and a base-2 logarithm is applied, the field depth factor corresponding to each incoming picture content key frame semantic coding feature vector is obtained, a function is applied to each depth factor to yield the corresponding normalized picture content key frame semantic field depth factor, and the weighted aggregation with these normalized depth factors gives the picture content key frame semantic field essential feature vector.
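Since the formula itself appears in the original only as an image, the following sketch merely illustrates one reading of the field-effect normalization that is consistent with the symbol descriptions above: the depth factor is derived from the base-2 logarithm of each incoming vector's squared length, normalized with a softmax, and used to weight-aggregate the incoming vectors. The exact functional form is an assumption.

```python
import torch

def field_essential_feature(incoming: torch.Tensor) -> torch.Tensor:
    """Sketch of the field-effect normalization described above.

    incoming: N x D time queue of incoming picture content key frame
              semantic coding feature vectors.
    Returns the picture content key frame semantic field essential feature vector (D,).
    """
    # Squared vector length of each incoming feature, then log base 2,
    # taken as the raw field depth factor (assumed form).
    depth = torch.log2(incoming.pow(2).sum(dim=1) + 1e-6)
    # Normalized depth factors via softmax over the time queue.
    weights = torch.softmax(depth, dim=0)              # shape: N
    # Weighted aggregation yields the essential feature vector.
    return (weights.unsqueeze(1) * incoming).sum(dim=0)
```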
The picture content key frame semantic field essential feature vector is then used as a conditional feature vector, and each picture content key frame semantic coding feature vector in the time queue is subjected to distinguishable enhancement to obtain a time queue of picture content semantic distinguishable enhancement coding vectors, which serves as the time queue of picture content semantic distinguishable enhancement features. That is, the picture content key frame semantic field essential feature vector is used as additional condition information, and the time queue of picture content key frame semantic coding feature vectors is subjected to element-wise distinguishable enhancement so that important features are amplified in a targeted manner according to a specific rule or weight adjustment scheme while other, less important parts are suppressed, thereby optimizing feature expression and obtaining the time queue of picture content semantic distinguishable enhancement coding vectors.
More specifically, with the picture content key frame semantic field essential feature vector as the conditional feature vector, each picture content key frame semantic coding feature vector in the time queue is subjected to distinguishable enhancement using a distinguishable enhancement formula, where one matrix denotes the picture content key frame semantic intrinsic weight matrix, another denotes the local picture content key frame semantic weight matrix, a further term denotes the offset vector, an activation function is applied, the multiplication is position-wise (element-wise), and the result is the semantically enhanced picture content semantic distinguishable enhancement coding vector.
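A hedged sketch of this condition-based distinguishable enhancement follows. The sigmoid gating over an affine combination of the essential feature vector and each local feature is an assumed reading of the symbols listed above (intrinsic weight matrix, local weight matrix, offset vector, position-wise multiplication), not a verbatim reproduction of the patent formula.

```python
import torch
import torch.nn as nn

class DistinguishableEnhancement(nn.Module):
    """Sketch of the condition-based distinguishable enhancement (assumed gating form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_intrinsic = nn.Linear(dim, dim, bias=False)  # intrinsic weight matrix
        self.w_local = nn.Linear(dim, dim, bias=False)      # local weight matrix
        self.bias = nn.Parameter(torch.zeros(dim))          # offset vector

    def forward(self, features: torch.Tensor, essential: torch.Tensor) -> torch.Tensor:
        # features: N x D time queue; essential: D-dim conditional feature vector.
        gate = torch.sigmoid(self.w_intrinsic(essential) + self.w_local(features) + self.bias)
        return gate * features  # position-wise (element-wise) multiplication
```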
Specifically, the type tag generation unit 324 is configured to perform picture content dynamic coding on the time queue of picture content semantic distinguishable enhancement features to obtain a picture content dynamic semantic coding representation, and to obtain the type tag of the first video segment based on that representation. In the embodiment of the application, the time queue of picture content semantic distinguishable enhancement coding vectors is first input into a picture content dynamic encoder based on an RNN model to obtain a picture content dynamic semantic coding vector. Because each picture content semantic distinguishable enhancement coding vector expresses the picture content semantic information at one point in time, this step better captures the continuity and trends of change between different key frames in the video segment and represents the dynamic characteristics of the video content more accurately. That is, the RNN can take the information of previous frames into account while processing the current frame, which facilitates generating a coded vector containing more contextual information; this helps the system recognize more complex patterns, such as the development of actions or the transitions of scenes, thereby providing a more comprehensive and deeper understanding of the content.
Notably, a recurrent neural network (Recurrent Neural Network, RNN) is a class of neural networks specifically designed to process sequence data. Unlike conventional feedforward neural networks, an RNN has feedback connections that allow information to be passed between time steps. The basic idea of the RNN is to capture dependencies in a sequence by introducing a "memory" mechanism. Specifically, it receives a new input at each time step and performs its computation in combination with information from all previous time steps. To achieve this, the RNN maintains a hidden state at each time step that depends not only on the current input but also on the hidden state at the previous time step. Thus, over time, the RNN can accumulate and utilize past information to influence future outputs.
Then, the type tag of the first video segment is obtained based on the picture content dynamic semantic coding vector. That is, in the technical solution of the present application, the picture content dynamic semantic coding vector is input into a classifier-based video identifier to obtain the type tag of the first video segment. The picture content dynamic semantic features, obtained by dynamically fusing the time queue of picture content semantic distinguishable enhancement coding vectors, are classified to obtain the type tag of the first video segment. In this way, the method helps capture the core theme and important elements of the video rather than merely the most frequently occurring elements, and captures the temporal evolution of the video content, so that the generated label not only reflects static picture content but also embodies the dynamic characteristics of the video. The specific process of inputting the picture content dynamic semantic coding vector into the classifier-based video identifier comprises performing fully connected coding on the picture content dynamic semantic coding vector with several fully connected layers of the classifier to obtain a coding classification feature vector, and passing the coding classification feature vector through the Softmax classification function of the classifier to obtain the type tag of the first video segment.
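The following sketch puts the RNN-based dynamic encoder and the classifier-based video identifier together. The use of a GRU and a two-layer fully connected head are implementation assumptions, since the application only specifies an RNN-based encoder and a classifier with several fully connected layers followed by a Softmax function.

```python
import torch
import torch.nn as nn

class TypeTagHead(nn.Module):
    """Sketch of the RNN-based dynamic encoder plus the classifier-based video identifier."""

    def __init__(self, dim: int, hidden: int, num_types: int):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)  # GRU chosen as one RNN variant
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_types),
        )

    def forward(self, enhanced_queue: torch.Tensor) -> torch.Tensor:
        # enhanced_queue: 1 x N x D time queue of picture content semantic
        # distinguishable enhancement coding vectors.
        _, last_hidden = self.rnn(enhanced_queue)          # dynamic semantic coding vector
        logits = self.classifier(last_hidden.squeeze(0))
        return torch.softmax(logits, dim=-1)               # probabilities over candidate type tags
```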
Preferably, inputting the picture content dynamic semantic coding vector into the classifier-based video identifier to obtain the type tag of the first video segment comprises: performing feature clustering on the feature set of the picture content dynamic semantic coding vector to obtain an intra-class feature set and an extra-class feature set of the picture content dynamic semantic coding, where x_in denotes an individual feature value in the intra-class feature set and x_out denotes an individual feature value in the extra-class feature set.
The ratio of the number of feature values in the intra-class feature set to the number of feature values in the feature set of the picture content dynamic semantic coding vector is then calculated, where N_in denotes the number of feature values in the intra-class feature set and N denotes the number of feature values in the feature set of the picture content dynamic semantic coding vector, giving the ratio r = N_in / N.
The ratio of the sum of the absolute values of all feature values in the intra-class feature set, raised to a prescribed power, to the sum of the absolute values of all feature values in the feature set of the picture content dynamic semantic coding vector, raised to the same power, is computed to obtain the picture content dynamic semantic coding modulation weight, where x denotes an individual feature value in the feature set of the picture content dynamic semantic coding vector and the result is the modulation weight.
Similarly, the ratio of the sum of the squares of all feature values in the intra-class feature set, raised to the same prescribed power, to the sum of the squares of all feature values in the feature set of the picture content dynamic semantic coding vector, raised to that power, is computed to obtain the picture content dynamic semantic coding harmonic weight.
For each feature value in the intra-class feature set, the product of that feature value and the picture content dynamic semantic coding harmonic weight is calculated and the picture content dynamic semantic coding modulation weight is added to obtain the optimized feature value of that intra-class feature value.
For each feature value in the extra-class feature set, the product of that feature value and the picture content dynamic semantic coding modulation weight is calculated to obtain the optimized feature value of that extra-class feature value.
The optimized picture content dynamic semantic coding vector, assembled from the optimized feature values of the intra-class and extra-class feature sets, is then input into the classifier-based video identifier to obtain the type tag of the first video segment.
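The optimization procedure above can be summarized in the following sketch, which uses a two-cluster k-means to form the intra-class and extra-class feature sets. Treating the larger cluster as the intra-class set and using the count ratio r as the exponent in the modulation and harmonic weights are assumptions made where the original formula images are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def optimize_dynamic_coding(vec: np.ndarray) -> np.ndarray:
    """Sketch of the clustering-based optimization of the dynamic semantic coding vector."""
    values = vec.reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(values)
    # Treat the larger cluster as the intra-class feature set (assumption).
    in_label = np.argmax(np.bincount(labels))
    in_mask = labels == in_label

    r = in_mask.sum() / vec.size                                             # count ratio
    w_mod = (np.abs(vec[in_mask]).sum() ** r) / (np.abs(vec).sum() ** r)     # modulation weight
    w_har = ((vec[in_mask] ** 2).sum() ** r) / ((vec ** 2).sum() ** r)       # harmonic weight

    out = vec.copy()
    out[in_mask] = vec[in_mask] * w_har + w_mod   # optimized intra-class values
    out[~in_mask] = vec[~in_mask] * w_mod         # optimized extra-class values
    return out
```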
Here, since each picture content key frame semantic coding feature vector in the time queue represents the image semantic features of one key frame, the sequence-condition description differs across key frames because their image semantics differ. When performing feature distinguishable enhancement based on this sequence-condition description, it is therefore desirable to promote a semantically uniform aggregate expression of the picture content dynamic semantic coding vector obtained from the enhancement.
Therefore, while feature clustering is performed on the picture content dynamic semantic coding vector, the interaction of the vector's key feature information during clustering is described in terms of the divergence between the clustered features and the feature whole of the vector, and a feature-geometric divergence topology is constructed through low-rank harmonic modulation of the features. This yields translational and rotational symmetry of the pattern distribution of the clustered features relative to the feature whole, so that, through manipulation by irreducible low-rank coefficients, geometric message passing is introduced into the feature expression of the picture content dynamic semantic coding vector and the symmetry of its clustering mapping is realized. This improves the clustering-based consistency of the feature expression of the picture content dynamic semantic coding vector, and thereby improves the accuracy of the type tag of the first video segment obtained by inputting the vector into the classifier-based video identifier.
In particular, the template determination module 330 is configured to extract template tags of a plurality of templates from a video material library, and to prioritize the plurality of template tags based on the matching degree between the type tag of the first video clip and the template tag of each template to obtain a plurality of ordered templates ranked by priority. In one specific example, after the selection, analysis, and tag generation of a video clip are completed, the next task is to find the template that best fits the video. To do so, the system first traverses the video material library for templates whose template tags match the generated type tag. "Matching" here refers not merely to simple keyword comparison but to evaluating the similarity between the two with a more sophisticated algorithm. Once a number of potentially matching templates are found, the system prioritizes them according to preset rules. The ranking depends on factors including, but not limited to, the historical usage frequency of a template, its click rate, and user satisfaction. The recommendation algorithm integrates these metrics and dynamically adjusts the priority of each template so that the templates most likely to meet the user's needs and the most popular templates are ranked at the top. Notably, this process is not fixed: it continuously self-optimizes as new user data arrives, ensuring that the recommendation results always follow current trends and individual preferences.
More specifically, the ranking covers several aspects. The system counts the number of times each template has been selected in the past; a frequently used template generally has high versatility and popularity, and such templates have been verified many times, proving that they provide good results in most cases. The system also considers users' prior evaluations of each template, and templates with high scores are recommended preferentially; user feedback directly reflects the actual performance and user experience of a template and is an important measure of its quality. The system tracks the number of times each template is clicked and viewed; a higher click rate indicates greater user interest and is a direct reflection of a template's appeal, helping the system identify the templates users prefer. Based on specific properties of the video clip (e.g., duration, resolution), the system evaluates which templates best suit the current requirements; different video clips have different requirements, and the system flexibly adjusts the template selection to ensure the best fit. The priorities of certain templates are also adjusted in light of current trends or hot topics so as to match present market demand more closely; the popularity of video content changes rapidly, and the system needs a certain flexibility to respond to market changes in time. By comprehensively considering all these indicators, the recommendation algorithm dynamically adjusts the priority of each template so that the templates most likely to meet user needs and the most popular templates are ranked at the top.
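A minimal scoring sketch of this tag matching and multi-factor prioritization is given below. The Jaccard tag similarity and the particular weights for usage frequency, rating and click rate are illustrative assumptions, not values fixed by the application.

```python
from dataclasses import dataclass

@dataclass
class Template:
    name: str
    tags: set
    usage_count: int
    avg_rating: float   # user score, assumed on a 0-5 scale
    click_rate: float   # assumed in [0, 1]

def rank_templates(clip_tags: set, templates: list, weights=(0.4, 0.2, 0.2, 0.2)):
    """Sketch of tag matching plus multi-factor template prioritization."""
    w_match, w_use, w_rate, w_click = weights
    max_use = max(t.usage_count for t in templates) or 1

    def score(t: Template) -> float:
        # Jaccard similarity between the clip's type tags and the template tags.
        match = len(clip_tags & t.tags) / len(clip_tags | t.tags)
        return (w_match * match
                + w_use * t.usage_count / max_use
                + w_rate * t.avg_rating / 5.0
                + w_click * t.click_rate)

    return sorted(templates, key=score, reverse=True)  # ordered templates by priority
```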
In particular, the slice generation module 340 is configured to extract the ordered template with the highest priority from the plurality of ordered templates, and to splice and clip the highest-priority ordered template with the first video segment to obtain a finished slice. In one specific example, the highest-priority template is spliced with the original video segment to produce a complete finished slice. This step also requires careful handling: for example, when the duration of the template exceeds that of the original video, the video segment may be automatically cropped or the template adapted by looping; conversely, if the template is shorter, additional content may be added at an appropriate location. Either way, the goal is to make the final work both aesthetically pleasing and smooth. Once the slice is generated, the system automatically saves the file and notifies the user that a new finished product is available. At the same time, the user is prompted to score the templates used, and valuable user-experience feedback is collected to improve algorithms and services in the future.
More specifically, upon completing tag matching and prioritization of the multiple templates in the video material library, the system outputs a prioritized template list. Next, the system selects the highest-priority template as the candidate and further checks its compatibility with the video clip. If the highest-priority template is fully satisfactory, the process goes directly to the next step; otherwise the system continues down the list to the next-best template until the most appropriate one is found. During selection, the system considers factors such as technical-specification adaptation, content consistency, and duration matching: it ensures that the technical parameters of the selected template (such as resolution, frame rate, and audio format) match the video clip to avoid incompatibility problems; it checks whether the theme, style, and scene type of the template are consistent with the core content of the video clip so that the generated video is coherent and natural; and it evaluates whether the length of the template suits the total duration of the video clip, adjusting the template or the video clip if necessary to ensure the integrity and fluency of the finished video.
That is, once the most appropriate template is selected, the system performs the necessary pre-processing of the template and video clip, ready for subsequent splicing and editing. To ensure that templates and video clips can be processed under the same framework, the system will convert them to a common format, such as an h.264 encoded MP4 file, which simplifies the subsequent workflow and improves processing efficiency. Meanwhile, unnecessary parts at the beginning and the end of the video are removed, and the video is clipped according to specific editing rules, including operations of adjusting the video size, clipping picture proportion and the like, so as to better adapt to the requirements of the template. In addition, the system can basically evaluate the video quality, detect and correct the problems of jitter, blurring and the like, and apply enhancement technology to improve the image quality if necessary. If a script file is present, the system will attempt to synchronize the video clip with the time node in the script, precisely match the key time in the video with the action, dialog or music changes described in the script to ensure that the final output work is not only visually coherent, but also achieves optimal results in terms of audio and other aspects.
After preprocessing is completed, the system starts to splice and clip the template with the highest priority with the original video segment to generate a complete finished video. Depending on the length of time of the video clip and the template, the system will align both precisely on the timeline, including adjusting the playback speed of the video clip, repeating certain portions, or adding transitional effects to ensure seamless joining of the entire video over time. Considering that the template may contain specific start-up animation, end-of-scene or other fixed elements, the system needs to ensure that these elements are coordinated with the picture style of the video clip, and the visual transition between the video clip and the template is smoother and more natural by means of color correction, filter application, etc. If the template contains background music or special effects, the system mixes these tracks with the sound tracks of the video clip, taking care to keep the volume balanced, avoiding that either party is too prominent to affect the overall effect. In addition, the music rhythm can be dynamically adjusted according to the video content, so that the music rhythm is more suitable for plot development. Some templates may have various visual special effects, such as transition effects, text titles and the like, and the system can intelligently select and apply appropriate special effects according to the specific conditions of the video clips, so that the ornamental value and the professional degree of the finished video are improved. The splicing and editing stage is the key for generating high-quality video, and not only involves the operation of the technical level, but also includes the consideration of creative aspects, so that the best effect of the final work in visual, auditory and emotion expression can be ensured.
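As an illustration of the splicing and trimming step, the sketch below concatenates the chosen template with the video clip and trims the result to a target duration, assuming the moviepy 1.x API. The looping, transition, and audio-mixing rules described above are omitted, so this is a simplified placeholder rather than the full editing pipeline.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice_with_template(template_path: str, clip_path: str,
                         target_duration: float, out_path: str) -> None:
    """Sketch: splice the highest-priority template with the first video clip."""
    template = VideoFileClip(template_path)
    clip = VideoFileClip(clip_path)

    # Concatenate template and clip on the timeline.
    merged = concatenate_videoclips([template, clip], method="compose")

    # Trim to the target duration if the spliced result runs long.
    if merged.duration > target_duration:
        merged = merged.subclip(0, target_duration)

    merged.write_videofile(out_path, codec="libx264", audio_codec="aac")
```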
After the splicing and editing are completed, the system can conduct final auditing on the generated finished video to ensure that the finished video meets the expected standard. The auditing content comprises checking whether the video picture is smooth, no obvious jump or abrupt, confirming that the sound is well synchronized with the picture, matching background music and words properly, evaluating the overall impression of the finished video, and ensuring unified style, clear theme and clear expression. After a strict audit, the system will save the final version as a specified format and notify the user that a new finished product is available. Meanwhile, the user is prompted to score the used templates, and precious user experience feedback is collected so as to improve algorithms and services in the future.
As described above, the internet-based video auto-generation system 300 according to the embodiment of the present application may be implemented in various wireless terminals, for example, a server or the like having an internet-based video auto-generation algorithm. In one possible implementation, the internet-based video auto-generation system 300 according to an embodiment of the present application may be integrated into a wireless terminal as a software module and/or hardware module. For example, the Internet-based video auto-generation system 300 may be a software module in the operating system of the wireless terminal or may be an application developed for the wireless terminal, although the Internet-based video auto-generation system 300 may be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the internet-based video auto-generation system 300 and the wireless terminal may be separate devices, and the internet-based video auto-generation system 300 may be connected to the wireless terminal through a wired and/or wireless network and transmit interactive information in an agreed data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (7)
1. An Internet-based video automatic generation system, comprising: a video clip extraction module for extracting a first video clip from a local video database; a type tag analysis module for analyzing the first video clip to obtain a type tag of the first video clip; a template determination module for extracting template tags of a plurality of templates from a video material library and, based on the matching degree between the type tag of the first video clip and the template tag of each template, prioritizing the plurality of template tags to obtain a plurality of ordered templates arranged by priority; and a slice generation module for extracting the ordered template with the highest priority from the plurality of ordered templates and splicing and clipping the highest-priority ordered template with the first video clip to obtain a finished slice; characterized in that the type tag analysis module comprises: a video clip sampling unit for performing key frame sampling on the first video clip to obtain a time queue of first video clip image key frames; a picture content identification unit for identifying the picture content of each first video clip image key frame in the time queue to obtain a time queue of picture content key frame semantic coding features; a content enhancement unit for performing sequence condition distinguishable enhancement on the time queue of picture content key frame semantic coding features to obtain a time queue of picture content semantic distinguishable enhancement features; and a type tag generation unit for performing picture content dynamic coding on the time queue of picture content semantic distinguishable enhancement features to obtain a picture content dynamic semantic coding representation, and obtaining the type tag of the first video segment based on the picture content dynamic semantic coding representation;
The content enhancement unit comprises: a field mapping subunit for performing field mapping on each picture content key frame semantic coding feature vector in the time queue of picture content key frame semantic coding feature vectors to obtain a time queue of incoming picture content key frame semantic coding feature vectors; a picture content key frame semantic essence calculation subunit for calculating the essence characteristics of the time queue of incoming picture content key frame semantic coding feature vectors to obtain a picture content key frame semantic field essence feature vector; and a distinguishable enhancement subunit for performing distinguishable enhancement, with the picture content key frame semantic field essence feature vector as a conditional feature vector, on each picture content key frame semantic coding feature vector in the time queue of picture content key frame semantic coding feature vectors to obtain a time queue of picture content semantic distinguishable enhancement coding vectors as the time queue of picture content semantic distinguishable enhancement features.
2. The system according to claim 1, wherein the picture content identification unit is configured to input each first video clip image key frame in the time queue of first video clip image key frames into a deep neural network model-based picture content identifier to obtain a time queue of picture content key frame semantic coding feature vectors as the time queue of picture content key frame semantic coding features.
3. The Internet-based automatic video generation system according to claim 2, wherein the deep neural network model-based picture content recognizer is a convolutional neural network model-based picture content recognizer.
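A minimal convolutional recognizer consistent with claims 2-3 might look as follows; the backbone depth, channel counts, and 256-dimensional output are assumptions, since the claims only require a CNN-based picture content recognizer.

```python
# Sketch of a CNN-based picture content recognizer: one feature vector per key frame.
import torch
import torch.nn as nn

class PictureContentRecognizer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # global average pooling
        )
        self.proj = nn.Linear(128, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) stacked key frames of one video segment
        feats = self.backbone(frames).flatten(1)     # (T, 128)
        return self.proj(feats)                      # (T, dim) time queue of semantic coding feature vectors
```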
4. The Internet-based automatic video generation system according to claim 3, wherein the picture content key frame semantic essence calculation subunit comprises a field depth factor calculation secondary subunit and a field essence feature calculation secondary subunit, wherein the field depth factor calculation secondary subunit is used for calculating a field depth factor of each field-mapped picture content key frame semantic coding feature vector in the time queue of field-mapped picture content key frame semantic coding feature vectors to obtain a time queue of picture content key frame semantic field depth factors, and the field essence feature calculation secondary subunit is used for calculating the picture content key frame semantic field essence feature vector from the time queue of field-mapped picture content key frame semantic coding feature vectors based on the time queue of picture content key frame semantic field depth factors.
5. The Internet-based automatic video generation system according to claim 4, wherein the field depth factor calculation secondary subunit is configured to process the time queue of field-mapped picture content key frame semantic coding feature vectors using a softmax function to obtain the time queue of picture content key frame semantic field depth factors.
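Claims 4-5 can be read as a softmax-weighted aggregation over the time queue, sketched below. Treating the softmax as producing one scalar depth factor per key frame (rather than acting over vector components), and the toy scoring vector, are interpretive assumptions.

```python
# Hedged numeric reading of claims 4-5: softmax over the time queue yields per-frame
# field depth factors, and their weighted sum gives the semantic field essence vector.
import torch
import torch.nn.functional as F

def essence_vector(mapped_queue: torch.Tensor, score_weights: torch.Tensor) -> torch.Tensor:
    # mapped_queue: (T, dim) field-mapped feature vectors; score_weights: (dim,) toy scoring vector
    depth = F.softmax(mapped_queue @ score_weights, dim=0)   # (T,) field depth factors over time
    return depth @ mapped_queue                              # (dim,) semantic field essence feature vector
```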
6. The Internet-based automatic video generation system according to claim 5, wherein the type tag generating unit comprises a picture content dynamic coding subunit for inputting the time queue of picture content semantic distinguishable enhancement feature vectors into an RNN model-based picture content dynamic encoder to obtain a picture content dynamic semantic coding vector, and a first video segment type determining subunit for obtaining the type tag of the first video segment based on the picture content dynamic semantic coding vector.
7. The Internet-based automatic video generation system according to claim 6, wherein the first video segment type determining subunit is configured to input the picture content dynamic semantic coding vector into a classifier-based video identifier to obtain the type tag of the first video segment.
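Claims 6-7 together describe sequence encoding followed by classification. The sketch below uses a GRU as one instance of an RNN and a linear classifier; the cell type, hidden size, and number of candidate type tags are assumptions, not claimed specifics.

```python
# Sketch of the type tag generating unit: RNN dynamic encoder plus classifier.
import torch
import torch.nn as nn

class TypeTagGenerator(nn.Module):
    def __init__(self, dim: int = 256, hidden: int = 256, num_tags: int = 20):
        super().__init__()
        self.encoder = nn.GRU(input_size=dim, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_tags)

    def forward(self, enhanced_queue: torch.Tensor) -> torch.Tensor:
        # enhanced_queue: (T, dim) time queue of picture content semantic distinguishable enhancement features
        _, h_last = self.encoder(enhanced_queue.unsqueeze(0))   # final hidden state summarizes the dynamics
        dynamic_code = h_last[-1]                               # (1, hidden) dynamic semantic coding vector
        logits = self.classifier(dynamic_code)                  # (1, num_tags)
        return logits.argmax(dim=-1)                            # predicted type tag index
```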
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411827795.3A CN119299802B (en) | 2024-12-12 | 2024-12-12 | Video automatic generation system based on Internet |
Publications (2)
Publication Number | Publication Date |
---|---|
CN119299802A CN119299802A (en) | 2025-01-10 |
CN119299802B true CN119299802B (en) | 2025-03-04 |
Family
ID=94155457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411827795.3A Active CN119299802B (en) | 2024-12-12 | 2024-12-12 | Video automatic generation system based on Internet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119299802B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119429885A (en) * | 2025-01-13 | 2025-02-14 | 揽胜重工有限公司 | Intelligent construction hoist control system |
CN119454052A (en) * | 2025-01-16 | 2025-02-18 | 虎刺帕数智科技(浙江)有限公司 | Intelligent sleep state monitoring method and device based on brain waves |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117376502A (en) * | 2023-12-07 | 2024-01-09 | 翔飞(天津)智能科技有限公司 | Video production system based on AI technology |
CN118540554A (en) * | 2024-05-24 | 2024-08-23 | 上海明殿文化传播有限公司 | Video production system and method based on AI technology |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10204291B2 (en) * | 2015-08-31 | 2019-02-12 | International Business Machines Corporation | System, method, and recording medium for detecting events for interest from mobile |
CN111294646B (en) * | 2020-02-17 | 2022-08-30 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and storage medium |
CN118741247A (en) * | 2024-06-18 | 2024-10-01 | 成都唐米科技有限公司 | A video automatic generation method |
Also Published As
Publication number | Publication date |
---|---|
CN119299802A (en) | 2025-01-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN119299802B (en) | Video automatic generation system based on Internet | |
US9047376B2 (en) | Augmenting video with facial recognition | |
CN111444357B (en) | Content information determination method, device, computer equipment and storage medium | |
CN112511854B (en) | Live video highlight generation method, device, medium and equipment | |
CN111866585A (en) | Video processing method and device | |
CN113709384A (en) | Video editing method based on deep learning, related equipment and storage medium | |
CN112163122A (en) | Method and device for determining label of target video, computing equipment and storage medium | |
CN112749608A (en) | Video auditing method and device, computer equipment and storage medium | |
US12056927B2 (en) | Systems and methods for generating composite media using distributed networks | |
US20210117471A1 (en) | Method and system for automatically generating a video from an online product representation | |
US20230066331A1 (en) | Method and system for automatically capturing and processing an image of a user | |
WO2024235271A1 (en) | Movement generation method and apparatus for virtual character, and construction method and apparatus for movement library of virtual avatar | |
CN114254158A (en) | Video generation method and device, and neural network training method and device | |
CN110418191A (en) | A kind of generation method and device of short-sighted frequency | |
CN117880443A (en) | Script-based multi-mode feature matching video editing method and system | |
CN116170626A (en) | Video editing method, device, electronic equipment and storage medium | |
CN115909390B (en) | Method, device, computer equipment and storage medium for identifying low-custom content | |
CN118741176B (en) | Advertisement placement information processing method, related device and medium | |
US20200081923A1 (en) | Data search method and data search system thereof | |
CN118695044A (en) | Method, device, computer equipment, readable storage medium and program product for generating promotional video | |
CN118761812A (en) | A method and system for labeling advertising material data based on AI | |
US20230394820A1 (en) | Method and system for annotating video scenes in video data | |
US12211279B2 (en) | System and method for artificial intelligence-based media matching for automating downstream media workflows | |
CN116980665A (en) | Video processing method, device, computer equipment, medium and product | |
CN116261009A (en) | Video detection method, device, equipment and medium for intelligently converting video audience |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |