
CN119364143A - A method and system for generating audio-driven real-person mouth-type broadcast video - Google Patents


Info

Publication number
CN119364143A
CN119364143A
Authority
CN
China
Prior art keywords
video
person
mouth shape
real
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411525978.XA
Other languages
Chinese (zh)
Inventor
郑伟
王培元
修志远
尹青山
王茂帅
房兰涛
赵启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Langchao Ultra Hd Intelligent Technology Co ltd
Original Assignee
Shandong Langchao Ultra Hd Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Langchao Ultra Hd Intelligent Technology Co ltd filed Critical Shandong Langchao Ultra Hd Intelligent Technology Co ltd
Priority to CN202411525978.XA priority Critical patent/CN119364143A/en
Publication of CN119364143A publication Critical patent/CN119364143A/en
Pending legal-status Critical Current


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method and system for generating audio-driven real-person mouth-shape broadcast video, relating to the technical field of video generation. The method comprises real-person material acquisition, video cropping and expansion, background synthesis, video reverse-order processing, face-region cropping, audio material synthesis, and face mouth-shape driving. Based on a basic real-person video obtained by shooting and collection, and combined with text-to-speech synthesis, the method generates mouth-shape-driven real-person broadcast video for arbitrary scenes, and can be applied in the fields of digital humans, virtual anchors, and voice assistants. Through a series of innovations, the technical scheme of the invention not only improves the efficiency and quality of real-person mouth-shape broadcast video generation but also widens its applicability across scenes, brings users a richer and more realistic interactive experience, and provides strong technical support for the development of industries such as digital-human broadcasting.

Description

Audio-driven real-person mouth-shape broadcast video generation method and system
Technical Field
The invention belongs to the technical field of video generation and relates to an audio-driven real-person mouth-shape broadcast video generation method and system.
Background
The face mouth-shape driving technique is an advanced artificial intelligence application that generates virtual-character facial expressions and mouth-shape changes synchronized with speech in real time by analyzing speech signals. The technique combines audio processing, deep learning, facial animation generation, and rendering synthesis, allowing virtual characters to "perform" in a natural and realistic manner.
In practical applications, face mouth-shape driving is widely used to create virtual anchors, virtual teachers for online education platforms, virtual assistants in customer service, and animation and game characters in the entertainment industry. It not only improves production efficiency and reduces cost, but also provides users with a more immersive and interactive experience.
The invention discloses an audio-driven real-person mouth-shape broadcast video generation method that can be applied in the digital-human field, providing a rapid method for generating audio-driven mouth-shape broadcasts for 2D real-person digital humans.
A 2D digital human is an avatar presented on a two-dimensional plane that simulates a real-person-like image through computer graphics (CG). 2D digital humans come in two forms, 2D real-person and 2D cartoon; neither requires complex three-dimensional modeling, as both are generated from picture and video material, making them relatively simple and inexpensive to produce.
Making a 2D digital human first requires creating the character image, which typically involves character design and material collection, followed by the character's performance, including speech generation and animation generation. The mouth movements of a 2D digital human can be realized by intelligent synthesis techniques that train a model on the mapping from input text to output audio and visual information, thereby driving the mouth-shape animation. For example, the Wav2Lip model can synthesize speech-driven 2D digital-human mouth movements without extra training, modifying the mouth shape in a video while keeping the rest of the original video unchanged.
Actions other than mouth shape, such as facial expressions, blinking, and head sway, are typically achieved by looped playback of prerecorded video or animation. These actions can be selected and played through a random or scripted strategy to produce a smooth animation effect.
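As a minimal illustrative sketch of the random/scripted playback strategy described above (clip names are hypothetical, and a real system would decode and queue actual video segments rather than file names):

```python
import random

def pick_idle_clip(clips, rng, script=None):
    """Select the next prerecorded idle clip (blink, nod, sway, ...).

    If a script (ordered list of clip names) is supplied, follow it;
    otherwise pick uniformly at random for a natural, non-repetitive feel.
    """
    if script:
        return script.pop(0)     # scripted strategy: play in authored order
    return rng.choice(clips)     # random strategy

clips = ["blink.mp4", "nod.mp4", "sway.mp4"]   # hypothetical file names
rng = random.Random(0)
sequence = [pick_idle_clip(clips, rng) for _ in range(5)]
```

A production player would additionally crossfade between clips so the looped playback appears continuous.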
The prior art suffers from defects such as high cost, insufficient realism and immersion, limited application scenarios, and a lack of AI-driven generation.
Disclosure of Invention
The invention provides an audio-driven real-person mouth-shape broadcast video generation method and system which, based on a basic real-person video obtained by shooting and collection and combined with text-to-speech synthesis, generate mouth-shape-driven real-person broadcast video for arbitrary scenes, and can be used in the fields of digital humans, virtual anchors, and voice assistants.
The invention provides an audio-driven real-person mouth-shape broadcast video generation method comprising the following steps:
(1) Real-person material acquisition: engage an actor or model and perform professional video shooting to obtain high-definition real-person raw material, ensuring the video meets the following requirements:
1.1. The duration is 6-12 seconds with the lips stationary;
1.2. The video aspect ratio is 9:16;
1.3. The video resolution is 1080P or 4K;
1.4. The frame rate is 25 fps;
1.5. The format is MP4;
1.6. The actor's performance accords with the client's expectations and the corresponding scene requirements;
(2) Video cropping and expansion: crop and resize the original video with video editing software to meet the requirements of the target playback platform, using a high-quality scaling algorithm to avoid image distortion;
(3) Background synthesis: using image processing techniques including semantic segmentation, composite the cropped character with a new background, ensuring the new background matches the character's illumination, color, and style;
(4) Video reverse-order processing: process the composited virtual-background character video in reverse order, reversing the video frames and adding inter-frame transition effects to achieve seamless transitions between scenes;
(5) Face-region cropping: crop the face region in the video with an automatic face detection algorithm or by manual editing, and record the coordinates of the cropped region;
(6) Audio material synthesis: convert text content into speech using professional speech synthesis software, taking the intonation, rhythm, and emotion of the speech into account so that it matches the character's expressions and actions;
(7) Face mouth-shape driving: analyze the speech features in the audio material, map those features onto the face-region video with a mouth-shape driving algorithm, precisely adjust the lip shape and facial muscle movement to simulate real speaking actions, and re-composite the processed face region into the original video.
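The seven steps above can be read as a linear pipeline. The sketch below composes them with placeholder stages — every function here is an illustrative stand-in (operating on strings for brevity), not the claimed implementation:

```python
# Placeholder stages; each lambda stands in for the real processing of
# the correspondingly numbered step (all names and values are illustrative).
crop_and_expand      = lambda v: v + "|cropped"                       # step (2)
composite_background = lambda v, bg: v + "|bg:" + bg                  # step (3)
reverse_and_loop     = lambda v: v + "|looped"                        # step (4)
crop_face_region     = lambda v: (v + "|face", (100, 50, 256, 256))   # step (5)
synthesize_speech    = lambda text: "audio:" + text                   # step (6)
drive_mouth          = lambda face, audio: face + "|" + audio         # step (7)
paste_back           = lambda v, face, box: v + "|pasted@" + str(box)

def generate_broadcast_video(raw_video, background, text):
    """Compose the seven steps of the claimed method into one pipeline."""
    clip = crop_and_expand(raw_video)
    composited = composite_background(clip, background)
    looped = reverse_and_loop(composited)
    face, box = crop_face_region(looped)      # coordinates recorded here
    audio = synthesize_speech(text)
    driven = drive_mouth(face, audio)
    return paste_back(looped, driven, box)    # re-composite at recorded box

result = generate_broadcast_video("raw.mp4", "studio.png", "hello")
```

The value of writing the flow this way is that each stage can be replaced independently (e.g. swapping the mouth-driving model) without touching the rest of the pipeline.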
Preferably, the video cropping and expansion step includes improving the resolution of the video and adjusting the picture proportion.
Preferably, the background synthesis step includes using advanced image processing techniques to identify and separate the character in the video, and compositing the separated character layer with a new background image or video to create a brand-new scene.
Preferably, the face mouth-shape driving step includes positioning the driven mouth shape back to the coordinates of the original video.
An audio-driven real-person mouth-shape broadcast video generation system comprises:
a video acquisition unit for acquiring high-definition real-person raw material;
a video editing unit for cropping and expanding video;
a background synthesis unit for compositing the character with a new background;
a video reverse-order processing unit for reversing the order of video frames;
a face cropping unit for cropping the face region;
a speech synthesis unit for preparing audio material;
and a mouth-shape driving unit for driving the mouth shape of the face region.
The beneficial effects of the invention are as follows:
(1) The invention significantly improves digital-content production efficiency, reduces dependence on professional equipment and personnel, and lowers production cost through automated video cropping, expansion, and high-quality scaling algorithms and an integrated real-person mouth-shape broadcast video generation system.
(2) The invention enhances realism and immersion: by adopting advanced background synthesis and video reverse-order processing, it creates a realistic fusion of the real character with a new background and smooth scene transitions, providing users a more realistic immersive experience.
(3) Through precise face-region cropping and face mouth-shape driving, the invention realizes highly personalized and natural interaction by the broadcast character, better simulating real human facial expressions and mouth shapes and improving the personalization and naturalness of the service.
(4) The technical scheme can be used in the fields of entertainment and education and extended to industries such as finance, medical care, and travel, providing customized virtual anchors, digital humans, and voice assistants for different fields and promoting the digital transformation of related industries.
(5) The invention uses AI technology, especially deep learning models, to drive the broadcast video generation flow and improve the intelligent interaction capability of digital humans, voice assistants, and the like.
Through a series of innovations, the technical scheme of the invention not only improves the efficiency and quality of real-person mouth-shape broadcast video generation but also widens its applicability across scenes, brings users a richer and more realistic interactive experience, and provides strong technical support for the development of industries such as digital-human broadcasting.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order that the above objects, features, and advantages of the invention may be readily understood, a more particular description is given below with reference to the appended drawings. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention; however, the invention may be embodied in many other forms and similarly modified by those skilled in the art without departing from its spirit, so the invention is not limited to the specific embodiments disclosed below.
The accompanying drawing illustrates a specific embodiment of the audio-driven real-person mouth-shape broadcast video generation method. This embodiment includes the following steps.
Real-person material acquisition: before starting the mouth-shape broadcast video generation process, the real-person raw material is prepared. This involves engaging an actor or model and performing professional video shooting according to project requirements. During shooting, the video quality must meet high-definition standards so that sufficient detail can be captured in subsequent processing. In addition, the actor's expressions, actions, and speech should be natural and fluent so that the generated result is more realistic and natural. A simple background should also be used during shooting so that the person can more easily be separated from the background in subsequent steps. The reference shooting requirements are as follows:
1. Duration: a 6-12 s lips-at-rest video;
2. Video size: the optimal aspect ratio is 9:16;
3. Video resolution: 1080P or 4K;
4. Frame rate: 25 fps;
5. Format: MP4;
6. The actor's performance throughout the submitted video meets the client's expectations and the corresponding scene requirements.
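A metadata check against these reference requirements could look like the following sketch (the field names, and treating 1080P/4K as portrait heights of 1920/3840, are assumptions for illustration):

```python
def meets_shooting_spec(meta):
    """Check a clip's metadata against the reference shooting requirements."""
    return (6 <= meta["duration_s"] <= 12          # 6-12 s lips-at-rest clip
            and meta["width"] * 16 == meta["height"] * 9   # 9:16 portrait
            and meta["height"] in (1920, 3840)     # 1080P or 4K (tall side)
            and meta["fps"] == 25
            and meta["format"] == "mp4")

# Example clip metadata (hypothetical values from a probe of the file)
clip = {"duration_s": 10, "width": 1080, "height": 1920,
        "fps": 25, "format": "mp4"}
```

In practice these fields would come from a media prober such as ffprobe; requirement 6 (performance quality) is subjective and still needs human review.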
According to these requirements, shoot the real-person video: record roughly 10 seconds of green-screen video with expressions and actions to serve as the original character material. This is the basis of the whole video generation; ensure the lighting, angle, and background of the recording environment all meet high standards so that the highest-quality real-person motion and expression are captured.
Video cropping and expansion: after the original video is acquired, the next task is to crop and resize it. First, video editing software is used to crop precisely according to the position and size of the person in the frame, keeping only the necessary person region. Then the cropped video is expanded to the size required by the target playback platform, which may involve resolution improvement and aspect-ratio adjustment. In this process, a high-quality scaling algorithm is used to avoid image distortion.
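The size-expansion arithmetic can be sketched as follows; the actual resampling would use a high-quality kernel (for example Lanczos or bicubic, as available in common video tools), which this sketch only notes rather than performs:

```python
def expand_to_target(src_w, src_h, target_h):
    """Compute the output size when scaling a portrait crop to a target
    height, preserving aspect ratio.  The real pixel resampling would use
    a high-quality kernel (e.g. Lanczos) to avoid the distortion the
    method warns about."""
    scale = target_h / src_h
    w = round(src_w * scale)
    w -= w % 2           # even dimensions keep video encoders happy
    return w, target_h

# e.g. upscaling a 540x960 proxy crop to 1080P portrait
expand_to_target(540, 960, 1920)   # → (1080, 1920)
```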
Cut the character out of the original video according to the character's size and coordinate position, then expand the video to the target size. After recording, the original video is carefully edited: unwanted background elements are removed and the character is precisely cropped so that only the necessary person region remains. The cropped video is then adjusted to the desired target size to accommodate different playback platforms and viewing devices.
Background synthesis: after the character video has been cropped, the character is composited with a new background. This typically involves advanced image processing techniques, such as semantic segmentation, to identify and isolate the person in the video. The separated character layer is then combined with a new background image or video to create a brand-new scene. This step must ensure that the new background matches the character's illumination, color, and style to achieve a realistic visual effect.
Using semantic segmentation, the character's pixels are precisely extracted from the original material and composited with a preset scene picture. This process requires a high degree of technical expertise to ensure the fusion of the character with the new background is natural and seamless, generating a virtual-scene character video with the specified background that matches the original video's length.
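Conceptually, compositing with a segmentation mask is per-pixel alpha blending. A toy grayscale sketch (a real pipeline operates on RGB frames and soft mattes produced by a segmentation model):

```python
def composite(fg, bg, mask):
    """Blend foreground (person) onto background using a soft mask in [0, 1].

    fg, bg: nested lists of grayscale pixel values; mask: same shape,
    1.0 = person pixel (from the segmentation model), 0.0 = background.
    """
    return [[m * f + (1 - m) * b
             for f, b, m in zip(frow, brow, mrow)]
            for frow, brow, mrow in zip(fg, bg, mask)]

person    = [[200, 200], [200, 200]]   # toy 2x2 "person" frame
new_scene = [[10, 10], [10, 10]]       # toy 2x2 background
seg_mask  = [[1.0, 0.0], [0.5, 1.0]]   # soft edge at (1, 0)
frame = composite(person, new_scene, seg_mask)
```

The soft (fractional) mask values at the person's edges are what make the fusion look seamless; a hard 0/1 mask would produce the jagged outlines the description warns against.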
Video reverse-order processing: to ensure fluency when switching between different scenes, the composited virtual-scene character video is processed in reverse order. This includes reversing the order of the video frames and optionally adding inter-frame transition effects, so that playback can transition seamlessly from the current scene to the next.
Perform video expansion and reverse-order processing on the composited virtual-scene character video to ensure seamless connection when the video is switched during playback.
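One simple way the reverse ordering yields seamless playback is to append the reversed frames so the clip plays forward and then backward, returning to its first frame without a jump cut — a sketch:

```python
def seamless_loop(frames):
    """Append the reversed frames (minus the two endpoints) so the clip
    loops without a visible jump: ..., f[n-1], f[n-2], ..., f[1], then
    wraps naturally back to f[0]."""
    return frames + frames[-2:0:-1]

clip = ["f0", "f1", "f2", "f3"]
looped = seamless_loop(clip)   # ['f0', 'f1', 'f2', 'f3', 'f2', 'f1']
```

Dropping the endpoint frames from the reversed half avoids the doubled frame a naive `frames + reversed(frames)` would produce at each turnaround.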
Face-region cropping: after the reverse-order processing, the face region of the composited virtual-scene character video is cropped. This step may be implemented by an automatic face detection algorithm or by manual editing. Whichever method is used, the cropped face region must remain consistent and accurate throughout the video. At the same time, the coordinates of the cropped region are recorded; they will later be used to position the driven mouth shape back into the original video.
On the basis of the composited video, the face region is precisely cropped. This step is critical because it directly affects lip positioning, the expressiveness of the final character, and the viewer's perception. Fast cropping can be achieved with automatic face detection, or manual cropping can be used for finer control. The cropped face video is used for the subsequent mouth-shape driving.
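Cropping with recorded coordinates can be sketched as below; the (x, y, w, h) box format is an assumption, and real frames would be image arrays rather than toy integer grids:

```python
def crop_face(frame, box):
    """Crop the detected face region and return it together with its
    coordinates, which are recorded so the driven mouth shape can be
    pasted back into the original frame later."""
    x, y, w, h = box
    face = [row[x:x + w] for row in frame[y:y + h]]
    return face, box

# Toy 6x6 "frame" where pixel value encodes (row, col) for easy checking
frame = [[c + 10 * r for c in range(6)] for r in range(6)]
face, coords = crop_face(frame, (2, 1, 3, 2))   # box from a face detector
```

Keeping the box fixed across frames (or smoothing detector output over time) is what gives the "consistent and accurate" crop the description requires.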
Audio material synthesis: the text to be broadcast is synthesized into an audio file using speech synthesis. The audio material is prepared while the face cropping is performed. This typically involves converting text content to speech, which can be accomplished with professional speech synthesis software. During synthesis, the intonation, rhythm, and emotion of the speech are considered to ensure the final audio matches the character's expressions and actions.
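Before synthesis it can be useful to pre-check that the script's spoken length will fit the clip. The sketch below assumes a speaking rate of roughly 4 characters per second for Chinese narration — an assumption for illustration; a real TTS engine reports the exact duration of the audio it produces:

```python
def estimated_speech_seconds(text, chars_per_second=4.0):
    """Rough length estimate for a synthesized narration, assuming a
    fixed speaking rate (an assumption; real TTS engines report exact
    duration).  Useful for pre-checking that a script fits the 6-12 s
    base clip before synthesis."""
    return len(text) / chars_per_second

estimated_speech_seconds("欢迎收看今天的新闻播报")  # 11 chars at 4 chars/s
```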
Face mouth-shape driving: the final and most critical step in the broadcast video generation process. First, the speech features in the audio material are analyzed; a mouth-shape driving algorithm then maps these features onto the face-region video. This requires precise adjustment of the lip shape and facial muscle movements to simulate real speaking. After the mouth-shape driving is complete, the processed face region is positioned and composited back into the original video, completing the generation of the real-person mouth-shape broadcast video.
Using the mouth-shape driving algorithm, the mouth shape of the face region is dynamically adjusted according to the audio file. This process involves complex facial animation techniques to ensure the mouth shape matches the speech output.
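The paste-back described above — writing the driven face region into the original frame at the coordinates recorded during cropping — can be sketched as:

```python
def paste_face_back(frame, face, box):
    """Write the mouth-driven face region back into the original frame at
    the (x, y, w, h) coordinates recorded during cropping."""
    x, y, w, h = box
    out = [row[:] for row in frame]        # copy; don't mutate the input
    for r in range(h):
        out[y + r][x:x + w] = face[r]      # overwrite the face rows
    return out

frame = [[0] * 4 for _ in range(4)]        # toy 4x4 original frame
driven = [[9, 9], [9, 9]]                  # toy 2x2 driven face region
result = paste_face_back(frame, driven, (1, 1, 2, 2))
```

In a real pipeline the seam would additionally be feathered (blended with a soft border) so the re-composited region does not show a hard edge.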
The technical key points of the invention include the following.
High-definition video acquisition ensures the high quality of the real-person material, providing clear, detail-rich raw data for generating the real-person mouth-shape broadcast video. This solves the problem of insufficient material quality in conventional broadcast video generation and improves the realism and detail of the real-person broadcast scene.
The invention can optimize the video for the target playback platform, including resolution improvement and aspect-ratio adjustment, while avoiding image distortion. This improves the adaptability and playback quality of the generated video.
Advanced background synthesis: the invention adopts image processing techniques such as semantic segmentation to realistically composite the character with a new background. Precise separation of the character from the background and matching with the new background improve the video's visual effect and scene authenticity.
Video reverse-order processing achieves seamless transitions of the broadcast video between scenes; reversing the frame order and adding inter-frame transition effects improve the fluency and viewing experience of the broadcast video.
The invention uses an automatic face detection algorithm to crop the face region in the video accurately, improving the accuracy and efficiency of face-region cropping and providing high-quality base data for subsequent mouth-shape driving and facial-expression generation.
The invention converts text content into speech with a speech synthesis algorithm, taking intonation, rhythm, and emotion into account to match the character's expressions and actions. This improves the authenticity and expressiveness of the speech and makes the character's broadcast more vivid and natural.
The invention uses a mouth-shape driving algorithm to map the speech features in the audio material precisely onto the face-region video, accurately simulating lip shape and facial muscle movement. This improves the realism of the mouth shape and facial expression while the character speaks, making the broadcast more lifelike.
The invention provides an integrated real-person mouth-shape broadcast video generation system comprising modules for video acquisition, editing, background synthesis, reverse-order processing, face cropping, speech synthesis, and mouth-shape driving; its systematic design and integration automate the broadcast video generation process and make it efficient.
The invention achieves a breakthrough in real-person mouth-shape broadcast video generation, improving the realism, expressiveness, and viewing experience of character broadcasts, with important practical value and market prospects.

Claims (5)

1. An audio-driven real-person mouth-shape broadcast video generation method, characterized by comprising the following steps:
(1) Real-person material acquisition: engage an actor or model and perform professional video shooting to obtain high-definition real-person raw material, ensuring the video meets the following requirements:
the duration is 6-12 seconds with the lips stationary;
the video aspect ratio is 9:16;
the video resolution is 1080P or 4K;
the frame rate is 25 fps;
the format is MP4;
the actor's performance accords with the client's expectations and the corresponding scene requirements;
(2) Video cropping and expansion: crop and resize the original video with video editing software to meet the requirements of the target playback platform, using a high-quality scaling algorithm to avoid image distortion;
(3) Background synthesis: using image processing techniques including semantic segmentation, composite the cropped character with a new background, ensuring the new background matches the character's illumination, color, and style;
(4) Video reverse-order processing: process the composited virtual-background character video in reverse order, reversing the video frames and adding inter-frame transition effects to achieve seamless transitions between scenes;
(5) Face-region cropping: crop the face region in the video with an automatic face detection algorithm or by manual editing, and record the coordinates of the cropped region;
(6) Audio material synthesis: convert text content into speech using professional speech synthesis software, taking the intonation, rhythm, and emotion of the speech into account so that it matches the character's expressions and actions;
(7) Face mouth-shape driving: analyze the speech features in the audio material, map those features onto the face-region video with a mouth-shape driving algorithm, precisely adjust the lip shape and facial muscle movement to simulate real speaking actions, and re-composite the processed face region into the original video.
2. The audio-driven real-person mouth-shape broadcast video generation method according to claim 1, wherein the video cropping and expansion step includes improving the resolution of the video and adjusting the picture proportion.
3. The audio-driven real-person mouth-shape broadcast video generation method according to claim 1, wherein the background synthesis step includes identifying and separating the person in the video using advanced image processing techniques, and compositing the separated person layer with a new background image or video to create a brand-new scene.
4. The audio-driven real-person mouth-shape broadcast video generation method according to claim 1, wherein the face mouth-shape driving step includes positioning the driven mouth shape back to the coordinates of the original video.
5. An audio-driven real-person mouth-shape broadcast video generation system, characterized by comprising:
a video acquisition unit for acquiring high-definition real-person raw material;
a video editing unit for cropping and expanding video;
a background synthesis unit for compositing the character with a new background;
a video reverse-order processing unit for reversing the order of video frames;
a face cropping unit for cropping the face region;
a speech synthesis unit for preparing audio material;
and a mouth-shape driving unit for driving the mouth shape of the face region.
CN202411525978.XA 2024-10-30 2024-10-30 A method and system for generating audio-driven real-person mouth-type broadcast video Pending CN119364143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411525978.XA CN119364143A (en) 2024-10-30 2024-10-30 A method and system for generating audio-driven real-person mouth-type broadcast video


Publications (1)

Publication Number Publication Date
CN119364143A true CN119364143A (en) 2025-01-24

Family

ID=94319421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411525978.XA Pending CN119364143A (en) 2024-10-30 2024-10-30 A method and system for generating audio-driven real-person mouth-type broadcast video

Country Status (1)

Country Link
CN (1) CN119364143A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination