
CN112562720B - Lip-sync video generation method, device, equipment and storage medium - Google Patents

Lip-sync video generation method, device, equipment and storage medium Download PDF

Info

Publication number
CN112562720B
CN112562720B · CN202011372011.4A
Authority
CN
China
Prior art keywords
image
data
lip
network
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011372011.4A
Other languages
Chinese (zh)
Other versions
CN112562720A (en)
Inventor
李�权
王伦基
叶俊杰
成秋喜
胡玉针
李嘉雄
朱杰
刘华清
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc and Research Institute Of Tsinghua Pearl River Delta
Priority claimed from CN202011372011.4A
Publication of CN112562720A
Application granted
Publication of CN112562720B
Legal status: Active (granted)


Classifications

    • G10L 21/10 — Transforming speech into visible information (transformation of speech into a non-audible representation)
    • G06V 40/161 — Human faces: detection; localisation; normalisation
    • G10L 21/14 — Transforming into visible information by displaying frequency domain information
    • G10L 21/18 — Details of the transformation process
    • G10L 25/57 — Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals
    • G10L 2021/105 — Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a lip-sync video generation method, device, equipment and storage medium. After original video data are obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data. A generating network, a lip synchronous judging network and an image quality judging network are then trained from the first data and the second data, and a character lip generating model is constructed from these three networks. Finally, input sequence pictures are processed by the character lip generating model to generate lip-synchronized image data. The invention can accurately generate the lip image of a person speaking in a video and can be widely applied in the technical field of video data processing.

Description

Lip-sync video generation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of video data processing, in particular to a lip-sync video generation method, a lip-sync video generation device, lip-sync video generation equipment and a storage medium.
Background
With the increasing diversity of video content, new demands are placed on video creation, and enabling these videos to be viewed in different languages has become a critical problem that urgently needs to be solved. Examples include a series of lectures, a major news broadcast, an excellent movie, or even an entertaining animation. If such videos are translated into the desired target language, viewers in more language environments can better watch and access them. The key problem in translating a talking-face video, or in creating a new video in this way, is to correct the mouth shape so that it matches the target speech.
Some current techniques can achieve character lip generation only for a specific character, in still images or in videos whose motion and background do not change substantially from what was seen in training. In face videos of unrestricted speakers with complex, dynamic backgrounds, however, the lip motion of an arbitrary identity cannot be changed accurately, so the lip region of the person in the video is not synchronized with the new audio.
Disclosure of Invention
In view of this, embodiments of the present invention provide a lip-sync video generation method, apparatus, device and storage medium with high generation accuracy.
One aspect of the present invention provides a lip-sync video generation method, including:
acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data, wherein the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
and processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the method further comprises preprocessing the voice data and the image data in the original video data.
Specifically, the preprocessing of the voice data in the original video data comprises:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum.
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
In some embodiments, the generating network comprises a sound encoder, an image encoder and an image decoding generator;
the sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features.
In some embodiments, the objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
In some embodiments, the input sequence pictures are provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, face lip keypoint contour constraints, head contour constraints, and background constraints.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
The face detection module is used for performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
The training module is used for training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
The construction module is used for constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
And the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the apparatus further comprises a preprocessing module;
The preprocessing module is used for:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum;
and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
Another aspect of the invention also provides an electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
After the original video data are obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data. A generating network, a lip synchronous judging network and an image quality judging network are then trained according to the first data and the second data, and a character lip generating model is constructed from them. Finally, input sequence pictures are processed through the character lip generating model to generate lip-synchronized image data. The invention can therefore accurately generate the lip image of a person speaking in a video.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an overall step diagram of a lip-sync video generating method according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In view of the problems in the prior art, the invention studies character lip generation and voice matching, so that the face and lips of any speaker can be matched with any target voice, including real and synthesized voice. Moreover, since real-world video contains rapidly changing pose, scale and illumination, the generated face must also be fused seamlessly back into the original target video.
The invention first uses an end-to-end model to encode the sound and the video images, and then generates, by decoding, lip images matched with the sound. The invention also employs a strong lip synchronization discriminator, which can accurately judge the synchronization accuracy and the realism of the generated lip movement relative to the voice and is used to guide the generation of better-synchronized lips; and a high-quality image quality discriminator, which can accurately judge whether an image is real and assess its quality and is used to guide the generation of more lifelike lip images. Extensive quantitative and subjective human evaluations show that the invention greatly outperforms current methods on many benchmarks.
The embodiment of the invention provides a lip-sync video generation method, as shown in fig. 1, comprising the following steps:
S1, acquiring original video data, wherein the original video data comprise voice data and image data of people in different scenes;
In the embodiment of the invention, the voice data in the video is mixed voice data of multiple speakers and multiple languages, the image data in the video is talking-face data under various scenes, scales and illumination conditions, and the video resolution is as high as 1080p.
S2, performing character labeling on voice data in the original video data to obtain first data, wherein the first data are used for determining the position of a face corresponding to each piece of voice data in a video image;
Specifically, the embodiment of the invention divides the video, through labeling, into a plurality of short segments in which the voice matches the speaker, and stores them. The collected data are labeled by matching voice to speaker: the position of the speaker's face corresponding to each segment of voice is labeled in the video image, while the voice and video durations are kept synchronized.
S3, carrying out face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
Specifically, the embodiment of the invention performs face detection on each frame of the labeled video segment, obtains the position of the face in each frame, and extends the detected face box by 5-50 pixels toward the chin, which ensures that the face detection box covers the whole face. Each frame's face image is then cropped and stored using the adjusted face detection box, and the voice data of the video segment is stored as well.
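The face-crop step above can be sketched as follows; the fixed 30-pixel chin margin, the box format and the helper name are illustrative assumptions rather than the patented implementation:

```python
# Illustrative sketch: extend a detected face box toward the chin before cropping.
# The margin value and helper name are assumptions, not the patent's parameters.
import numpy as np

def crop_face(frame: np.ndarray, box, chin_margin: int = 30) -> np.ndarray:
    """Crop a face from `frame` after extending the box toward the chin (5-50 px)."""
    x1, y1, x2, y2 = box                     # (left, top, right, bottom) from any face detector
    h = frame.shape[0]
    y2 = min(h, y2 + chin_margin)            # extend downward so the box covers the whole chin
    return frame[y1:y2, x1:x2]
```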
S4, training to obtain a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating a figure lip image, the lip synchronous judging network is used for judging the synchronicity of the figure lip and the figure audio, and the image quality judging network is used for judging the true and false of the generated image and the quality of the generated image;
s5, constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
It should be noted that the embodiment of the present invention constructs a high-definition character lip generating model based on a conditional GAN (generative adversarial network). The overall model is divided into two parts: a high-definition character image generating network and a judging network. The generating network is mainly used to generate high-definition character lip images; its inputs are the preprocessed conditional mask, the reference frames and the audio, and its output is a high-definition character lip image frame synchronized with the audio. The judging network is used during model training; its role is to judge whether the generated character image is real and whether the lips are synchronized with the audio. After computing the difference between the generated image and the real image, and the synchronization value between the generated lips and the real lips, it feeds the loss back to the generating network to optimize the image quality and lip synchronization quality of the generating network.
S6, processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, before the training step of step S4, the method further includes preprocessing the voice data and the image data in the original video data.
Specifically, the preprocessing of the voice data in the original video data comprises:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum.
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
The embodiment of the invention preprocesses the sound and the images separately before inputting them into the conditional GAN network model. Sound preprocessing normalizes the audio data and then converts the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum, a linear spectrum and the like. Image preprocessing sets the lower, lip-containing half of each frame in the video sequence to be generated to 0, so that the generating network generates the completed lip image; at the same time, the same number of reference frames as generated frames are selected to encode character feature information, which gives a better generation result. In addition, to preserve the correlation between adjacent frames of the generated video, the invention uses different numbers of input sequence frames during training, so that the generating network learns the relationship between preceding and following frames and the generated video becomes smoother and more natural; the number of frames in the generated video sequence can be chosen as 1, 3, 5, 7, 9 and so on according to the generation requirements of different video scenes and characters.
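A minimal preprocessing sketch is given below, assuming librosa for the spectrogram and NumPy arrays of shape (T, H, W, 3) for the video frames; the sample rate, mel-bin count and random reference-frame selection are illustrative assumptions, not the patent's settings:

```python
# Preprocessing sketch: waveform normalization, log-mel spectrogram, lower-half
# masking of frames, and reference-frame selection. Parameter values are assumed.
import numpy as np
import librosa

def audio_to_mel(wav_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    wav, _ = librosa.load(wav_path, sr=sr)
    wav = wav / (np.max(np.abs(wav)) + 1e-8)                       # normalize the waveform
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                                # log-mel sound spectrogram

def mask_lower_half(frames: np.ndarray) -> np.ndarray:
    """Set the lower, lip-containing half of every frame to 0 so the generator must complete it."""
    masked = frames.copy()
    masked[:, frames.shape[1] // 2:, :, :] = 0
    return masked

def pick_reference_frames(frames: np.ndarray, num: int) -> np.ndarray:
    """Select as many reference frames as generated frames to encode character identity."""
    idx = np.random.choice(len(frames), size=num, replace=len(frames) < num)
    return frames[idx]
```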
In some embodiments, the generating network comprises a sound encoder, an image encoder and an image decoding generator;
the sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features.
Specifically, the generating network of the embodiment of the present invention can be divided into a sound encoder, an image encoder and an image decoding generator. First, the preprocessed sound spectrogram is input into the sound encoder, and sound features are extracted through convolutional encoding. The preprocessed image sequence data is likewise input into the image encoder, and image features are extracted through convolutional encoding; the input image resolution includes, but is not limited to, 96x96, 128x128, 256x256, 512x512 and so on. The extracted sound and image features are then input into the image decoding generator, which finally generates a character lip image synchronized with the sound; according to different generation requirements, the generated image resolution may include, but is not limited to, 96x96, 128x128, 256x256, 512x512 and so on.
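The encoder-decoder structure can be sketched in PyTorch as below; the channel widths, the 96x96 resolution, the 6-channel stacking of masked and reference frames, and the feature-fusion scheme are illustrative placeholders and not the patented high-definition network:

```python
# Architecture sketch of a sound encoder + image encoder + image decoding generator.
# All layer sizes are assumptions chosen so the sketch runs at 96x96 input/output.
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=2):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))

class LipGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # sound encoder: log-mel spectrogram chunk -> audio feature vector
        self.sound_enc = nn.Sequential(conv_block(1, 32), conv_block(32, 64),
                                       conv_block(64, 128), nn.AdaptiveAvgPool2d(1))
        # image encoder: masked frame + reference frame stacked on channels (6 ch)
        self.image_enc = nn.Sequential(conv_block(6, 32), conv_block(32, 64),
                                       conv_block(64, 128))
        # image decoding generator: fused audio/image features -> completed lip image
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, mel, masked_and_ref):
        a = self.sound_enc(mel)                              # (B, 128, 1, 1)
        v = self.image_enc(masked_and_ref)                   # (B, 128, H/8, W/8)
        a = a.expand(-1, -1, v.size(2), v.size(3))           # broadcast the audio code spatially
        return self.decoder(torch.cat([a, v], dim=1))        # lip image synchronized with the audio
```

In this sketch a forward pass would take a (B, 1, 80, T) mel chunk and a (B, 6, 96, 96) tensor of masked-plus-reference frames and return a (B, 3, 96, 96) image.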
Specifically, the judging network can be divided into a lip synchronous judging network and an image quality judging network. During training, their role is to check the image quality and lip synchronization produced by the generating network and to output an image quality judgment value and a lip synchronization judgment value that guide the generating network toward sharper, more realistic images and more accurately synchronized lips. The lip synchronous judging network is a pre-trained network: its inputs are the audio of the current frame and the corresponding generated image frame, and its output is the degree of synchronization between each generated lip image and the corresponding audio; the discriminator's feedback value guides the network, during training, toward generating lip images that are better synchronized with the sound. The image quality judging network is trained together with the generating network: its inputs are the generated image and the real image, and its output is the probability that the image is real, which is used to judge the quality of the generated image and to guide the generating network toward more realistic images during training.
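One common way to score lip-audio synchronization, shown below as a sketch, compares embeddings of the audio chunk and the lip crop; the cosine-similarity probe and the BCE loss form are assumptions in the spirit of SyncNet-style discriminators, not the patent's exact network:

```python
# Sketch of a lip-sync scoring head over pre-computed audio and lip embeddings.
# The similarity-to-probability mapping and loss form are illustrative assumptions.
import torch
import torch.nn.functional as F

def sync_probability(audio_emb: torch.Tensor, lip_emb: torch.Tensor) -> torch.Tensor:
    """Map the cosine similarity of audio and lip embeddings to a sync score in [0, 1]."""
    sim = F.cosine_similarity(audio_emb, lip_emb, dim=1)
    return (sim + 1) / 2

def sync_loss(audio_emb: torch.Tensor, lip_emb: torch.Tensor) -> torch.Tensor:
    """L_sync-style penalty: push generated frames toward 'synchronized' under the pre-trained scorer."""
    p = sync_probability(audio_emb, lip_emb).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(p, torch.ones_like(p))
```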
In some embodiments, the objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
Specifically, the overall loss Loss in the formula is a weighted sum of the image L_1 loss, the lip audio-video synchronization loss and the image quality loss. S_w and S_g are the weight coefficients with which the lip synchronization discriminator and the image quality discriminator affect the overall loss, and the weight of each discriminator on the overall image generation can be adjusted as required. In the GAN loss, the discriminating network D iteratively maximizes the objective function, while the generating network G iteratively minimizes the image L_1 loss, the lip audio-video synchronization loss and the image quality loss, thereby ensuring that lip images with clearer detail are generated.
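The weighted sum can be written directly as a small helper; the weight values 0.3 and 0.1 below are illustrative, and since the text labels the reconstruction term L_1 while describing it as a mean-squared error, the exact pixel loss (L1 vs. MSE) is left as an assumption:

```python
# Sketch of the overall loss: Loss = (1 - S_w - S_g)*L_1 + S_w*L_sync + S_g*L_gen.
# Weight values and the choice of MSE for the pixel term are assumptions.
import torch.nn.functional as F

def total_loss(real, fake, l_sync, l_gen, s_w: float = 0.3, s_g: float = 0.1):
    l_1 = F.mse_loss(fake, real)                 # pixel reconstruction term between real and generated images
    return (1 - s_w - s_g) * l_1 + s_w * l_sync + s_g * l_gen
```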
In some embodiments, the input sequence pictures are provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, face lip keypoint contour constraints, head contour constraints, and background constraints.
Specifically, to generate realistic character lip images, the input data is a sequence of pictures with label constraints; the constraints may be variable-size edge pixel contours, face lip keypoint contour constraints, head contours, and the background. By including these constraints in the picture, finer control over the generated content is possible and a more controllable high-definition image can be generated. New input constraints can also be added for new requirements that arise in subsequent use, so that the generated content can be extended and enriched as needed.
In summary, the invention can generate a high-definition character video matched with the sound simply from the input sound and the video to be translated, and can serve as a general high-definition video translation generation framework. In particular, the invention trains an accurate lip synchronization discriminator that can be used to guide the generating network to generate accurate, natural lip movements. High-definition face images that differ in appearance and match the sound can be generated for different application fields (public news, lectures and education, film and television, and so on). The invention generates everything automatically from scratch, without requiring each video to be recorded by a real person, so production is faster and extension is richer.
Compared with the prior art, the invention provides a new lip generation and synchronization model for video characters, which can generate a lip-synchronized face video of any speaker from any voice, and which is more accurate and generalizes better than the lips generated by other current works.
The invention also provides a new lip synchronization judging model, so that lip synchronization can be accurately judged in videos of various complex environments.
The model of the invention does not depend on training with specific data; it is a speaker-independent generation model and can generate lips matched with the voice even for persons whose lip data did not appear in training.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
The face detection module is used for performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
The training module is used for training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating character lip images, the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
The construction module is used for constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
And the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the apparatus further comprises a preprocessing module;
The preprocessing module is used for:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum;
and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
Another aspect of the invention also provides an electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (7)

1. A lip-synchronized video generation method, comprising:
Acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
performing character labeling on voice data in the original video data to obtain first data, wherein the first data is used for determining the position of a face corresponding to each piece of voice data in a video image;
Performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
Training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating character lip images, the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data;
Wherein the generating network comprises a sound encoder, an image encoder and an image decoding generator;
The sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features;
The objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
2. The lip sync video generating method according to claim 1, further comprising preprocessing voice data and image data in the original video data;
specifically, the preprocessing of the voice data in the original video data comprises:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum;
the preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
3. The lip-sync video generation method according to claim 1, wherein the input sequence pictures are provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, face lip keypoint contour constraints, head contour constraints, and background constraints.
4. A lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
The face detection module is used for performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
The training module is used for training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating character lip images, the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
The construction module is used for constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
The generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data;
Wherein the generating network comprises a sound encoder, an image encoder and an image decoding generator;
The sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features;
The objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
5. The lip sync video generating apparatus as defined in claim 4, further comprising a preprocessing module;
The preprocessing module is used for:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum; and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
6. An electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executing the program to implement the method of any one of claims 1-3.
7. A computer readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method of any one of claims 1-3.
CN202011372011.4A 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium Active CN112562720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011372011.4A CN112562720B (en) 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011372011.4A CN112562720B (en) 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112562720A CN112562720A (en) 2021-03-26
CN112562720B true CN112562720B (en) 2024-07-12

Family

ID=75045329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011372011.4A Active CN112562720B (en) 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112562720B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338959A (en) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 End-to-end text-to-video video synthesis method, system medium and application
CN113179449B (en) * 2021-04-22 2022-04-12 清华珠三角研究院 Method, system, device and storage medium for driving image by voice and motion
CN113192161B (en) * 2021-04-22 2022-10-18 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN113194348B (en) * 2021-04-22 2022-07-22 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113362471A (en) * 2021-05-27 2021-09-07 深圳市木愚科技有限公司 Virtual teacher limb action generation method and system based on teaching semantics
CN113542624A (en) * 2021-05-28 2021-10-22 阿里巴巴新加坡控股有限公司 Method and device for generating commodity object explanation video
CN113380269B (en) * 2021-06-08 2023-01-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113242361B (en) * 2021-07-13 2021-09-24 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113628635B (en) * 2021-07-19 2023-09-15 武汉理工大学 Voice-driven speaker face video generation method based on teacher student network
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN113987269B (en) * 2021-09-30 2025-02-14 深圳追一科技有限公司 Digital human video generation method, device, electronic device and storage medium
CN113891079A (en) * 2021-11-11 2022-01-04 深圳市木愚科技有限公司 Automatic teaching video generation method, device, computer equipment and storage medium
CN114071204B (en) * 2021-11-16 2024-05-03 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device
CN114220172B (en) * 2021-12-16 2025-04-25 云知声智能科技股份有限公司 A method, device, electronic device and storage medium for lip movement recognition
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN114550720A (en) * 2022-03-03 2022-05-27 深圳地平线机器人科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114663962B (en) * 2022-05-19 2022-09-16 浙江大学 Lip-shaped synchronous face counterfeiting generation method and system based on image completion
CN114998489A (en) * 2022-05-26 2022-09-02 中国平安人寿保险股份有限公司 Virtual character video generation method, device, computer equipment and storage medium
CN115345968B (en) * 2022-10-19 2023-02-07 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
CN115376211B (en) * 2022-10-25 2023-03-24 北京百度网讯科技有限公司 Lip driving method, lip driving model training method, device and equipment
CN115580743A (en) * 2022-12-08 2023-01-06 成都索贝数码科技股份有限公司 Method and system for driving human mouth shape in video
CN116248974A (en) * 2022-12-29 2023-06-09 南京硅基智能科技有限公司 A method and system for video language conversion
CN116433807B (en) * 2023-04-21 2024-08-23 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116188637B (en) * 2023-04-23 2023-08-15 世优(北京)科技有限公司 Data synchronization method and device
CN116741198B (en) * 2023-08-15 2023-10-20 合肥工业大学 A lip synchronization method based on multi-scale dictionary
CN117150089B (en) * 2023-10-26 2023-12-22 环球数科集团有限公司 A character art image changing system based on AIGC technology
CN119028369B (en) * 2024-07-30 2025-06-17 浙江大学金华研究院 Face video generation method based on audio-driven face dialogue generation model
CN119211659B (en) * 2024-11-26 2025-06-20 杭州秋果计划科技有限公司 Stylized digital human video generation method, electronic device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108347578B (en) * 2017-01-23 2020-05-08 腾讯科技(深圳)有限公司 Method and device for processing video image in video call
US11003995B2 (en) * 2017-05-19 2021-05-11 Huawei Technologies Co., Ltd. Semi-supervised regression with generative adversarial networks
CN107767325A (en) * 2017-09-12 2018-03-06 深圳市朗形网络科技有限公司 Video processing method and device
CN109819313B (en) * 2019-01-10 2021-01-08 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
US11017506B2 (en) * 2019-05-03 2021-05-25 Amazon Technologies, Inc. Video enhancement using a generator with filters of generative adversarial network
CN110706308B (en) * 2019-09-07 2020-09-25 创新奇智(成都)科技有限公司 GAN-based steel coil end face edge loss artificial sample generation method
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111261187B (en) * 2020-02-04 2023-02-14 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111325817B (en) * 2020-02-04 2023-07-18 清华珠三角研究院 Virtual character scene video generation method, terminal equipment and medium
CN111783603B (en) * 2020-06-24 2025-05-09 有半岛(北京)信息科技有限公司 Generative adversarial network training method, image face swapping, video face swapping method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement

Also Published As

Publication number Publication date
CN112562720A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112562720B (en) Lip-sync video generation method, device, equipment and storage medium
CN113192161B (en) Virtual human image video generation method, system, device and storage medium
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN112562721B (en) Video translation method, system, device and storage medium
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
Cao et al. Expressive speech-driven facial animation
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
Zhou et al. An image-based visual speech animation system
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
CN113077537A (en) Video generation method, storage medium and equipment
JP2009533786A (en) Self-realistic talking head creation system and method
CN115761075A (en) Face image generation method, device, equipment, medium and product
EP4010899A1 (en) Audio-driven speech animation using recurrent neutral network
Liao et al. Speech2video synthesis with 3d skeleton regularization and expressive body poses
US7388586B2 (en) Method and apparatus for animation of a human speaker
CN116828129B (en) Ultra-clear 2D digital person generation method and system
CN117834935A (en) Digital person live broadcasting method and device, electronic equipment and storage medium
US20250140257A1 (en) Systems and methods for improved lip dubbing
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
CN114793300A (en) Virtual video customer service robot synthesis method and system based on generation countermeasure network
Kadam et al. A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation.
Rafiei Oskooei et al. Can One Model Fit All? An Exploration of Wav2Lip’s Lip-Syncing Generalizability Across Culturally Distinct Languages
CN119418713A (en) Speech-driven digital human construction method and device based on enhanced deformable convolution and spatiotemporal motion compensation
WO2024234089A1 (en) Improved generative machine learning architecture for audio track replacement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant