
CN113987269B - Digital human video generation method, device, electronic device and storage medium - Google Patents

Digital human video generation method, device, electronic device and storage medium

Info

Publication number
CN113987269B
CN113987269B (application CN202111169280.5A)
Authority
CN
China
Prior art keywords
model
audio
sub
target
audio frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111169280.5A
Other languages
Chinese (zh)
Other versions
CN113987269A (en)
Inventor
王鑫宇
刘炫鹏
常向月
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202111169280.5A priority Critical patent/CN113987269B/en
Publication of CN113987269A publication Critical patent/CN113987269A/en
Application granted granted Critical
Publication of CN113987269B publication Critical patent/CN113987269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract


Embodiments of the present disclosure provide a digital human video generation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a target audio and a target face image; for each audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame depicts the person indicated by the target face image uttering the audio indicated by the audio frame; and generating a digital human video based on the generated target images. Embodiments of the present disclosure can improve the efficiency of digital human generation.

Description

Digital human video generation method, device, electronic device and storage medium
Technical Field
The present disclosure relates to the technical field of digital human video generation, and in particular to a digital human video generation method and apparatus, an electronic device, and a storage medium.
Background
Digital human generation technology is maturing, and a number of digital human generation methods have emerged, such as methods based on pix2pix, pix2pixHD, Vid2Vid, few-shot video2video, NeRF, StyleGAN, and the like.
However, in these conventional schemes, if the generated face key points are inaccurate or the generated face sketch is of poor quality, the quality of the finally generated digital human picture is also poor.
Disclosure of Invention
In view of the above, to solve some or all of the technical problems above, embodiments of the present disclosure provide a digital human video generation method, apparatus, electronic device, and storage medium.
In a first aspect, an embodiment of the present disclosure provides a digital human video generating method, including:
acquiring target audio and target face images;
for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame;
Based on the generated target image, a digital human video is generated.
Optionally, in the method of any embodiment of the disclosure, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, output data of the third sub-model includes a target image, and
Inputting the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the method comprises the following steps:
inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector;
inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector;
combining the first hidden vector and the second hidden vector to obtain a combined vector;
and inputting the combined vector into the third sub-model to obtain a target image corresponding to the audio frame.
Optionally, in a method of any embodiment of the disclosure, the end-to-end model is trained by:
acquiring video data;
Extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm, taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image that corresponds to the sample audio and is generated by the generator, and taking the current generator as the end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training ending condition.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, includes:
Acquiring an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points;
the following first training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
Inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
If the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to execute the first training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
The following second training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio;
combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio;
inputting the combined vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and a target image extracted from the sample face image corresponding to the sample audio;
if the calculated second function value is less than or equal to a second preset threshold, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated second function value is greater than the second preset threshold, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network, and continuing to execute the second training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the method of any embodiment of the disclosure, the preset training ending condition includes at least one of:
the function value of the preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
The function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is larger than or equal to a second preset value.
Optionally, in the method of any embodiment of the disclosure, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
Optionally, in the method of any embodiment of the disclosure, the audio frame sequence corresponding to the audio frame includes the audio frame and the preceding preset number of consecutive audio frames in the target audio.
In a second aspect, an embodiment of the present disclosure provides a digital human video generating apparatus, the apparatus including:
an acquisition unit configured to acquire a target audio and a target face image;
an input unit configured to, for an audio frame in the target audio, input an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame;
and a generation unit configured to generate a digital human video based on the generated target image.
Optionally, in the apparatus according to any of the embodiments of the present disclosure, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, output data of the third sub-model includes a target image, and
The generating unit is further configured to:
inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector;
inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector;
combining the first hidden vector and the second hidden vector to obtain a combined vector;
and inputting the combined vector into the third sub-model to obtain a target image corresponding to the audio frame.
Optionally, in an apparatus of any embodiment of the disclosure, the end-to-end model is trained by:
acquiring video data;
Extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm, taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image that corresponds to the sample audio and is generated by the generator, and taking the current generator as the end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training ending condition.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, includes:
Acquiring an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points;
the following first training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
Inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
If the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to execute the first training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
The following second training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio;
combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio;
inputting the combined vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and a target image extracted from the sample face image corresponding to the sample audio;
if the calculated second function value is less than or equal to a second preset threshold, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated second function value is greater than the second preset threshold, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network, and continuing to execute the second training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the apparatus of any embodiment of the disclosure, the preset training ending condition includes at least one of:
the function value of the preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
The function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is larger than or equal to a second preset value.
Optionally, in the apparatus of any embodiment of the disclosure, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
Optionally, in the apparatus of any embodiment of the present disclosure, the audio frame sequence corresponding to the audio frame includes the audio frame and the preceding preset number of consecutive audio frames in the target audio.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory for storing a computer program;
A processor, configured to execute the computer program stored in the memory, where the computer program, when executed, implements the digital human video generation method of any embodiment of the first aspect of the present disclosure.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium storing a computer program which, when executed by a processor, implements the digital human video generation method of any embodiment of the first aspect described above.
In a fifth aspect, embodiments of the present disclosure provide a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the steps in the method as in any of the embodiments of the digital human video generation method of the first aspect described above.
According to the digital human video generation method provided by the embodiments of the present disclosure, a target audio and a target face image are acquired; then, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image are input into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame; finally, a digital human video is generated based on the generated target images. In this way, the end-to-end model directly produces the target images used to generate the digital human video, so the efficiency of digital human video generation is improved by increasing the speed at which the target images are generated.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided by embodiments of the present disclosure;
FIG. 2 is a flow chart of a digital human video generation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an application scenario for the embodiment of FIG. 2;
FIG. 4A is a flow chart of another digital human video generation method provided by an embodiment of the present disclosure;
Fig. 4B is a schematic structural diagram of a mouth region image generation model in a digital human video generation method according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a digital human video generating apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices, or modules, and do not represent any particular technical meaning nor logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, that A and B both exist, or that B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided in an embodiment of the present disclosure.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit data (e.g., target audio and target face images), etc. Various client applications, such as audio and video processing software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server processing data transmitted by the terminal devices 101, 102, 103. As an example, the server 105 may be a cloud server.
It should be noted that, the server may be hardware, or may be software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should also be noted that the digital human video generation method provided by the embodiments of the present disclosure may be executed by a server, by a terminal device, or by a server and a terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, or sub-module) included in the digital human video generating apparatus may be entirely disposed in the server, entirely disposed in the terminal device, or disposed in the server and the terminal device respectively.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation. When the electronic device on which the digital human video generation method runs does not require data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or terminal device) on which the digital human video generation method runs.
Fig. 2 illustrates a flow 200 of a digital human video generation method provided by an embodiment of the present disclosure. The digital human video generation method comprises the following steps:
step 201, acquiring target audio and target face images.
In this embodiment, the execution subject of the digital human video generation method (e.g., the server or the terminal device shown in fig. 1) may acquire the target audio and the target face image from other electronic devices or locally.
The target audio may be any audio. The target audio may be used so that the digital human video generated in a subsequent step utters the sound indicated by the target audio. For example, the target audio may be voice audio, or audio generated by machine conversion of text to speech.
The target face image may be any face image. As an example, the target face image may be a captured image containing a face, or may be a frame of face image extracted from a video.
In some cases, there may be no association between the target audio and the target face image. For example, the target audio may be audio from a first person, the target facial image may be a facial image of a second person, wherein the second person may be a person other than the first person, or the target audio may be audio from the first person at a first time, the target facial image may be a facial image of the first person at a second time, wherein the second time may be any time other than the first time.
Step 202, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model for the audio frame in the target audio, and generating a target image corresponding to the audio frame.
In this embodiment, the execution body may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame.
The audio frame sequence corresponding to the audio frame is a sequence of continuous audio frames containing the audio frame in the target audio. The target region image is a region image other than the mouth region image in the target face image. The target image corresponding to the audio frame is used for indicating the person indicated by the target face image to send out the audio indicated by the audio frame. The end-to-end model may represent a correspondence between an audio frame sequence corresponding to an audio frame, a target region image in a target face image, and a target image corresponding to the audio frame.
Here, the audio frame sequence corresponding to the audio frame may be a sequence of a preset number of audio frames in the target audio that includes the audio frame. For example, the audio frame sequence may contain the audio frame and the 4 frames preceding it, or the audio frame together with the 2 frames preceding it and the 2 frames following it.
Optionally, the audio frame sequence corresponding to the audio frame includes the audio frame and the preceding preset number of consecutive audio frames in the target audio.
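For illustration only, a minimal Python sketch of how such a per-frame audio window could be assembled is given below. The window length of 5 frames (the current frame plus the preceding 4) and the policy of padding the earliest frames are assumptions drawn from the examples in this description, not requirements of the disclosure.

```python
from typing import List, Sequence

def build_audio_windows(audio_frames: Sequence, history: int = 4) -> List[list]:
    """For each audio frame, build the window of the frame itself plus the
    preceding `history` frames (an assumed windowing scheme)."""
    windows = []
    for i in range(len(audio_frames)):
        start = max(0, i - history)
        window = list(audio_frames[start:i + 1])
        # Pad by repeating the first frame so every window has history + 1 frames.
        while len(window) < history + 1:
            window.insert(0, audio_frames[0])
        windows.append(window)
    return windows
```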
In some optional implementations of this embodiment, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model. The input data of the first sub-model is an audio frame sequence corresponding to the audio frame. The output data of the first sub-model is a first hidden vector. The input data of the second sub-model is a target area image in the target face image. The output data of the second sub-model is a second hidden vector. The input data of the third sub-model includes the first hidden vector and the second hidden vector. The output data of the third sub-model includes a target image.
On this basis, the execution body may execute the step 202 in the following manner, so as to input the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame:
And a first step of inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector.
The first sub-model may include model structures such as a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory) network. As an example, the first sub-model may include 2 CNN layers and 2 LSTM layers. The first hidden vector may be a sound coding vector, i.e., a vector output by an intermediate layer.
And a second step of inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector.
The second sub-model may include model structures such as CNN and LSTM. As an example, the second sub-model may include 4 CNN layers. The second hidden vector may be an encoding vector of a target region image (e.g., the target region image in the target face image, or the target region image in a sample face image), for example the hidden-space output of a joint encoder.
And thirdly, combining the first hidden vector and the second hidden vector to obtain a combined vector.
And step four, inputting the vector after combination into the third sub-model to obtain a target image corresponding to the audio frame.
The third sub-model may include model structures such as CNN and LSTM. As an example, the third sub-model may include 4 CNN layers. The third sub-model may represent a correspondence between the combined vector and the target image.
It can be appreciated that in the above alternative implementation manner, the target image corresponding to the audio frame is generated through the first sub-model, the second sub-model and the third sub-model included in the end-to-end model, so that the generating effect of the digital human video can be improved by improving the accuracy of the generated target image. In addition, in some cases, in the optional implementation manner, in the use process of the end-to-end model, operations such as key point extraction and inverse normalization processing are not needed, so that the accuracy of digital human video generation can be improved.
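The following is a minimal PyTorch sketch of the three-sub-model generator described above (2 CNN layers plus 2 LSTM layers for the first sub-model, 4 CNN layers for the second, 4 CNN layers for the third). All concrete layer widths, feature dimensions, the use of 1D convolutions over audio features, and the output resolution are illustrative assumptions; the disclosure only fixes the overall structure and data flow.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """First sub-model: 2 CNN layers + 2 LSTM layers; audio window -> first hidden vector."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True)

    def forward(self, audio_seq):                      # (B, n_mels, T) audio-feature window
        x = self.conv(audio_seq).transpose(1, 2)       # (B, T, 256)
        _, (h, _) = self.lstm(x)
        return h[-1]                                   # first hidden vector, shape (B, hidden)

class FaceEncoder(nn.Module):
    """Second sub-model: 4 CNN layers; target area image -> second hidden vector."""
    def __init__(self, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, target_area_image):              # (B, 3, H, W), mouth region removed
        x = self.conv(target_area_image)
        return x.mean(dim=(2, 3))                      # second hidden vector, shape (B, hidden)

class Decoder(nn.Module):
    """Third sub-model: 4 transposed-CNN layers; combined vector -> target image."""
    def __init__(self, hidden=512):
        super().__init__()
        self.fc = nn.Linear(hidden, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, merged):                         # merged = concat of the two hidden vectors
        x = self.fc(merged).view(-1, 256, 4, 4)
        return self.deconv(x)                          # target image (B, 3, 64, 64) in this sketch

def generate_target_image(audio_enc, face_enc, decoder, audio_seq, target_area_image):
    z_audio = audio_enc(audio_seq)                     # first hidden vector
    z_face = face_enc(target_area_image)               # second hidden vector
    merged = torch.cat([z_audio, z_face], dim=1)       # combined vector
    return decoder(merged)                             # target image for this audio frame
```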
In some of the above alternative implementations, the above end-to-end model is trained by:
step one, obtaining video data.
The video data may be any video data containing voice and face images. In the video data, each video frame contains an audio frame and a face image, i.e., each audio frame has a corresponding one of the face images. For example, in the video data within one second, if the video within one second contains 5 frames, that is, 5 audio frames and 5 face images, the audio frames and the face images are in one-to-one correspondence.
And secondly, extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image.
And thirdly, using a machine learning algorithm, taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training ending condition, taking the current generator as the end-to-end model.
The preset training ending condition may include at least one of the following: the calculated loss function value is less than or equal to a preset threshold, or the probability, as judged by the discriminator, that the mouth region image generated by the generator is the mouth region image of the sample face image corresponding to the sample audio is 50%.
It will be appreciated that in the above case, the end-to-end model is obtained based on a generative adversarial network, so that the generation effect of the digital human video can be improved by improving the accuracy of the target image generated by the generator.
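The adversarial training procedure described above could be organized roughly as follows. This is a high-level sketch: the discriminator interface, the binary cross-entropy losses, and the numeric thresholds are assumptions that merely restate the preset training ending conditions (a sufficiently small loss, or a discriminator output of about 50% on generated images) in code form.

```python
import torch
import torch.nn as nn

def train_gan(generator, discriminator, samples, g_opt, d_opt,
              loss_threshold=0.05, max_epochs=100):
    """Outer adversarial loop; stops when a preset training ending condition holds."""
    bce = nn.BCELoss()
    for epoch in range(max_epochs):
        for audio_window, target_area_image, real_target_image in samples:
            # Generator (the model being trained): sample audio + target area image
            # -> predicted target image.
            fake = generator(audio_window, target_area_image)

            # Discriminator update. In the disclosure the discriminator focuses on the
            # mouth region of the generated image; here it scores whole images to keep
            # the sketch short.
            d_real = discriminator(real_target_image)
            d_fake = discriminator(fake.detach())
            d_loss = bce(d_real, torch.ones_like(d_real)) + \
                     bce(d_fake, torch.zeros_like(d_fake))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            # Generator update.
            g_loss = bce(discriminator(fake), torch.ones_like(d_fake))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        # Preset training ending conditions: the loss is small enough, or the
        # discriminator assigns roughly 50% probability to generated images being real.
        if g_loss.item() <= loss_threshold or abs(d_fake.mean().item() - 0.5) < 0.01:
            break
    return generator  # the current generator is taken as the end-to-end model
```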
In some cases, the preset training end condition also includes at least one of the following:
the first term is that a function value of a preset loss function calculated based on an audio frame sequence corresponding to an audio frame is smaller than or equal to a first preset value.
The audio frame sequence corresponding to the audio frame may be a sequence formed by a preset number of audio frames including the audio frame in the target audio. For example, the sequence of audio frames may contain the audio frame and the first 4 frames of the audio frame.
The second term is that the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is larger than or equal to a second preset value.
The audio frame sequence corresponding to a non-audio frame (hereinafter referred to as a target frame) may be an audio frame sequence other than the audio frame sequence corresponding to the audio frame. For example, the audio frame sequence corresponding to the non-audio frame may be a sequence formed by a preset number of randomly selected audio frames in the video data or the target video. The target frame may or may not be included in the audio frame sequence corresponding to the non-audio frame.
In some cases, the number of audio frames included in the audio frame sequence corresponding to the audio frame may be equal to the number of audio frames included in the audio frame sequence corresponding to the non-audio frame.
It can be understood that, in the above case, the audio frame sequence corresponding to the audio frame (for example, the current frame and the preceding 4 frames) and the corresponding information of the sample face image are input into the discriminator, and the smaller the loss, the better; specifically, the 26 mouth key points inferred from the audio of the current frame and the preceding 4 frames and the 26 key points of the real mouth of the current frame are used, and the smaller the function value of the preset loss function, the better, so that the mouth produced by adversarial generation is more realistic, that is, the generation effect of the digital human video is better. Conversely, for an audio frame sequence not corresponding to the current frame (for example, 5 other audio frames), the 26 key points inferred from those 5 frames of audio and the 26 key points of the real mouth of the current frame are input into the discriminator, and the larger the function value of the preset loss function, the better, which likewise makes the mouth produced by the generator more realistic, that is, improves the generation effect of the digital human video.
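One hypothetical way to realize the key-point-based synchronization check described above is sketched below: the discriminator receives the 26 mouth key points inferred from an audio window together with the 26 real mouth key points of the current frame, and matched windows should yield a small loss while mismatched windows yield a large one. The network shape and the BCE loss are assumptions.

```python
import torch
import torch.nn as nn

class SyncDiscriminator(nn.Module):
    """Scores whether predicted mouth key points match the real mouth key points."""
    def __init__(self):
        super().__init__()
        # 26 predicted + 26 real 2D key points, flattened.
        self.net = nn.Sequential(
            nn.Linear(26 * 2 * 2, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),   # probability that the pair is "in sync"
        )

    def forward(self, pred_kpts, real_kpts):   # both (B, 26, 2)
        x = torch.cat([pred_kpts, real_kpts], dim=1).flatten(1)
        return self.net(x)

def sync_losses(disc, kpts_from_matched_audio, kpts_from_other_audio, real_kpts):
    """Smaller loss for the matched audio window, larger loss for a mismatched one."""
    bce = nn.BCELoss()
    p_match = disc(kpts_from_matched_audio, real_kpts)
    p_other = disc(kpts_from_other_audio, real_kpts)
    # Matched window: loss w.r.t. label 1 should be small.
    # Mismatched window: loss w.r.t. label 1 should be large.
    return bce(p_match, torch.ones_like(p_match)), bce(p_other, torch.ones_like(p_other))
```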
Optionally, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
In some application scenarios of the foregoing cases, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, includes:
First, an initial generative adversarial network is acquired. The initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, where the input data of the fourth sub-model is a first hidden vector and the output data of the fourth sub-model is mouth key points.
After that, the following first training steps (including steps one to four) are performed:
Step one, inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio.
And step two, inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain the predicted mouth key point corresponding to the sample audio.
And thirdly, calculating a first function value of a first preset loss function based on the predicted mouth key point corresponding to the sample audio and the mouth key point extracted from the sample face image corresponding to the sample audio.
And step four, if the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, if the calculated first function value is greater than the first preset threshold, the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network are updated, and the first training step continues to be performed based on the initial generative adversarial network with the updated model parameters.
It can be understood that, in the above alternative implementation, whether the model parameters of the first sub-model and the model parameters of the fourth sub-model in the generative adversarial network can be used for inference is determined according to the magnitude of the first function value, and the digital human video is generated by the trained generator in the generative adversarial network, so that the generation effect of the digital human video is improved; in addition, in the stage of using the generator, the key points do not need to be obtained by a second model, so that the generation efficiency of the digital human video can be improved.
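A sketch of this first training step follows, under the assumption that the fourth sub-model is a small fully connected head over the first hidden vector and that the first preset loss function is a mean-squared error over the 26 mouth key points; the threshold and step limit are placeholders.

```python
import torch
import torch.nn as nn

class MouthKeypointHead(nn.Module):
    """Fourth sub-model: first hidden vector -> 26 predicted mouth key points (assumed form)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.fc = nn.Linear(hidden, 26 * 2)

    def forward(self, z_audio):
        return self.fc(z_audio).view(-1, 26, 2)

def first_training_step(audio_enc, kpt_head, samples, optimizer,
                        first_threshold=1e-3, max_steps=100000):
    mse = nn.MSELoss()                               # first preset loss function (assumed choice)
    for step, (sample_audio, real_mouth_kpts) in enumerate(samples):
        z_audio = audio_enc(sample_audio)            # first hidden vector
        pred_kpts = kpt_head(z_audio)                # predicted mouth key points
        loss = mse(pred_kpts, real_mouth_kpts)       # first function value
        if loss.item() <= first_threshold:
            break                                    # keep current first sub-model parameters
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step >= max_steps:
            break
    return audio_enc, kpt_head
```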
Optionally, the step of training to obtain the end-to-end model may further include performing a second training step (including the first step to the sixth step) as follows.
And the first step is to input the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio.
And the second step is to input a target area image in the sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio.
And thirdly, combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio.
And a fourth step of inputting the combined vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio.
And a fifth step of calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and the target image extracted from the sample face image corresponding to the sample audio.
And a sixth step of determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model, if the calculated second function value is less than or equal to the second preset threshold.
Optionally, if the calculated second function value is greater than the second preset threshold, the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network are updated, and the second training step continues to be performed based on the initial generative adversarial network with the updated model parameters.
It can be understood that, after the model parameters of the first sub-model and the model parameters of the fourth sub-model are fixed, whether the model parameters of the third sub-model can be used for inference is determined according to the magnitude of the second function value, and the digital human video is generated by the trained generator in the generative adversarial network, so that the generation effect of the digital human video is improved; moreover, in the stage of using the generator, the key points do not need to be obtained by a second model, so that the generation efficiency of the digital human video can be further improved.
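A corresponding sketch of the second training step is given below, assuming an L1 reconstruction loss as the second preset loss function and keeping the first sub-model parameters fixed as described; it reuses the sub-model sketches shown earlier.

```python
import torch
import torch.nn as nn

def second_training_step(audio_enc, face_enc, decoder, samples, optimizer,
                         second_threshold=0.02, max_steps=100000):
    l1 = nn.L1Loss()                                  # second preset loss function (assumed choice)
    for p in audio_enc.parameters():                  # first sub-model parameters stay fixed
        p.requires_grad_(False)
    for step, (sample_audio, target_area_image, real_target_image) in enumerate(samples):
        with torch.no_grad():
            z_audio = audio_enc(sample_audio)         # first hidden vector
        z_face = face_enc(target_area_image)          # second hidden vector
        merged = torch.cat([z_audio, z_face], dim=1)  # combined vector
        pred_image = decoder(merged)                  # predicted target image
        loss = l1(pred_image, real_target_image)      # second function value
        if loss.item() <= second_threshold:
            break                                     # keep current second/third sub-model parameters
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step >= max_steps:
            break
    return face_enc, decoder
```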
Step 203, generating a digital human video based on the generated target image.
In this embodiment, the execution subject may generate the digital human video based on the generated respective target images.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the digital human video generation method of the present embodiment. In fig. 3, a server 310 (i.e., the execution subject described above) first acquires a target audio 301 and a target face image 304. For an audio frame 302 in the target audio 301, the server 310 inputs an audio frame sequence 303 corresponding to the audio frame 302 and a target area image 305 in the target face image 304 into a pre-trained end-to-end model 306 and generates a target image 307 corresponding to the audio frame 302, where the audio frame sequence 303 corresponding to the audio frame 302 is a sequence of consecutive audio frames in the target audio 301 that contains the audio frame 302, the target area image 305 is the area image of the target face image 304 other than the mouth area image, and the target image 307 corresponding to the audio frame 302 is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame. The server 310 then generates a digital human video 308 based on the generated target image 307.
According to the method provided by the embodiment of the disclosure, the target audio and the target face image are acquired; then, for each audio frame in the target audio, the audio frame sequence corresponding to the audio frame and the target area image in the target face image are input into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to instruct the person indicated by the target face image to send out the audio indicated by the audio frame; finally, a digital human video is generated based on the generated target images. Therefore, the end-to-end model is adopted to obtain the target images for generating the digital human video directly, so that the efficiency of generating the digital human video is improved by increasing the speed at which the target images are generated.
With further reference to fig. 4A, a flow 400 of yet another embodiment of a digital human video generation method is shown. The flow of the digital human video generation method comprises the following steps:
Step 401, acquiring target audio and target face images.
Step 402, for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector, inputting a target region image in the target face image into the second sub-model to obtain a second hidden vector, merging the first hidden vector and the second hidden vector to obtain a merged vector, and inputting the merged vector into the third sub-model to obtain a target image corresponding to the audio frame.
The input data of the first sub-model is an audio frame sequence corresponding to an audio frame, the output data of the first sub-model is a first hidden vector, the input data of the second sub-model is a target area image in the target face image, the output data of the second sub-model is a second hidden vector, the input data of the third sub-model comprises the first hidden vector and the second hidden vector, and the output data of the third sub-model comprises a target image.
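To make the data flow between the three sub-models concrete, here is a minimal PyTorch-style sketch. All module names, layer choices, hidden sizes, and the 64×64 output resolution are assumptions for illustration only; the patent does not prescribe a specific architecture.

```python
import torch
import torch.nn as nn

class EndToEndModel(nn.Module):
    """Sketch of the three-sub-model pipeline: audio sequence -> first hidden vector,
    mouth-removed face -> second hidden vector, merged vector -> target image."""

    def __init__(self, audio_dim=28, hidden_dim=512):
        super().__init__()
        # First sub-model: encodes the audio frame sequence into a hidden vector.
        self.audio_encoder = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        # Second sub-model: encodes the target area image (face without mouth).
        self.face_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Third sub-model: decodes the merged vector into a face image.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(scale_factor=8),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, audio_seq, face_region):
        _, (h, _) = self.audio_encoder(audio_seq)   # audio_seq: (B, T, audio_dim)
        lm1 = h[-1]                                  # first hidden vector
        lm2 = self.face_encoder(face_region)         # second hidden vector
        lm3 = torch.cat([lm1, lm2], dim=1)           # merged vector
        return self.decoder(lm3)                     # target image
```

In the later description of this embodiment, LM1, LM2 and LM3 correspond to `lm1`, `lm2` and `lm3` in this sketch.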
Step 403, generating a digital human video based on the generated target image.
As an example, the digital human video generation method of the present embodiment may be performed as follows.
First, the format of the data is introduced:
In the digital human video generation method, the size of the face sketch is 512×512×1, the size of the target face image is 512×512×3, and the combined size of the face sketch and the target face image is 512×512×4.
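The 512×512×4 combination is simply a channel-wise concatenation of the 1-channel face sketch and the 3-channel target face image. A small NumPy sketch with placeholder arrays:

```python
import numpy as np

face_sketch = np.zeros((512, 512, 1), dtype=np.float32)  # 1-channel face sketch
face_image = np.zeros((512, 512, 3), dtype=np.float32)   # 3-channel target face image

# Concatenate along the channel axis to obtain the 512x512x4 combined input.
combined = np.concatenate([face_sketch, face_image], axis=-1)
assert combined.shape == (512, 512, 4)
```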
The implementation process of the specific scheme is described below with reference to fig. 4B:
After the user audio (i.e., the target audio) is obtained, it is processed by an encoder (i.e., the first sub-model) to generate a sound coding vector LM1 (the middle layer, or hidden space, of the CNN/LSTM, i.e., the first hidden vector). The sound coding vector LM1 is then combined, in a channel-wise manner, with the original-picture vector LM2 (the hidden space produced by the joint encoder from the target area image) to obtain a channel synthesis vector LM3 (i.e., the merged vector, containing features of both the mouth and the face picture). LM3 is then processed by a decoder (i.e., the third sub-model, the decoding part of the GAN generation model) to obtain a synthesized digital human picture (i.e., the target image), and the pictures are finally assembled into a digital human video (a video comprising multiple frames).
In the training phase, this can be performed by the following steps:
Training is divided into two stages:
In the first stage, the sound (i.e., the sample audio) passes through a CNN and an LSTM, collectively referred to as the model LMEncoder (i.e., the first sub-model), and then through a fully connected layer (i.e., the fourth sub-model) to obtain 26 inferred key points (for example, the mouth key points may include 20 key points of the mouth and 6 key points of the chin). The first function value of the first preset loss function is computed from the 26 inferred key points and the real key points (i.e., the mouth key points extracted from the sample face image corresponding to the sample audio), and LMEncoder is trained with it.
In the second stage, after the first function value of the first preset loss function of the 26 key points is stable (for example, the calculated first function value is less than or equal to the first preset threshold value), model parameters of LMEncoder are fixed, and the training of the encoder and decoder LipGAN is started, which specifically includes the following steps:
first, video data including audio (i.e., sample audio) and pictures (i.e., sample face images corresponding to the sample audio) are prepared.
The data are processed at a frame rate of 25 frames per second: features are extracted from the audio, and face key points and the corresponding Canny edge lines are extracted from the pictures, i.e., 68 face key points are extracted from each video frame (i.e., from the sample face image corresponding to the sample audio). The audio features may be MFCCs extracted via Fourier transform, features extracted with a DeepSpeech model, or features from other algorithms (e.g., an ASR speech-recognition model).
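As one possible concrete reading of this data-preparation step, MFCC features can be extracted with librosa and Canny edge lines with OpenCV; a 68-point landmark detector (for example dlib) would supply the face key points. The sample rate, number of MFCC coefficients, and Canny thresholds below are assumptions, not values fixed by the patent.

```python
import cv2
import librosa

def extract_audio_features(wav_path, sr=16000, fps=25):
    """MFCC features aligned to a 25 fps video: hop length = sr / fps samples."""
    audio, _ = librosa.load(wav_path, sr=sr)
    hop = sr // fps
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                      # (num_frames, 13), one row per video frame

def extract_canny_lines(frame_bgr):
    """Canny edge map used as the face sketch for one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)   # (H, W) edge image
```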
Then, as shown in fig. 4B, after the voice passes through the CNN and LSTM, the voice coding vector LM1 is generated; the 26 mouth key points are then inferred through the fully connected layer, and a loss (i.e., the first function value) is calculated from the 26 inferred key points and the 26 real mouth key points of the face, so as to train LMEncoder.
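A minimal sketch of this first training stage: audio features go through a CNN and an LSTM (LMEncoder), then a fully connected head regresses the 26 key points, which are compared against the real mouth key points. The feature dimension, hidden size, and the choice of L1 loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LMEncoder(nn.Module):
    """CNN + LSTM over the audio feature sequence (first sub-model)."""
    def __init__(self, feat_dim=28, hidden_dim=512):
        super().__init__()
        self.cnn = nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(128, hidden_dim, batch_first=True)

    def forward(self, audio_feats):                 # (B, T, feat_dim)
        x = self.cnn(audio_feats.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x)
        return h[-1]                                 # first hidden vector (B, hidden_dim)

encoder = LMEncoder()
keypoint_head = nn.Linear(512, 26 * 2)               # fully connected head: 26 (x, y) key points
criterion = nn.L1Loss()

audio_feats = torch.randn(8, 5, 28)                  # 5 audio frames of assumed 28-dim features
real_keypoints = torch.randn(8, 52)                  # 26 real mouth/chin key points (x, y)

pred = keypoint_head(encoder(audio_feats))           # 26 inferred key points
loss = criterion(pred, real_keypoints)               # first function value
loss.backward()
```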
Subsequently, after the loss (the first function value) stabilizes, the LMEncoder parameters (i.e., the first sub-model) are fixed; that is, once the LMEncoder model is trained, the encoder and decoder of LipGAN (i.e., the second and third sub-models) begin to be trained. Specifically, in the hidden layer, the hidden vector of the face picture with the mouth part removed (i.e., the original-picture vector LM2) and the hidden vector of the real human voice (i.e., the sound coding vector LM1) are merged into a 1024×1×1 vector (i.e., the channel synthesis vector LM3, containing features of both the mouth and the face picture), which is then passed through the decoder to generate a picture (i.e., the target image).
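The hand-off into the second stage, freezing LMEncoder and merging the two hidden vectors into the 1024-dimensional vector consumed by the decoder, could be sketched as follows. The sub-model modules are assumed to be those from the sketches above, and the L1 reconstruction term merely stands in for the unspecified second preset loss function; the adversarial terms are sketched separately after the discriminator description below.

```python
import torch
import torch.nn.functional as F

def stage_two_step(lm_encoder, face_encoder, decoder, audio_feats, face_region, real_image):
    """One generator-side step of the second stage (LipGAN encoder/decoder training)."""
    # Freeze the trained LMEncoder (first sub-model) before training encoder/decoder.
    for p in lm_encoder.parameters():
        p.requires_grad = False

    lm1 = lm_encoder(audio_feats)              # sound coding vector LM1
    lm2 = face_encoder(face_region)            # original-picture hidden vector LM2
    lm3 = torch.cat([lm1, lm2], dim=1)         # merged 1024-d channel synthesis vector LM3
    fake_image = decoder(lm3)                  # generated picture (target image)
    return F.l1_loss(fake_image, real_image)   # reconstruction term of the second function value
```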
It should be noted that, in both the first stage and the second stage, the mouth picture of one picture frame may be trained using either one frame or multiple frames of audio data. Specifically, when N frames of audio data are used to train the mouth picture (i.e., the 26 face key points) of the t-th picture frame, the audio data corresponding to frames t, t-1, t-2, …, t-(N-1) can be used, which improves the generation effect of the mouth picture and hence of the digital human picture. N can be larger than 1, and the larger N is, the better the mouth generation effect. For example, the final target image may be output using the current audio frame, the previous 4 audio frames, and the picture of the current frame excluding the mouth portion (i.e., the target area image), as sketched below.
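A sketch of building that N-frame audio window; how the very first frames (where t − i would be negative) are handled is not specified in the text, so clamping to frame 0 below is an assumption.

```python
def audio_window(audio_frames, t, n=5):
    """Return the audio frames t, t-1, ..., t-(n-1), padding with frame 0 at the start."""
    indices = [max(t - i, 0) for i in range(n)]
    return [audio_frames[i] for i in indices]

# Example: for t = 2 and n = 5 the window uses frames [2, 1, 0, 0, 0].
```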
In addition, a new loss function of the discriminator (i.e., the fourth sub-model) can be added in LipGAN to ensure the stability of image generation;
The current frame, the previous 4 frames (i.e., the audio frame sequence corresponding to the audio frame) and the current real picture (i.e., the target area image) are input into the discriminator, and the smaller the loss, the better. Specifically, the 26 key points inferred from the audio of the current frame and the previous 4 frames, together with the 26 real mouth key points of the current frame's face, are input into the discriminator to calculate the loss; the smaller this loss, the more realistic the mouth produced by the adversarial generator, i.e., the better the effect.
The other five frames (i.e., an audio frame sequence not corresponding to the current audio frame) and the current frame picture (i.e., the target area image) are input into the discriminator, and the larger the loss, the better. Specifically, the 26 key points inferred from 5 frames of audio that do not correspond to the current frame, together with the 26 real mouth key points of the current frame's face, are input into the discriminator to calculate the loss; the larger this loss, the more realistic the mouth produced by the adversarial generator, i.e., the better the effect.
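One way to read the two discriminator terms above: key-point pairs built from the matching audio window should be judged real, while pairs built from a non-matching window should be confidently rejected. The sketch below expresses this as a standard binary real/fake objective; the patent only states "smaller/larger is better", so the BCE formulation and the discriminator's assumed (B, 1) sigmoid output are illustrative choices, not the specified loss.

```python
import torch
import torch.nn as nn

def discriminator_loss(disc, kp_from_matching_audio, kp_from_other_audio, real_mouth_kp):
    """Binary real/fake objective over (inferred key points, real mouth key points) pairs.
    `disc` is assumed to map a concatenated key-point pair to a score in (0, 1)."""
    bce = nn.BCELoss()
    real_pair = torch.cat([kp_from_matching_audio, real_mouth_kp], dim=1)
    fake_pair = torch.cat([kp_from_other_audio, real_mouth_kp], dim=1)
    loss_real = bce(disc(real_pair), torch.ones(real_pair.size(0), 1))
    loss_fake = bce(disc(fake_pair), torch.zeros(fake_pair.size(0), 1))
    return loss_real + loss_fake
```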
In the reasoning (application) phase:
firstly, the audio of the current frame and the previous 4 frames (i.e., the audio frame sequence corresponding to the audio frame), or audio features extracted from it, is input into the model LMEncoder (i.e., the first sub-model) to obtain the hidden vector LM1 (i.e., the first hidden vector).
Then, the current picture with the mouth region removed (i.e., the target region image) is obtained and passed through the encoder to obtain the hidden vector IM2 (i.e., the second hidden vector).
Finally, the hidden vectors LM1 and IM2 are merged to obtain a merged hidden vector (i.e., the channel synthesis vector LM3, containing features of both the mouth and the face picture), which is input into the decoder (i.e., the third sub-model) to output the final picture (i.e., the target image). The digital human video is then output.
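Putting the inference steps together, with the module names from the earlier sketches and assumed tensor shapes, a minimal sketch is:

```python
import torch

@torch.no_grad()
def infer_frame(lm_encoder, face_encoder, decoder, audio_window_feats, mouth_removed_face):
    lm1 = lm_encoder(audio_window_feats)      # first hidden vector from current + previous 4 frames
    im2 = face_encoder(mouth_removed_face)    # second hidden vector from the mouth-removed picture
    merged = torch.cat([lm1, im2], dim=1)     # channel-merged hidden vector
    return decoder(merged)                    # final picture for this audio frame

# Generating the digital human video is then a loop over the audio frames,
# collecting the returned pictures and writing them out at the video frame rate.
```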
An acoustic inference model can be used to extract audio features from the audio. The input sound format may be wav (a lossless audio file format), and the frame rate may be 100, 50 or 25. The acoustic features may be MFCCs, or features extracted by models such as DeepSpeech, an ASR model, or wav2vec. The acoustic inference model may be an LSTM, BERT (Bidirectional Encoder Representations from Transformers), a Transformer, a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), etc. The 3DMM is a 3D morphable face model, a basic three-dimensional statistical model originally proposed to recover three-dimensional shape from two-dimensional face images; its authors collected three-dimensional head data of 200 faces and, by applying PCA (principal component analysis) to this data, obtained the principal components that represent face shape and texture.
In this embodiment, the specific implementation manners of the steps 401 to 403 may refer to the related descriptions of the corresponding embodiments of fig. 2, which are not described herein again. In addition, in addition to the above, the embodiments of the disclosure may further include the same or similar features and effects as those of the embodiment corresponding to fig. 2, which are not described herein.
The digital human video generation method in this embodiment generates the digital human video in an end-to-end manner: the audio is input and, combined with the hidden space of the joint encoder, the target image used to generate the digital human video is produced directly. In other words, no key points need to be obtained and no inverse normalization processing is required, so the efficiency is high. Furthermore, the audio features may be left unextracted, which further improves efficiency, while extracting audio features yields a better effect. In addition, the stability of target image generation can be maintained with the new discriminator loss function (i.e., the fourth sub-model).
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a digital human video generating apparatus, which corresponds to the above-described method embodiment, and which may include the same or corresponding features as the above-described method embodiment, and produce the same or corresponding effects as the above-described method embodiment, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the digital human video generation apparatus 500 of the present embodiment includes an acquisition unit 501, an input unit 502, and a generation unit 503. The acquisition unit 501 is configured to acquire a target audio and a target face image. The input unit 502 is configured to, for an audio frame in the target audio, input the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to instruct the person indicated by the target face image to send out the audio indicated by the audio frame. The generation unit 503 is configured to generate a digital human video based on the generated target images.
In the present embodiment, the acquisition unit 501 of the digital human video generation apparatus 500 may acquire the target audio and the target face image.
In this embodiment, the input unit 502 may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, and the target area image is an area image except for a mouth area image in the target face image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send the audio indicated by the audio frame.
In the present embodiment, the generation unit 503 may generate a digital human video based on the generated target image.
In some optional implementations of this embodiment, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, output data of the third sub-model includes a target image, and
The generating unit is further configured to:
inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector;
inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector;
combining the first hidden vector and the second hidden vector to obtain a combined vector;
and inputting the combined vector into the third sub-model to obtain a target image corresponding to the audio frame.
In some optional implementations of this embodiment, the end-to-end model is trained by:
acquiring video data;
Extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image;
and taking the sample audio as input data of a generator in the generated countermeasure network by adopting a machine learning algorithm to obtain a target image which corresponds to the sample audio and is generated by the generator, and taking the current generator as an end-to-end model if a discriminator in the generated countermeasure network determines that the target image generated by the generator meets a preset training ending condition.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, includes:
Acquiring an initial generation type countermeasure network, wherein the initial generation type countermeasure network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is a mouth key point;
the following first training step is performed:
inputting sample audio into a first sub-model included in an initial generation type countermeasure network to obtain a first hidden vector corresponding to the sample audio;
Inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
If the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generation type countermeasure network as the model parameters of the first sub-model included in the end-to-end model after training.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, further includes:
If the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generation type countermeasure network, and continuously executing the first training step based on the initial generation type countermeasure network after the model parameters are updated.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, further includes:
The following second training step is performed:
inputting sample audio into a first sub-model included in an initial generation type countermeasure network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio to a second sub-model included in an initial generation type countering network to obtain a second hidden vector corresponding to the sample audio;
Combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio;
The combined vector corresponding to the sample audio is input to a third sub-model included in an initial generation type countermeasure network, and a prediction target image corresponding to the sample audio is obtained;
Calculating a second function value of a second preset loss function based on a predicted target image corresponding to the sample audio and a target image extracted from a sample face image corresponding to the sample audio;
if the calculated second function value is less than or equal to the preset threshold, determining the model parameters of the second sub-model included in the current initial generation type countermeasure network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generation type countermeasure network as the model parameters of the third sub-model included in the trained end-to-end model.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, further includes:
And if the calculated second function value is greater than the second preset threshold value, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generation type countermeasure network, and continuously executing the second training step based on the initial generation type countermeasure network after updating the model parameters.
In some optional implementations of this embodiment, the preset training ending condition includes at least one of:
the function value of the preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
The function value of the preset loss function calculated based on an audio frame sequence not corresponding to the audio frame is greater than or equal to a second preset value.
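A hedged reading of these two conditions as a simple check (the two preset values are hyperparameters, and the two loss terms correspond to the matching / non-matching discriminator losses described earlier; the function name is illustrative):

```python
def training_finished(loss_matching_audio, loss_non_matching_audio,
                      first_preset_value, second_preset_value):
    """End training when the matching-audio loss is low enough and/or the
    non-matching-audio loss is high enough (the text requires at least one)."""
    return (loss_matching_audio <= first_preset_value or
            loss_non_matching_audio >= second_preset_value)
```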
In some optional implementations of this embodiment, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
In some optional implementations of this embodiment, the audio frame sequence corresponding to the audio frame includes the audio frame, and the audio frame that is continuous with the previous preset number of frames of the audio frame in the target audio.
In the apparatus 500 provided in the foregoing embodiment of the present disclosure, the acquiring unit 501 may acquire a target audio and a target face image, then, the input unit 502 may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image except for a mouth area image in the target face image, the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send the audio indicated by the audio frame, and finally, the generating unit 503 may generate a digital human video based on the generated target image. Therefore, the end-to-end model is adopted to directly obtain the target image for generating the digital human video, so that the efficiency of generating the digital human video is improved by improving the speed of generating the target image.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and the electronic device 600 shown in fig. 6 includes at least one processor 601, a memory 602, and at least one network interface 604 and other user interfaces 603. The various components in the electronic device 600 are coupled together by a bus system 605. It is understood that the bus system 605 is used to enable connected communications between these components. The bus system 605 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 605 in fig. 6.
The user interface 603 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It is to be appreciated that the memory 602 in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 602 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 602 stores the following elements, executable units or data structures, or a subset or an extended set thereof: an operating system 6021 and application programs 6022.
The operating system 6021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 6022 include various applications, such as a media player, a browser, and the like, for implementing various application services. A program for implementing the method of the embodiment of the present disclosure may be included in the application programs 6022.
In the embodiment of the disclosure, the processor 601 is configured to execute the method steps provided in the method embodiments by calling a program or an instruction stored in the memory 602, specifically, a program or an instruction stored in the application program 6022, for example, including acquiring a target audio and a target face image, inputting, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image except for a mouth area image in the target face image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send the audio indicated by the audio frame, and generating a digital human video based on the generated target image.
The methods disclosed in the embodiments of the present disclosure may be applied to the processor 601 or implemented by the processor 601. The processor 601 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 601 or by instructions in the form of software. The processor 601 may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), an off-the-shelf field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the disclosure may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present disclosure may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software elements in a decoding processor. The software elements may be located in a random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory 602, and the processor 601 reads the information in the memory 602 and performs the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units for performing the above-described functions of the application, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be an electronic device as shown in fig. 6, and may perform all steps of the digital human video generating method shown in fig. 2, so as to achieve the technical effects of the digital human video generating method shown in fig. 2, and detailed description with reference to fig. 2 is omitted herein for brevity.
The disclosed embodiments also provide a storage medium (computer-readable storage medium). The storage medium here stores one or more programs. The storage medium may include volatile memory, such as random access memory, or nonvolatile memory, such as read only memory, flash memory, hard disk, or solid state disk, or a combination of the foregoing.
When the one or more programs in the storage medium are executable by the one or more processors, the digital human video generation method executed on the electronic device side is implemented.
The processor is used for executing a communication program stored in a memory to realize the following steps: acquiring a target audio and a target face image; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to instruct the person indicated by the target face image to send out the audio indicated by the audio frame; and generating the digital human video based on the generated target image.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
While the foregoing embodiments have been described in some detail for purposes of clarity of understanding, it will be understood that the foregoing embodiments are merely illustrative of the invention and are not intended to limit the scope of the invention, and that any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present disclosure are intended to be included within the scope of the invention.

Claims (11)

1. A method for generating a digital human video, characterized in that the method comprises: acquiring a target audio and a target face image, wherein the target face image is a frame of face image extracted from a video; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to indicate that the person indicated by the target face image emits the audio indicated by the audio frame; and generating a digital human video based on the generated target image; wherein the end-to-end model is trained by: acquiring video data; extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to an extracted audio frame as sample audio, and taking the extracted face image as a sample face image; acquiring, with a machine learning algorithm, an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points; and performing the following first training step: inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio; inputting the first hidden vector corresponding to the sample audio into the fourth sub-model to obtain predicted mouth key points corresponding to the sample audio; calculating a first function value of a first preset loss function based on the predicted mouth key points corresponding to the sample audio and the mouth key points extracted from the sample face image corresponding to the sample audio; and, if the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
2. The method according to claim 1, characterized in that the end-to-end model comprises a first sub-model, a second sub-model and a third sub-model, the input data of the first sub-model is the audio frame sequence corresponding to an audio frame, the output data of the first sub-model is a first hidden vector, the input data of the second sub-model is the target area image in the target face image, the output data of the second sub-model is a second hidden vector, the input data of the third sub-model comprises the first hidden vector and the second hidden vector, and the output data of the third sub-model comprises the target image; and the inputting of the audio frame sequence corresponding to the audio frame and the target area image in the target face image into the pre-trained end-to-end model to generate the target image corresponding to the audio frame comprises: inputting the audio frame sequence corresponding to the audio frame into the first sub-model to obtain the first hidden vector; inputting the target area image in the target face image into the second sub-model to obtain the second hidden vector; merging the first hidden vector and the second hidden vector to obtain a merged vector; and inputting the merged vector into the third sub-model to obtain the target image corresponding to the audio frame.
3. The method according to claim 2, characterized in that the taking of the sample audio as input data of the generator in the generative adversarial network to obtain the target image generated by the generator and corresponding to the sample audio and, if the discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further comprises: if the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to perform the first training step based on the initial generative adversarial network with the updated model parameters.
4. The method according to claim 2, characterized in that the taking of the sample audio as input data of the generator in the generative adversarial network to obtain the target image generated by the generator and corresponding to the sample audio and, if the discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further comprises: performing the following second training step: inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio; inputting the target area image in the sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio; merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio; inputting the merged vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio; calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and the target image extracted from the sample face image corresponding to the sample audio; and, if the calculated second function value is less than or equal to a second preset threshold, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
5. The method according to claim 4, characterized in that the taking of the sample audio as input data of the generator in the generative adversarial network to obtain the target image generated by the generator and corresponding to the sample audio and, if the discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further comprises: if the calculated second function value is greater than the second preset threshold, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network, and continuing to perform the second training step based on the initial generative adversarial network with the updated model parameters.
6. The method according to any one of claims 3 to 5, characterized in that the preset training end condition comprises at least one of the following: a function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is less than or equal to a first preset value; a function value of the preset loss function calculated based on an audio frame sequence not corresponding to the audio frame is greater than or equal to a second preset value.
7. The method according to any one of claims 2 to 5, characterized in that the second sub-model is an encoder and the third sub-model is a decoder corresponding to the encoder.
8. The method according to any one of claims 1 to 5, characterized in that the audio frame sequence corresponding to the audio frame comprises the audio frame and a preset number of consecutive audio frames preceding the audio frame in the target audio.
9. A digital human video generation device, characterized in that the device comprises: an acquisition unit configured to acquire a target audio and a target face image, wherein the target face image is a frame of face image extracted from a video; an input unit configured to, for an audio frame in the target audio, input an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to indicate that the person indicated by the target face image emits the audio indicated by the audio frame; and a generation unit configured to generate a digital human video based on the generated target image; wherein the end-to-end model is trained by: acquiring video data; extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to an extracted audio frame as sample audio, and taking the extracted face image as a sample face image; acquiring, with a machine learning algorithm, an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points; and performing the following first training step: inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio; inputting the first hidden vector corresponding to the sample audio into the fourth sub-model to obtain predicted mouth key points corresponding to the sample audio; calculating a first function value of a first preset loss function based on the predicted mouth key points corresponding to the sample audio and the mouth key points extracted from the sample face image corresponding to the sample audio; and, if the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
10. An electronic device, characterized by comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method according to any one of claims 1 to 8.
11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1 to 8 is implemented.
CN202111169280.5A 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium Active CN113987269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111169280.5A CN113987269B (en) 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111169280.5A CN113987269B (en) 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113987269A CN113987269A (en) 2022-01-28
CN113987269B true CN113987269B (en) 2025-02-14

Family

ID=79737725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111169280.5A Active CN113987269B (en) 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113987269B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278297B (en) * 2022-06-14 2023-11-28 北京达佳互联信息技术有限公司 Data processing method, device, equipment and storage medium based on drive video
CN115619897A (en) * 2022-10-14 2023-01-17 北京字跳网络技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115937375B (en) * 2023-01-05 2023-09-29 深圳市木愚科技有限公司 Digital split synthesis method, device, computer equipment and storage medium
CN118870042A (en) * 2023-04-28 2024-10-29 南京硅基智能科技有限公司 A method and device for generating live video, electronic equipment and storage medium
CN117372553B (en) * 2023-08-25 2024-05-10 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal
CN117593473B (en) * 2024-01-17 2024-06-18 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797897A (en) * 2020-06-03 2020-10-20 浙江大学 A deep learning-based audio-generated face image method
CN112330781A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model and generating human face animation
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7472063B2 (en) * 2002-12-19 2008-12-30 Intel Corporation Audio-visual feature fusion and support vector machine useful for continuous speech recognition
CN111429885B (en) * 2020-03-02 2022-05-13 北京理工大学 A method for mapping audio clips to face and mouth keypoints
CN111432233B (en) * 2020-03-20 2022-07-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797897A (en) * 2020-06-03 2020-10-20 浙江大学 A deep learning-based audio-generated face image method
CN112330781A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model and generating human face animation
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics

Also Published As

Publication number Publication date
CN113987269A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN113987269B (en) Digital human video generation method, device, electronic device and storage medium
WO2021103698A1 (en) Face swapping method, device, electronic apparatus, and storage medium
CN112333179B (en) Live broadcast method, device and equipment of virtual video and readable storage medium
US10776977B2 (en) Real-time lip synchronization animation
CN109068174B (en) Video frame rate up-conversion method and system based on cyclic convolution neural network
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
KR20210124312A (en) Interactive object driving method, apparatus, device and recording medium
WO2023173890A1 (en) Real-time voice recognition method, model training method, apparatus, device, and storage medium
JP2022172173A (en) Image editing model training method and device, image editing method and device, electronic apparatus, storage medium and computer program
CN111401101A (en) Video generation system based on portrait
CN113282791B (en) Video generation method and device
CN112733616A (en) Dynamic image generation method and device, electronic equipment and storage medium
CN117115317B (en) Avatar driving and model training method, apparatus, device and storage medium
CN117636481B (en) A multimodal joint gesture generation method based on diffusion model
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN113886644A (en) Digital human video generation method and device, electronic equipment and storage medium
CN113763232B (en) Image processing method, device, equipment and computer readable storage medium
CN113689527B (en) Training method of face conversion model and face image conversion method
CN115376482B (en) Face action video generation method and device, readable medium and electronic equipment
CN118644596B (en) Face key point moving image generation method and related equipment
CN114567693A (en) Video generation method and device and electronic equipment
WO2024066549A1 (en) Data processing method and related device
CN114820891A (en) Lip shape generating method, device, equipment and medium
KR102514580B1 (en) Video transition method, apparatus and computer program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant