CN116385629A - Digital human video generation method and device, electronic equipment and storage medium
- Publication number: CN116385629A
- Application number: CN202310132741.4A
- Authority: CN (China)
- Prior art keywords: face, model, preset, models, phoneme
- Legal status: Pending
Classifications
- G06T17/00 - Three-dimensional [3D] modelling, e.g. data description of 3D objects (G - Physics; G06 - Computing; G06T - Image data processing or generation, in general)
- G06T15/005 - General purpose rendering architectures (G06T15/00 - 3D [three-dimensional] image rendering)
- G06V10/806 - Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level (G06V - Image or video recognition or understanding; G06V10/70 - using pattern recognition or machine learning; G06V10/77 - Processing image or video features in feature spaces, e.g. PCA, ICA or SOM; G06V10/80 - Fusion of data from various sources)
- G06V40/161 - Detection; Localisation; Normalisation (G06V40/00 - Recognition of biometric, human-related or animal-related patterns; G06V40/10 - Human or animal bodies; G06V40/16 - Human faces, e.g. facial parts, sketches or expressions)
Abstract
The invention discloses a digital human video generation method and device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a plurality of source video frames from a real speaking video of a target person according to the duration of a target audio; carrying out 3D face modeling on each source video frame, and setting the expression parameters of the obtained first 3D face models to zero to obtain a plurality of second 3D face models; fusing each second 3D face model with each third 3D face model generated from each phoneme according to the time sequence of the phonemes, and rendering a face image sequence; fusing the face image sequence with each source video frame according to the time sequence, and setting a preset area in each fused image to black to obtain a plurality of rendering frames; and inputting each rendering frame into an image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video, thereby improving the consistency of the face between the digital human video and the real speaking video.
Description
Technical Field
The present disclosure relates to the field of computer vision, and more particularly, to a method and apparatus for generating digital human video, an electronic device, and a storage medium.
Background
Audio-driven digital human video generation produces, from an audio clip and a real speaking video of a target person, a video of the target person speaking in synchronization with the audio. It has wide application, for example in digital virtual humans, lip synchronization for game or animation character dubbing, and speech translation with synchronized voice and lips.
In the prior art, generating the digital human video requires training a mapping model from audio to expression parameters and pose parameters. The audio is input into the mapping model to obtain expression and pose parameters, these parameters replace the corresponding parameters of the 3D face model built from a source video frame, a face image is obtained through rendering, the face image is fused with the source video frame to obtain a rough video frame, and the rough video frame is optimized according to the source video frame to obtain the final digital human video.
However, because the face image and the source video frame are directly fused in the prior art, when the face shape in the face image differs greatly from the face shape in the source video frame, the chin and other parts become uncoordinated with the neck region, and a high-fidelity face effect is difficult to render.
Therefore, how to improve the consistency of the face between the digital human video and the real speaking video is a technical problem that needs to be solved.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiment of the application discloses a method and device for generating a digital human video, an electronic device and a storage medium, which are used for improving the consistency of the face between the digital human video and the real speaking video.
In a first aspect, a method for generating a digital human video is provided, the method comprising: acquiring a plurality of source video frames from a real speaking video of a target person according to the duration of a target audio; carrying out 3D face modeling on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and setting the expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models; generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of the phonemes, and rendering a face image sequence consisting of a plurality of face images; fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in each fused image to black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image; and inputting each rendering frame into an image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames.
In a second aspect, there is provided an apparatus for generating a digital human video, the apparatus comprising: an acquisition module, used for acquiring a plurality of source video frames from the real speaking video of the target person according to the duration of the target audio; a modeling module, used for carrying out 3D face modeling on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and setting the expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models; a first fusion module, used for generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of the phonemes, and rendering a face image sequence consisting of a plurality of face images; a second fusion module, used for fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in each fused image to black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image; and a synthesis module, used for inputting each rendering frame into an image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames.
In a third aspect, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of generating digital human video of the first aspect via execution of the executable instructions.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the method for generating digital human video according to the first aspect.
By applying the technical scheme, a plurality of source video frames are acquired from the real speaking video of the target person according to the duration of the target audio; 3D face modeling is carried out on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and the expression parameters of the first 3D face models are set to zero to obtain a plurality of second 3D face models; a plurality of third 3D face models are generated according to each phoneme in the target audio, each second 3D face model and each third 3D face model are fused according to the time sequence of the phonemes, and a face image sequence consisting of a plurality of face images is rendered; the face image sequence and each source video frame are fused according to the time sequence to obtain a plurality of fused images, and a preset area in each fused image is set to black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image; and each rendering frame is input into an image conversion model, and the target audio is synthesized with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames. By learning the texture features of the preset area on the periphery of the face image, the generalization capability of the image conversion model is improved, and the consistency of the face between the digital human video and the real speaking video is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method for generating digital human video according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of fusing and rendering each second 3D face model and each third 3D face model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a method for generating digital human video according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of a digital human video generating device according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It is noted that other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise construction set forth herein below and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
A method of generating digital human video according to an exemplary embodiment of the present application is described below with reference to fig. 1-2. It should be noted that the following application scenario is only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in any way in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
The embodiment of the application provides a method for generating a digital human video, as shown in fig. 1, the method comprises the following steps:
step S101, a plurality of source video frames are obtained from the true speaking video of the target person according to the duration of the target audio.
In this embodiment, the target audio may be pre-recorded voice audio, natural voice audio of a natural person speaking, or voice audio obtained by performing voice synthesis on input text information according to a preset voice synthesis algorithm. Correspondingly, a section of prerecorded voice audio input by a user can be received and used as target audio; or receiving and storing natural voice audio of natural person speaking as target audio; or receiving text information input by a user, and performing voice synthesis on the text information based on a preset voice synthesis algorithm to obtain target audio.
The real speaking video of the target person can be a video input by a user, or can be a video when the target person speaks in real time. In order to obtain a better effect, in a specific application scenario of the application, the duration of the real speaking video of the target person should be not less than a preset duration, such as 2 minutes.
The number of frames of the digital human video to be generated can be determined according to the duration of the target audio, and a plurality of source video frames are acquired from the real speaking video of the target person according to that number of frames. In order to obtain a better effect, in a specific application scenario of the application, each source video frame should contain a complete face image.
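As a minimal sketch (not taken from the patent text), the frame count can be derived from the audio duration and the video frame rate, with the source video sampled cyclically when it is shorter than the target audio; the default frame rate and the function name are illustrative assumptions.

```python
def sample_source_frames(video_frames, audio_duration_s, fps=25):
    """Sketch: pick as many source frames as the digital human video needs."""
    n_frames = int(round(audio_duration_s * fps))          # frames to generate
    # Wrap around the real speaking video if the target audio is longer than it.
    return [video_frames[i % len(video_frames)] for i in range(n_frames)]
```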
Optionally, the format of the target audio may be any one of mp3, wma, aac, ogg, mpc, flac, ape and other formats, and the format of the true speaking video of the target person may be any one of wmv, asf, asx, rm, rmvb, mpg, mpeg, mpe, 3gp, mov, mp4, m4v, avi, dat, mkv, flv, vob and other formats, which can be flexibly selected by those skilled in the art according to actual needs.
Step S102, 3D face modeling is conducted on each source video frame based on a preset 3D face reconstruction algorithm, a plurality of first 3D face models are obtained, expression parameters of the first 3D face models are set to be zero, and a plurality of second 3D face models are obtained.
In this embodiment, a face region may be obtained from a source video frame based on a face detection technology, and then 3D face modeling may be performed on each face region based on a preset 3D face reconstruction algorithm, so as to obtain a plurality of first 3D face models. And setting the expression parameters of the first 3D face model to zero so as to remove the expression and the mouth shape of the first 3D face model and generate a plurality of second 3D face models.
Optionally, the preset 3D face reconstruction algorithm may be 3DMM (3D Morphable Face Model), a statistical model of 3D face shape and texture that can represent an arbitrary face as a combination of a set of basis faces. Each first 3D face model is characterized by a set of 3DMM parameters, which may include shape parameters, texture parameters, brightness parameters, expression parameters, pose parameters, and the like. The preset 3D face reconstruction algorithm may also be DECA (Detailed Expression Capture and Animation), which robustly generates UV displacement maps from a low-dimensional latent representation consisting of person-specific detail parameters and general expression parameters, while a regressor is trained to predict the detail, shape, albedo, expression, pose and illumination parameters from a single picture. A person skilled in the art can also adopt other preset 3D face reconstruction algorithms according to actual needs, which does not affect the protection scope of the application.
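A minimal sketch of zeroing the expression parameters is given below, assuming each first 3D face model is held as a dictionary of 3DMM parameter arrays; the key name "expression" is an assumption rather than the patent's notation.

```python
import numpy as np

def to_second_model(first_model: dict) -> dict:
    """Sketch: derive a second 3D face model by zeroing the expression coefficients."""
    second_model = dict(first_model)                      # keep shape/texture/pose parameters
    second_model["expression"] = np.zeros_like(first_model["expression"])
    return second_model
```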
Step S103, generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of each phoneme, and rendering a face image sequence composed of a plurality of face images.
In this embodiment, the phonemes are the smallest phonetic units that make up a syllable, and any audio segment is composed of a limited number of phonemes. A plurality of third 3D face models are generated from each phoneme in the target audio, and therefore the third 3D face models characterize the pronunciation characteristics of each phoneme. And fusing each second 3D face model and each third 3D face model according to the time sequence of each phoneme, and rendering to obtain a face image sequence consisting of a plurality of face images.
In some embodiments of the present application, the fusing each of the second 3D face models and each of the third 3D face models according to the time sequence of each of the phonemes, and rendering a face image sequence composed of a plurality of face images, as shown in fig. 2, includes the following steps:
step S1031, fusing each of the second 3D face models and each of the third 3D face models according to the time sequence, to obtain a plurality of fourth 3D face models.
In this embodiment, the second 3D face models and the third 3D face models are sequentially fused according to the time sequence, so as to obtain a plurality of fourth 3D face models arranged according to the time sequence. It may be understood that fusing each second 3D face model and each third 3D face model refers to fusing corresponding model parameters, and a specific fusion process is obvious to those skilled in the art, and will not be described herein.
Step S1032, expanding the pronunciation starting point and the pronunciation ending point of each phoneme by a preset number of frames so as to form an overlapping interval between every two adjacent phonemes.
In this embodiment, in order to achieve a mouth-shape effect that better matches normal speaking, the pronunciation of each phoneme needs to be extended. Specifically, the pronunciation starting point and the pronunciation ending point of each phoneme are expanded by a preset number of frames, so that an overlapping interval is formed between every two adjacent phonemes, where the preset number of frames may be one or more frames.
For example, if the phoneme "b" in a piece of speech spans from frame n to frame n+5 and the phoneme "o" spans from frame n+6 to frame n+12, the range of the phoneme "b" can be set to frames n-1 to n+6 and the range of the phoneme "o" to frames n+5 to n+13, so that the two phonemes overlap on frames n+5 and n+6 and form an overlapping interval.
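The expansion can be sketched as follows, assuming each phoneme is stored as a (phoneme, start_frame, end_frame) tuple; the one-frame padding default is illustrative.

```python
def expand_phoneme_intervals(intervals, pad_frames=1):
    """Sketch: pad each phoneme's frame range so adjacent phonemes overlap."""
    return [(p, max(0, start - pad_frames), end + pad_frames)
            for p, start, end in intervals]

# Example: [("b", 10, 15), ("o", 16, 22)] -> [("b", 9, 16), ("o", 15, 23)],
# so the two phonemes overlap on frames 15 and 16.
```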
Step S1033, carrying out mean weighted fusion on the parameters of the two fourth 3D face models corresponding to each overlapping interval according to preset weight parameters, and obtaining a plurality of fifth 3D face models.
In this embodiment, the preset weight parameters are determined by the pronunciation duration of each phoneme. Each overlapping interval corresponds to two adjacent phonemes, and each phoneme corresponds to a fourth 3D face model, so each overlapping interval corresponds to two fourth 3D face models. The parameters of the two fourth 3D face models are subjected to mean-weighted fusion according to the preset weight parameters to obtain a fifth 3D face model, which serves as the transition between the two adjacent phonemes; a plurality of overlapping intervals thus correspond to a plurality of fifth 3D face models.
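A minimal sketch of this fusion is shown below; deriving the weights from the pronunciation durations is one reasonable reading of the preset weight parameters, and params_a/params_b stand for the parameter vectors of the two fourth 3D face models on an overlapping interval.

```python
import numpy as np

def fuse_overlap(params_a: np.ndarray, params_b: np.ndarray,
                 dur_a: float, dur_b: float) -> np.ndarray:
    """Sketch: mean-weighted fusion yielding the fifth 3D face model parameters."""
    w_a = dur_a / (dur_a + dur_b)                 # weight from pronunciation durations
    return w_a * params_a + (1.0 - w_a) * params_b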
Step S1034, inserting each fifth 3D face model between each fourth 3D face model according to the time sequence, and rendering each fifth 3D face model and each fourth 3D face model to obtain the face image sequence.
In this embodiment, by inserting the fifth 3D face model as the transition between the fourth 3D face models corresponding to each adjacent phoneme, good connection between phonemes is ensured, and smoothness of mouth shape change is improved.
In some embodiments of the present application, after inserting each of the fifth 3D face models between the fourth 3D face models according to the time sequence, the method further includes:
and carrying out filtering processing on a model sequence consisting of the fifth 3D face model and the fourth 3D face model based on a preset filtering algorithm.
In this embodiment, the model sequence is filtered by a preset filtering algorithm, so that the model sequence more accords with the mouth shape consistency and integrity of normal speaking. Those skilled in the art can adopt different preset filtering algorithms according to actual needs, which do not affect the protection scope of the present application.
In some embodiments of the present application, the filtering processing, based on a preset filtering algorithm, on a model sequence formed by each of the fifth 3D face models and each of the fourth 3D face models includes:
And performing polynomial curve fitting on each 3D face model in the model sequence so that the variation of expression parameters between each 3D face model and the adjacent 3D face models in the model sequence meets preset conditions.
In this embodiment, polynomial curve fitting is performed on each fifth 3D face model and each fourth 3D face model in the model sequence, and the expression parameters of each frame are reconstructed, so that the variation of the expression parameters between each 3D face model and the adjacent 3D face models meets a preset condition. The preset condition may be that the variation is smaller than a preset amount, so that jitter frames with a large variation amplitude are filtered out and abrupt mouth-shape changes in the generated digital human video are avoided.
Optionally, in addition to polynomial curve fitting, the parameters of each fifth 3D face model and each fourth 3D face model may be median-filtered or Gaussian-filtered over a time window, so as to filter out abnormal data.
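A minimal smoothing sketch over one expression coefficient track is shown below; the polynomial degree, the median-filter kernel size, and the per-coefficient treatment are illustrative choices, not values fixed by the patent.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_track(track: np.ndarray, degree: int = 3, use_median: bool = False) -> np.ndarray:
    """Sketch: reconstruct one expression coefficient over the model sequence."""
    if use_median:
        return medfilt(track, kernel_size=5)       # alternative: median filter over a time window
    t = np.arange(len(track))
    coeffs = np.polyfit(t, track, degree)           # polynomial curve fitting
    return np.polyval(coeffs, t)                    # jitter-reduced coefficient values
```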
Step S104, fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in the fused images to be black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image.
In this embodiment, since the face images in the face image sequence do not contain hair and background information, each face image needs to be fused with the corresponding source video frame to obtain a plurality of fused images. Because the face shape of each rendered face image may differ from the face shape of the real face in the source video frame, direct fusion may make the chin and other parts uncoordinated with the neck region. Therefore, in this embodiment, a preset area in the fused image is set to black, where the preset area is arranged on the periphery of the face image along the contour line of the face image in the fused image. By setting this peripheral preset area to black, the subsequent training of the preset generative adversarial model can learn not only the transformation from the texture of the rendered face image to the real face texture but also the texture features of the small region outside the face image boundary, which improves the generalization capability of the image conversion model and makes the target video frame sequence output by the image conversion model better match the real face.
In some embodiments of the present application, before setting a preset area in the fused image to black to obtain a plurality of rendered frames, the method further includes:
Determining the contour line according to the coordinate data of the face image in the fusion image;
determining a peripheral contour line on the periphery of the face image in the fusion image, wherein the distance between the peripheral contour line and the contour line is a preset distance;
and determining the preset area according to the contour line and the peripheral contour line.
In this embodiment, the preset area is determined by the contour line and the peripheral contour line: the contour line is determined from the coordinate data of the face image in the fused image, and the peripheral contour line is obtained by extending a preset distance outward from the contour line toward the periphery of the face image, so that the preset area can be accurately located in the fused image.
Alternatively, the distance between the contour line and the peripheral contour line may not be a fixed preset distance, and a certain change may be generated at different positions, as long as the minimum distance between the contour line and the peripheral contour line is not less than the preset distance.
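A minimal sketch of blacking out the preset region is given below, assuming a binary face mask is available for the fused image and that the peripheral contour is approximated by morphological dilation; the use of OpenCV and the 10-pixel preset distance are assumptions.

```python
import cv2
import numpy as np

def blacken_preset_region(fused_img: np.ndarray, face_mask: np.ndarray,
                          preset_distance: int = 10) -> np.ndarray:
    """Sketch: set the ring between the face contour and a peripheral contour to black."""
    kernel = np.ones((2 * preset_distance + 1, 2 * preset_distance + 1), np.uint8)
    outer = cv2.dilate(face_mask, kernel)            # region bounded by the peripheral contour
    ring = cv2.subtract(outer, face_mask)             # preset area between the two contours
    rendering_frame = fused_img.copy()
    rendering_frame[ring > 0] = 0                      # set the preset area to black
    return rendering_frame
```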
Step S105, inputting each rendering frame into an image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, where the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relationship between the rendering frames and the source video frames.
In this embodiment, in order to bring the images in the rendering frames closer to the source video frames, further optimization is required: a preset generative adversarial model is trained in advance according to the mapping relationship between the rendering frames and the source video frames to obtain the image conversion model. After the rendering frames are obtained, each rendering frame is input into the image conversion model, the image conversion model outputs an optimized target video frame sequence, and the target audio and the target video frame sequence are then synthesized, so that a digital human video of the target person with an accurate mouth shape and no jitter can finally be generated.
Alternatively, the preset generative adversarial model may be a MemoryGAN model, which includes a generator, a discriminator, and a memory network.
Alternatively, the synthesis of the target audio and target video frame sequences may be accomplished by FFmpeg (Fast Forward Mpeg) encoding.
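A minimal synthesis sketch invoking the FFmpeg command line is shown below; the file names, the frame-naming pattern, and the 25 fps rate are illustrative assumptions.

```python
import subprocess

def synthesize(frame_pattern="frames/%05d.png", audio_path="target.wav",
               out_path="digital_human.mp4", fps=25):
    """Sketch: mux the optimized target video frames with the target audio."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # target video frame sequence
        "-i", audio_path,                               # target audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest", out_path,
    ], check=True)
```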
In some embodiments of the present application, before generating the plurality of third 3D face models from each phoneme in the target audio, the method further includes:
establishing a preset phoneme library according to the corresponding relation between different phonemes and different mouth shapes of the 3D face models;
screening a 3D face model set from a preset phoneme library according to the mouth shape of each first 3D face model;
Wherein each third 3D face model is obtained from the set of 3D face models according to each phoneme.
In this embodiment, different phonemes correspond to different mouth shapes, and a preset phoneme library may be established according to the correspondence between different phonemes and 3D face models with different mouth shapes, so that the preset phoneme library contains 3D face models with different mouth shapes and each 3D face model in the library corresponds to one phoneme. A group of 3D face models is screened from the preset phoneme library according to the mouth shapes of the first 3D face models and serves as the 3D face model set; since the real speaking video of the target person has a certain length, the 3D face model set can cover the 3D face models corresponding to the various phonemes. The 3D face model corresponding to each phoneme is then selected from the 3D face model set to obtain the third 3D face models. Because mouth-shape models can be obtained directly from the preset phoneme library, the large-scale data training required by a mapping model is avoided, so efficiency is improved while the consistency between the audio and the mouth shapes is also improved.
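A minimal lookup sketch is given below, assuming the preset phoneme library and the screened 3D face model set are plain dictionaries keyed by phoneme; the dictionary layout is an assumption.

```python
def select_third_models(phoneme_sequence, screened_set, phoneme_library):
    """Sketch: fetch one third 3D face model per phoneme in the target audio."""
    models = []
    for phoneme in phoneme_sequence:
        # Prefer the model screened for this speaker; fall back to the library entry.
        models.append(screened_set.get(phoneme, phoneme_library[phoneme]))
    return models
```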
In some embodiments of the present application, before generating the plurality of third 3D face models from each phoneme in the target audio, the method further includes:
Performing voice recognition on the target audio based on a preset voice recognition algorithm, and acquiring text data and timestamp information corresponding to the text data according to a voice recognition result;
and obtaining each phoneme according to the pinyin information of the text data and the time stamp information.
In this embodiment, the target audio is subjected to voice recognition based on a preset voice recognition algorithm, so that corresponding text data and time stamp information aligned with the text data can be obtained, then the text data is converted into corresponding pinyin information, and each phoneme can be obtained based on the pinyin information and the time stamp information, so that each phoneme can be obtained more accurately.
Alternatively, the preset speech recognition algorithm may be any one of algorithms including a Dynamic Time Warping (DTW) based algorithm, a non-parametric model based Vector Quantization (VQ) method, a parametric model based Hidden Markov Model (HMM) method, an Artificial Neural Network (ANN) based algorithm, and a support vector machine.
It will be appreciated that if the target audio is audio in a language other than chinese, each phoneme may be acquired based on word pronunciation information corresponding to the text data and the time stamp information because there is no pinyin information.
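A minimal sketch of deriving phonemes from the recognition result is shown below; it assumes per-character timestamps in (char, start, end) form and uses the third-party pypinyin package for pinyin conversion, keeping whole syllables rather than splitting initials and finals for brevity.

```python
from pypinyin import lazy_pinyin   # third-party pinyin converter (assumed available)

def text_to_phonemes(char_timestamps):
    """Sketch: map recognized characters plus timestamps to timed pronunciation units."""
    phonemes = []
    for char, start, end in char_timestamps:
        syllable = lazy_pinyin(char)[0]          # e.g. "ni" for a Chinese character
        phonemes.append((syllable, start, end))
    return phonemes
```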
By applying the technical scheme, a plurality of source video frames are acquired from the real speaking video of the target person according to the duration of the target audio; 3D face modeling is carried out on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and the expression parameters of the first 3D face models are set to zero to obtain a plurality of second 3D face models; a plurality of third 3D face models are generated according to each phoneme in the target audio, each second 3D face model and each third 3D face model are fused according to the time sequence of the phonemes, and a face image sequence consisting of a plurality of face images is rendered; the face image sequence and each source video frame are fused according to the time sequence to obtain a plurality of fused images, and a preset area in each fused image is set to black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image; and each rendering frame is input into an image conversion model, and the target audio is synthesized with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames. By learning the texture features of the preset area on the periphery of the face image, the generalization capability of the image conversion model is improved, and the consistency of the face between the digital human video and the real speaking video is thereby improved.
In order to further explain the technical idea of the invention, the technical solution of the invention is described below with reference to a specific application scenario.
The embodiment of the application provides a method for generating digital human video, as shown in fig. 3, comprising the following steps:
step 1, acquiring a target audio and a real speaking video of a target person, acquiring a source video frame sequence from the real speaking video of the target person according to the duration of the target audio, and generating a phoneme sequence according to the time sequence of each phoneme in the target audio;
step 2, carrying out 3D face modeling on face areas in each source video frame based on a 3DMM algorithm, and screening a 3D face model set from a preset phoneme library according to the established 3D face model (namely a first 3D face model);
step 3, determining preset weight parameters of each phoneme according to the pronunciation time length of each phoneme;
step 4, acquiring a phoneme-based 3D face model (namely a third 3D face model) from a preset 3D face model set according to the phoneme sequence;
step 5, setting expression parameters in the 3D face model parameters of the source video frame to zero to obtain a 3D face model without expression (namely a second 3D face model);
step 6, carrying out weighted fusion on the phoneme-based 3D face model in the step 4 and the non-expression 3D face model of the source video frame in the step 5 according to time sequence and weight parameters to obtain a new 3D face model (namely a fourth 3D face model);
Step 7, rendering the new 3D face model, fusing the new 3D face model with a source video frame to obtain a fused image, and setting a preset area in the fused image to be black to obtain a plurality of rendered frames;
step 8, inputting the rendering frames obtained in step 7 into a trained MemoryGAN model to obtain optimized video frames;
and 9, synthesizing the optimized video frame and the target audio through FFmpeg coding to obtain a digital person video corresponding to the target person.
By applying this technical scheme, the phoneme-related 3D face models are obtained from the preset 3D face model set according to the phoneme sequence, so that a relatively accurate mouth-shape effect can be generated without large-scale data training; and by learning the texture features of the preset area on the periphery of the face image, the generalization capability of the image conversion model is improved, so that the consistency of the face between the digital human video and the real speaking video is improved.
The embodiment of the application also provides a device for generating the digital human video, as shown in fig. 4, the device comprises:
an obtaining module 401, configured to obtain a plurality of source video frames from a real speaking video of a target person according to a duration of a target audio;
The modeling module 402 is configured to perform 3D face modeling on each of the source video frames based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and set expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models;
a first fusion module 403, configured to generate a plurality of third 3D face models according to each phoneme in the target audio, fuse each of the second 3D face models and each of the third 3D face models according to a time sequence of each phoneme, and render a face image sequence composed of a plurality of face images;
the second fusion module 404 is configured to fuse the face image sequence and each of the source video frames according to the time sequence to obtain a plurality of fused images, and set a preset area in the fused images to be black to obtain a plurality of rendered frames, where the preset area is disposed at the periphery of the face image along a contour line of the face image;
the synthesizing module 405 is configured to input each of the rendering frames into an image conversion model, and synthesize the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, where the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relationship between the rendering frames and the source video frames.
In a specific application scenario, the apparatus further includes a determining module configured to:
determining the contour line according to the coordinate data of the face image in the fusion image;
determining a peripheral contour line on the periphery of the face image in the fusion image, wherein the distance between the peripheral contour line and the contour line is a preset distance;
and determining the preset area according to the contour line and the peripheral contour line.
In a specific application scenario, the device further includes a screening module, configured to:
establishing a preset phoneme library according to the corresponding relation between different phonemes and different mouth shapes of the 3D face models;
screening a 3D face model set from a preset phoneme library according to the mouth shape of each first 3D face model;
wherein each third 3D face model is obtained from the set of 3D face models according to each phoneme.
In a specific application scenario, the first fusion module 403 is specifically configured to:
fusing each second 3D face model and each third 3D face model according to the time sequence to obtain a plurality of fourth 3D face models;
expanding the pronunciation starting point and the pronunciation ending point of each phoneme according to a preset frame number to form an overlapped interval between every two adjacent phonemes;
Carrying out mean value weighted fusion on parameters of two fourth 3D face models corresponding to each overlapping interval according to preset weight parameters, and obtaining a plurality of fifth 3D face models;
inserting each fifth 3D face model between each fourth 3D face model according to the time sequence, and rendering each fifth 3D face model and each fourth 3D face model to obtain the face image sequence;
the preset weight parameters are determined according to the pronunciation time length of each phoneme.
In a specific application scenario, the device further includes a filtering module, configured to:
and carrying out filtering processing on a model sequence consisting of the fifth 3D face model and the fourth 3D face model based on a preset filtering algorithm.
In a specific application scenario, the filtering module is specifically configured to:
and performing polynomial curve fitting on each 3D face model in the model sequence so that the variation of expression parameters between each 3D face model and the adjacent 3D face models in the model sequence meets preset conditions.
In a specific application scenario, the apparatus further includes a recognition module, configured to:
performing voice recognition on the target audio based on a preset voice recognition algorithm, and acquiring text data and timestamp information corresponding to the text data according to a voice recognition result;
And obtaining each phoneme according to the pinyin information of the text data and the time stamp information.
By applying the technical scheme, the digital human video generating device comprises: an acquisition module, used for acquiring a plurality of source video frames from the real speaking video of the target person according to the duration of the target audio; a modeling module, used for carrying out 3D face modeling on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and setting the expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models; a first fusion module, used for generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of the phonemes, and rendering a face image sequence consisting of a plurality of face images; a second fusion module, used for fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in each fused image to black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image; and a synthesis module, used for inputting each rendering frame into the image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames. By learning the texture features of the preset area around the face image, the generalization capability of the image conversion model is improved, and the consistency of the face between the digital human video and the real speaking video is improved.
The embodiment of the invention also provides an electronic device, as shown in fig. 5, which comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 complete communication with each other through the communication bus 504,
a memory 503 for storing executable instructions of the processor;
a processor 501, configured to perform the following via execution of the executable instructions:
acquiring a plurality of source video frames from a real speaking video of a target person according to the duration of the target audio;
carrying out 3D face modeling on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and setting expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models;
generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of each phoneme, and rendering a face image sequence consisting of a plurality of face images;
fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in the fused images to be black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image;
and inputting each rendering frame into an image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames.
The communication bus may be a PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus, or an EISA (Extended Industry Standard Architecture ) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the figures are shown with only one bold line, but not with only one bus or one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include RAM (Random Access Memory ) or may include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), etc.; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of generating digital human video as described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of generating digital human video as described above.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (10)
1. A method for generating digital human video, the method comprising:
acquiring a plurality of source video frames from a real speaking video of a target person according to the duration of the target audio;
carrying out 3D face modeling on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and setting expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models;
generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of each phoneme, and rendering a face image sequence consisting of a plurality of face images;
fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in the fused images to be black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image;
and inputting each rendering frame into an image conversion model, and synthesizing the target audio with the target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames.
2. The method of claim 1, wherein prior to setting a preset area in the fused image to black to obtain a plurality of rendered frames, the method further comprises:
determining the contour line according to the coordinate data of the face image in the fusion image;
determining a peripheral contour line on the periphery of the face image in the fusion image, wherein the distance between the peripheral contour line and the contour line is a preset distance;
and determining the preset area according to the contour line and the peripheral contour line.
3. The method of claim 1, wherein prior to generating a plurality of third 3D face models from each phoneme in the target audio, the method further comprises:
establishing a preset phoneme library according to the corresponding relation between different phonemes and different mouth shapes of the 3D face models;
screening a 3D face model set from a preset phoneme library according to the mouth shape of each first 3D face model;
wherein each third 3D face model is obtained from the set of 3D face models according to each phoneme.
4. The method of claim 1, wherein fusing each of the second 3D face models and each of the third 3D face models according to the time sequence of each of the phonemes and rendering a face image sequence composed of a plurality of face images includes:
fusing each second 3D face model and each third 3D face model according to the time sequence to obtain a plurality of fourth 3D face models;
expanding the pronunciation starting point and the pronunciation ending point of each phoneme by a preset number of frames to form an overlapping interval between every two adjacent phonemes;
carrying out mean-value weighted fusion on the parameters of the two fourth 3D face models corresponding to each overlapping interval according to preset weight parameters to obtain a plurality of fifth 3D face models;
inserting each fifth 3D face model between the fourth 3D face models according to the time sequence, and rendering each fifth 3D face model and each fourth 3D face model to obtain the face image sequence;
wherein the preset weight parameters are determined according to the pronunciation duration of each phoneme.
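To make the transition handling of claim 4 concrete, the sketch below blends the parameter vectors of every two adjacent fourth 3D face models into a fifth model and inserts it between them, with the blend weight taken from the relative pronunciation lengths of the two phonemes. This simplifies the claim's multi-frame overlapping interval (formed by padding each phoneme's start and end by a preset number of frames) down to a single inserted model per boundary; the vector representation, the single-model simplification, and the duration-proportional weights are all assumptions.

```python
import numpy as np

def insert_transition_models(fourth_models: list, durations_s: list) -> list:
    """fourth_models[i]: parameter vector (np.ndarray) of the fused model for
    phoneme i; durations_s[i]: its pronunciation length in seconds.
    Inserts one mean-value weighted (fifth) model between every adjacent pair."""
    sequence = [fourth_models[0]]
    for i in range(len(fourth_models) - 1):
        a, b = fourth_models[i], fourth_models[i + 1]
        # Weight of the earlier phoneme, assumed proportional to its share of
        # the two pronunciation durations (an illustrative reading of the claim).
        w_a = durations_s[i] / (durations_s[i] + durations_s[i + 1])
        fifth = w_a * a + (1.0 - w_a) * b        # mean-value weighted fusion
        sequence.append(fifth)
        sequence.append(b)
    return sequence
```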
5. The method of claim 4, wherein after inserting each fifth 3D face model between the fourth 3D face models according to the time sequence, the method further comprises:
carrying out filtering processing on a model sequence consisting of the fifth 3D face models and the fourth 3D face models based on a preset filtering algorithm.
6. The method according to claim 5, wherein filtering the model sequence consisting of each of the fifth 3D face models and each of the fourth 3D face models based on a preset filtering algorithm comprises:
performing polynomial curve fitting on each 3D face model in the model sequence, so that the variation of the expression parameters between each 3D face model and its adjacent 3D face models in the model sequence meets preset conditions.
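Claims 5 and 6 add a temporal smoothing pass over the model sequence. One plausible reading of "polynomial curve fitting ... so that the variation of expression parameters between adjacent models meets preset conditions" is a sliding-window, low-order polynomial re-fit of each expression-parameter track (Savitzky-Golay-style); the window length, polynomial order, and use of NumPy least-squares fitting below are assumptions rather than choices stated in the application.

```python
import numpy as np

def smooth_expression_tracks(model_params: np.ndarray,
                             window: int = 5, order: int = 2) -> np.ndarray:
    """model_params: (T, D) array, one row per 3D face model in the sequence,
    one column per expression parameter. Each parameter track is re-fitted
    with a low-order polynomial over a sliding window, which bounds
    frame-to-frame jumps between adjacent models."""
    T, D = model_params.shape
    half = window // 2
    smoothed = model_params.astype(float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        xs = np.arange(lo, hi)
        deg = min(order, len(xs) - 1)            # shrink the fit near the ends
        for d in range(D):
            coeffs = np.polyfit(xs, model_params[lo:hi, d], deg=deg)
            smoothed[t, d] = np.polyval(coeffs, t)
    return smoothed
```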
7. The method of claim 1, wherein prior to generating a plurality of third 3D face models from each phoneme in the target audio, the method further comprises:
performing voice recognition on the target audio based on a preset voice recognition algorithm, and acquiring text data and timestamp information corresponding to the text data according to a voice recognition result;
and obtaining each phoneme according to the pinyin information of the text data and the time stamp information.
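For claim 7, a hypothetical realisation is to take the recognised Chinese text with per-character timestamps, convert each character to pinyin, split the pinyin into its initial and final as phonemes, and give each phoneme a share of the character's time span. The sketch below relies on the third-party pypinyin package for the pinyin step and divides time evenly between a character's phonemes; both the package choice and the even time split are assumptions, as the claim only requires phonemes derived from pinyin information and timestamp information.

```python
from pypinyin import lazy_pinyin, Style   # third-party pinyin converter (assumed installed)

def phonemes_from_asr(chars_with_times):
    """chars_with_times: list of (character, start_s, end_s) tuples, e.g. the
    character-level timestamps returned by a speech recogniser.
    Returns a list of (phoneme, start_s, end_s) tuples."""
    phonemes = []
    for char, start, end in chars_with_times:
        initial = lazy_pinyin(char, style=Style.INITIALS, strict=False)[0]
        final = lazy_pinyin(char, style=Style.FINALS, strict=False)[0]
        parts = [p for p in (initial, final) if p]      # some syllables have no initial
        step = (end - start) / max(len(parts), 1)
        for i, p in enumerate(parts):
            phonemes.append((p, start + i * step, start + (i + 1) * step))
    return phonemes

# Illustrative usage with hypothetical ASR output:
# phonemes_from_asr([("你", 0.00, 0.22), ("好", 0.22, 0.45)])
# -> [("n", 0.0, 0.11), ("i", 0.11, 0.22), ("h", 0.22, 0.335), ("ao", 0.335, 0.45)]
```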
8. A digital human video generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring a plurality of source video frames from the real speaking video of the target person according to the duration of the target audio;
the modeling module is used for carrying out 3D face modeling on each source video frame based on a preset 3D face reconstruction algorithm to obtain a plurality of first 3D face models, and setting expression parameters of the first 3D face models to zero to obtain a plurality of second 3D face models;
the first fusion module is used for generating a plurality of third 3D face models according to each phoneme in the target audio, fusing each second 3D face model and each third 3D face model according to the time sequence of each phoneme, and rendering a face image sequence composed of a plurality of face images;
the second fusion module is used for fusing the face image sequence and each source video frame according to the time sequence to obtain a plurality of fused images, and setting a preset area in the fused images to be black to obtain a plurality of rendering frames, wherein the preset area is arranged on the periphery of the face image along the contour line of the face image;
the synthesis module is used for inputting each rendering frame into an image conversion model, and synthesizing the target audio and a target video frame sequence output by the image conversion model to obtain a digital human video corresponding to the target person, wherein the image conversion model is generated in advance by training a preset generative adversarial model according to the mapping relation between the rendering frames and the source video frames.
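Because both claim 1 and claim 8 describe the image conversion model as a preset generative adversarial model trained on the mapping between rendering frames and source video frames, the training step amounts to paired image-to-image translation. The skeleton below follows a pix2pix-style conditional GAN loop in PyTorch; the generator and discriminator architectures, the L1 loss weight, and the optimiser settings are illustrative assumptions and not details disclosed by the application.

```python
import torch
import torch.nn.functional as F

def train_image_conversion_model(generator, discriminator, paired_loader,
                                 epochs: int = 20, l1_weight: float = 100.0,
                                 device: str = "cuda"):
    """paired_loader yields (rendering_frame, source_frame) tensor batches.
    pix2pix-style objective: adversarial loss plus weighted L1 reconstruction."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for _ in range(epochs):
        for rendering, source in paired_loader:
            rendering, source = rendering.to(device), source.to(device)
            fake = generator(rendering)

            # Discriminator step: real (rendering, source) pairs vs generated pairs.
            d_real = discriminator(torch.cat([rendering, source], dim=1))
            d_fake = discriminator(torch.cat([rendering, fake.detach()], dim=1))
            d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
                      F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            # Generator step: fool the discriminator and stay close to the source frame.
            d_fake_for_g = discriminator(torch.cat([rendering, fake], dim=1))
            g_loss = (F.binary_cross_entropy_with_logits(d_fake_for_g,
                                                         torch.ones_like(d_fake_for_g)) +
                      l1_weight * F.l1_loss(fake, source))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return generator
```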
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of generating digital human video of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method for generating digital human video according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310132741.4A (published as CN116385629A) | 2023-02-17 | 2023-02-17 | Digital human video generation method and device, electronic equipment and storage medium
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310132741.4A (published as CN116385629A) | 2023-02-17 | 2023-02-17 | Digital human video generation method and device, electronic equipment and storage medium
Publications (1)
Publication Number | Publication Date
---|---
CN116385629A | 2023-07-04
Family
ID=86972065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310132741.4A (CN116385629A, pending) | Digital human video generation method and device, electronic equipment and storage medium | 2023-02-17 | 2023-02-17
Country Status (1)
Country | Link
---|---
CN (1) | CN116385629A (en)
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117478818A (en) * | 2023-12-26 | 2024-01-30 | 荣耀终端有限公司 | Voice communication method, terminal and storage medium |
CN118474450A (en) * | 2024-05-09 | 2024-08-09 | 江苏省海洋资源开发研究院(连云港) | Digital portrait generation method and system based on video call |
WO2025086840A1 (en) * | 2023-10-26 | 2025-05-01 | 腾讯科技(深圳)有限公司 | Image generation method and apparatus, electronic device, and storage medium |
CN119941959A (en) * | 2025-04-08 | 2025-05-06 | 良胜数字创意设计(杭州)有限公司 | A cross-platform real-time digital human rendering system and method without GPU support |
Similar Documents
Publication | Title
---|---
CN112967212B (en) | A method, device, equipment and storage medium for synthesizing virtual characters
US11211060B2 | Using machine-learning models to determine movements of a mouth corresponding to live speech
CN116385629A (en) | Digital human video generation method and device, electronic equipment and storage medium
CN113077537B (en) | Video generation method, storage medium and device
CN112823380B (en) | Match lip movements and gestures in digital video to alternative audio
CN113901894B (en) | Video generation method, device, server and storage medium
CN110751708B (en) | Method and system for driving face animation in real time through voice
CN115830193A (en) | Method and device for generating digital human animation, electronic equipment and storage medium
CN110874557A (en) | Video generation method and device for voice-driven virtual human face
CN112866586A (en) | Video synthesis method, device, equipment and storage medium
CN112785670B (en) | Image synthesis method, device, equipment and storage medium
KR102319753B1 (en) | Method and apparatus for producing video contents based on deep learning
CN116363268A (en) | Method and device for generating mouth shape animation, electronic equipment and storage medium
CN115423904A (en) | Mouth shape animation generation method and device, electronic equipment and storage medium
KR20230172427A (en) | Talking face image synthesis system according to audio voice
CN115439614B (en) | Virtual image generation method and device, electronic equipment and storage medium
Wang et al. | Talking faces: Audio-to-video face generation
CN119274534B (en) | Text-driven lip-sync digital human generation method, device, equipment and medium
CN117975991B (en) | Digital person driving method and device based on artificial intelligence
CN113990295A (en) | A video generation method and device
CN117593473A (en) | Action image and video generation method, equipment and storage medium
CN118018800A (en) | Video audio generation method, device, computer equipment and storage medium
CN113963092B (en) | Audio and video fitting associated computing method, device, medium and equipment
CN113160799B (en) | Video generation method and device, computer-readable storage medium and electronic equipment
CN116597859A (en) | A Speech-Driven Talking Face Video Synthesis Method Including Head Movement Gesture
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination