
CN113987269B - Digital human video generation method, device, electronic device and storage medium - Google Patents

Digital human video generation method, device, electronic device and storage medium

Info

Publication number
CN113987269B
CN113987269B (application CN202111169280.5A)
Authority
CN
China
Prior art keywords
model
audio
sub
target
audio frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111169280.5A
Other languages
Chinese (zh)
Other versions
CN113987269A (en)
Inventor
王鑫宇
刘炫鹏
常向月
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202111169280.5A priority Critical patent/CN113987269B/en
Publication of CN113987269A publication Critical patent/CN113987269A/en
Application granted granted Critical
Publication of CN113987269B publication Critical patent/CN113987269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract


Embodiments of the present disclosure provide a digital human video generation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a target audio and a target face image; for each audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame depicts the person indicated by the target face image uttering the audio indicated by the audio frame; and generating a digital human video based on the generated target images. Embodiments of the present disclosure can improve the efficiency of digital human generation.

Description

Digital human video generation method, device, electronic device and storage medium
Technical Field
The present disclosure relates to the technical field of digital human video generation, and in particular to a digital human video generation method and apparatus, an electronic device, and a storage medium.
Background
Digital human generation technology is maturing, and a number of digital human generation methods have emerged, such as methods based on pix2pix, pix2pixHD, Vid2Vid, few-shot video2video, NeRF, StyleGAN, and the like.
However, in these conventional schemes, if the generated face key points are inaccurate or the generated face sketch is of poor quality, the quality of the finally generated digital human picture is also poor.
Disclosure of Invention
In view of the above, to solve some or all of the technical problems above, embodiments of the present disclosure provide a digital human video generation method, apparatus, electronic device, and storage medium.
In a first aspect, an embodiment of the present disclosure provides a digital human video generating method, including:
acquiring target audio and target face images;
for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame;
Based on the generated target image, a digital human video is generated.
Optionally, in the method of any embodiment of the disclosure, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, output data of the third sub-model includes a target image, and
Inputting the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model, and generating a target image corresponding to the audio frame, wherein the method comprises the following steps:
inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector;
inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector;
combining the first hidden vector and the second hidden vector to obtain a combined vector;
and inputting the combined vector into the third sub-model to obtain a target image corresponding to the audio frame.
Optionally, in a method of any embodiment of the disclosure, the end-to-end model is trained by:
acquiring video data;
Extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm, taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image that corresponds to the sample audio and is generated by the generator, and taking the current generator as the end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training ending condition.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, includes:
Acquiring an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points;
the following first training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
Inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
If the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to execute the first training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
The following second training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio;
combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio;
inputting the combined vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and a target image extracted from the sample face image corresponding to the sample audio;
if the calculated second function value is less than or equal to a second preset threshold, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
Optionally, in the method of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated second function value is greater than the second preset threshold, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network, and continuing to execute the second training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the method of any embodiment of the disclosure, the preset training ending condition includes at least one of:
the function value of the preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
The function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is larger than or equal to a second preset value.
Optionally, in the method of any embodiment of the disclosure, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
Optionally, in the method of any embodiment of the disclosure, the audio frame sequence corresponding to the audio frame includes the audio frame and the preceding preset number of consecutive audio frames in the target audio.
In a second aspect, an embodiment of the present disclosure provides a digital human video generating apparatus, the apparatus including:
an acquisition unit configured to acquire a target audio and a target face image;
an input unit configured to, for an audio frame in the target audio, input an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model and generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame;
and a generation unit configured to generate a digital human video based on the generated target image.
Optionally, in the apparatus according to any of the embodiments of the present disclosure, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, output data of the third sub-model includes a target image, and
The generating unit is further configured to:
inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector;
inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector;
combining the first hidden vector and the second hidden vector to obtain a combined vector;
and inputting the combined vector into the third sub-model to obtain a target image corresponding to the audio frame.
Optionally, in an apparatus of any embodiment of the disclosure, the end-to-end model is trained by:
acquiring video data;
Extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image;
and using a machine learning algorithm, taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image that corresponds to the sample audio and is generated by the generator, and taking the current generator as the end-to-end model if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training ending condition.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, includes:
Acquiring an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points;
the following first training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
Inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
If the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to execute the first training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
The following second training step is performed:
inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio;
combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio;
inputting the combined vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio;
calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and a target image extracted from the sample face image corresponding to the sample audio;
if the calculated second function value is less than or equal to a second preset threshold, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
Optionally, in the apparatus of any embodiment of the present disclosure, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further includes:
If the calculated second function value is greater than the second preset threshold, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network, and continuing to execute the second training step based on the initial generative adversarial network with the updated model parameters.
Optionally, in the apparatus of any embodiment of the disclosure, the preset training ending condition includes at least one of:
the function value of the preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
The function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is larger than or equal to a second preset value.
Optionally, in the apparatus of any embodiment of the disclosure, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
Optionally, in the apparatus of any embodiment of the present disclosure, the audio frame sequence corresponding to the audio frame includes the audio frame and the preceding preset number of consecutive audio frames in the target audio.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory for storing a computer program;
A processor, configured to execute the computer program stored in the memory, where the computer program, when executed, implements the digital human video generation method of any embodiment of the first aspect of the present disclosure.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium storing a computer program which, when executed by a processor, implements the digital human video generation method of any embodiment of the first aspect described above.
In a fifth aspect, embodiments of the present disclosure provide a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the steps in the method as in any of the embodiments of the digital human video generation method of the first aspect described above.
According to the digital human video generation method provided by the embodiments of the present disclosure, a target audio and a target face image are acquired; then, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image are input into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio that contains the audio frame, the target area image is the area image of the target face image other than the mouth area image, and the target image corresponding to the audio frame is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame; finally, a digital human video is generated based on the generated target images. In this way, the end-to-end model directly produces the target images used to generate the digital human video, so the efficiency of digital human video generation is improved by increasing the speed at which the target images are generated.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided by embodiments of the present disclosure;
FIG. 2 is a flow chart of a digital human video generation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an application scenario for the embodiment of FIG. 2;
FIG. 4A is a flow chart of another digital human video generation method provided by an embodiment of the present disclosure;
Fig. 4B is a schematic structural diagram of a mouth region image generation model in a digital human video generation method according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a digital human video generating apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices, or modules, and do not represent any particular technical meaning nor logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, that A and B both exist, or that B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is an exemplary system architecture diagram of a digital human video generation method or a digital human video generation apparatus provided in an embodiment of the present disclosure.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit data (e.g., target audio and target face images), etc. Various client applications, such as audio and video processing software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server processing data transmitted by the terminal devices 101, 102, 103. As an example, the server 105 may be a cloud server.
It should be noted that, the server may be hardware, or may be software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should also be noted that the digital human video generation method provided by the embodiments of the present disclosure may be executed by a server, by a terminal device, or by a server and a terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, or sub-module) included in the digital human video generating apparatus may be entirely disposed in the server, entirely disposed in the terminal device, or disposed in the server and the terminal device respectively.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation. When the electronic device on which the digital human video generation method runs does not require data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or terminal device) on which the digital human video generation method runs.
Fig. 2 illustrates a flow 200 of a digital human video generation method provided by an embodiment of the present disclosure. The digital human video generation method comprises the following steps:
step 201, acquiring target audio and target face images.
In this embodiment, the execution subject of the digital human video generation method (e.g., the server or the terminal device shown in fig. 1) may acquire the target audio and the target face image from other electronic devices or locally.
The target audio may be any audio. The target audio may be used so that the digital human video generated in a subsequent step utters the sound indicated by the target audio. For example, the target audio may be voice audio, or audio generated by machine conversion of text to speech.
The target face image may be any face image. As an example, the target face image may be a captured image containing a face, or may be a frame of face image extracted from a video.
In some cases, there may be no association between the target audio and the target face image. For example, the target audio may be audio from a first person, the target facial image may be a facial image of a second person, wherein the second person may be a person other than the first person, or the target audio may be audio from the first person at a first time, the target facial image may be a facial image of the first person at a second time, wherein the second time may be any time other than the first time.
Step 202, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model for the audio frame in the target audio, and generating a target image corresponding to the audio frame.
In this embodiment, the execution body may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame.
The audio frame sequence corresponding to the audio frame is a sequence of continuous audio frames containing the audio frame in the target audio. The target region image is a region image other than the mouth region image in the target face image. The target image corresponding to the audio frame is used for indicating the person indicated by the target face image to send out the audio indicated by the audio frame. The end-to-end model may represent a correspondence between an audio frame sequence corresponding to an audio frame, a target region image in a target face image, and a target image corresponding to the audio frame.
Here, the audio frame sequence corresponding to the audio frame may be a sequence of a preset number of audio frames in the target audio that includes the audio frame. For example, the audio frame sequence may contain the audio frame and the 4 frames preceding it, or the audio frame together with the 2 frames preceding it and the 2 frames following it.
Optionally, the audio frame sequence corresponding to the audio frame includes the audio frame and the preceding preset number of consecutive audio frames in the target audio.
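For illustration only, a minimal Python sketch of how such a per-frame audio window could be assembled is given below. The window length of 5 frames (the current frame plus the preceding 4) and the policy of padding the earliest frames are assumptions drawn from the examples in this description, not requirements of the disclosure.

```python
from typing import List, Sequence

def build_audio_windows(audio_frames: Sequence, history: int = 4) -> List[list]:
    """For each audio frame, build the window of the frame itself plus the
    preceding `history` frames (an assumed windowing scheme)."""
    windows = []
    for i in range(len(audio_frames)):
        start = max(0, i - history)
        window = list(audio_frames[start:i + 1])
        # Pad by repeating the first frame so every window has history + 1 frames.
        while len(window) < history + 1:
            window.insert(0, audio_frames[0])
        windows.append(window)
    return windows
```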
In some optional implementations of this embodiment, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model. The input data of the first sub-model is an audio frame sequence corresponding to the audio frame. The output data of the first sub-model is a first hidden vector. The input data of the second sub-model is a target area image in the target face image. The output data of the second sub-model is a second hidden vector. The input data of the third sub-model includes the first hidden vector and the second hidden vector. The output data of the third sub-model includes a target image.
On this basis, the execution body may execute the step 202 in the following manner, so as to input the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model, and generate a target image corresponding to the audio frame:
And a first step of inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector.
The first sub-model may include model structures such as a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory) network. As an example, the first sub-model may include 2 CNN layers and 2 LSTM layers. The first hidden vector may be a sound coding vector, i.e., a vector output by an intermediate layer.
And a second step of inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector.
The second sub-model may include model structures such as CNN and LSTM. As an example, the second sub-model may include 4 CNN layers. The second hidden vector may be an encoding vector of a target region image (e.g., the target region image in the target face image, or the target region image in a sample face image), for example the hidden-space output of a joint encoder.
And thirdly, combining the first hidden vector and the second hidden vector to obtain a combined vector.
And step four, inputting the vector after combination into the third sub-model to obtain a target image corresponding to the audio frame.
The third sub-model may include model structures such as CNN and LSTM. As an example, the third sub-model may include 4 CNN layers. The third sub-model may represent a correspondence between the combined vector and the target image.
It can be appreciated that in the above alternative implementation manner, the target image corresponding to the audio frame is generated through the first sub-model, the second sub-model and the third sub-model included in the end-to-end model, so that the generating effect of the digital human video can be improved by improving the accuracy of the generated target image. In addition, in some cases, in the optional implementation manner, in the use process of the end-to-end model, operations such as key point extraction and inverse normalization processing are not needed, so that the accuracy of digital human video generation can be improved.
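The following is a minimal PyTorch sketch of the three-sub-model generator described above (2 CNN layers plus 2 LSTM layers for the first sub-model, 4 CNN layers for the second, 4 CNN layers for the third). All concrete layer widths, feature dimensions, the use of 1D convolutions over audio features, and the output resolution are illustrative assumptions; the disclosure only fixes the overall structure and data flow.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """First sub-model: 2 CNN layers + 2 LSTM layers; audio window -> first hidden vector."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, hidden, num_layers=2, batch_first=True)

    def forward(self, audio_seq):                      # (B, n_mels, T) audio-feature window
        x = self.conv(audio_seq).transpose(1, 2)       # (B, T, 256)
        _, (h, _) = self.lstm(x)
        return h[-1]                                   # first hidden vector, shape (B, hidden)

class FaceEncoder(nn.Module):
    """Second sub-model: 4 CNN layers; target area image -> second hidden vector."""
    def __init__(self, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, target_area_image):              # (B, 3, H, W), mouth region removed
        x = self.conv(target_area_image)
        return x.mean(dim=(2, 3))                      # second hidden vector, shape (B, hidden)

class Decoder(nn.Module):
    """Third sub-model: 4 transposed-CNN layers; combined vector -> target image."""
    def __init__(self, hidden=512):
        super().__init__()
        self.fc = nn.Linear(hidden, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, merged):                         # merged = concat of the two hidden vectors
        x = self.fc(merged).view(-1, 256, 4, 4)
        return self.deconv(x)                          # target image (B, 3, 64, 64) in this sketch

def generate_target_image(audio_enc, face_enc, decoder, audio_seq, target_area_image):
    z_audio = audio_enc(audio_seq)                     # first hidden vector
    z_face = face_enc(target_area_image)               # second hidden vector
    merged = torch.cat([z_audio, z_face], dim=1)       # combined vector
    return decoder(merged)                             # target image for this audio frame
```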
In some of the above alternative implementations, the above end-to-end model is trained by:
step one, obtaining video data.
The video data may be any video data containing voice and face images. In the video data, each video frame contains an audio frame and a face image, i.e., each audio frame has a corresponding one of the face images. For example, in the video data within one second, if the video within one second contains 5 frames, that is, 5 audio frames and 5 face images, the audio frames and the face images are in one-to-one correspondence.
And secondly, extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image.
And thirdly, using a machine learning algorithm, taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training ending condition, taking the current generator as the end-to-end model.
The preset training ending condition may include at least one of the following: the calculated loss function value is less than or equal to a preset threshold, or the probability, as judged by the discriminator, that the mouth region image generated by the generator is the mouth region image of the sample face image corresponding to the sample audio is 50%.
It will be appreciated that in the above case, the end-to-end model is obtained based on a generative adversarial network, so that the generation effect of the digital human video can be improved by improving the accuracy of the target image generated by the generator.
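The adversarial training procedure described above could be organized roughly as follows. This is a high-level sketch: the discriminator interface, the binary cross-entropy losses, and the numeric thresholds are assumptions that merely restate the preset training ending conditions (a sufficiently small loss, or a discriminator output of about 50% on generated images) in code form.

```python
import torch
import torch.nn as nn

def train_gan(generator, discriminator, samples, g_opt, d_opt,
              loss_threshold=0.05, max_epochs=100):
    """Outer adversarial loop; stops when a preset training ending condition holds."""
    bce = nn.BCELoss()
    for epoch in range(max_epochs):
        for audio_window, target_area_image, real_target_image in samples:
            # Generator (the model being trained): sample audio + target area image
            # -> predicted target image.
            fake = generator(audio_window, target_area_image)

            # Discriminator update. In the disclosure the discriminator focuses on the
            # mouth region of the generated image; here it scores whole images to keep
            # the sketch short.
            d_real = discriminator(real_target_image)
            d_fake = discriminator(fake.detach())
            d_loss = bce(d_real, torch.ones_like(d_real)) + \
                     bce(d_fake, torch.zeros_like(d_fake))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            # Generator update.
            g_loss = bce(discriminator(fake), torch.ones_like(d_fake))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        # Preset training ending conditions: the loss is small enough, or the
        # discriminator assigns roughly 50% probability to generated images being real.
        if g_loss.item() <= loss_threshold or abs(d_fake.mean().item() - 0.5) < 0.01:
            break
    return generator  # the current generator is taken as the end-to-end model
```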
In some cases, the preset training end condition also includes at least one of the following:
the first term is that a function value of a preset loss function calculated based on an audio frame sequence corresponding to an audio frame is smaller than or equal to a first preset value.
The audio frame sequence corresponding to the audio frame may be a sequence formed by a preset number of audio frames including the audio frame in the target audio. For example, the sequence of audio frames may contain the audio frame and the first 4 frames of the audio frame.
The second term is that the function value of the preset loss function calculated based on the audio frame sequence corresponding to the non-audio frame is larger than or equal to a second preset value.
The audio frame sequence corresponding to a non-audio frame (hereinafter referred to as a target frame) may be an audio frame sequence other than the audio frame sequence corresponding to the audio frame. For example, the audio frame sequence corresponding to the non-audio frame may be a sequence formed by a preset number of randomly selected audio frames in the video data or the target video. The target frame may or may not be included in the audio frame sequence corresponding to the non-audio frame.
In some cases, the number of audio frames included in the audio frame sequence corresponding to the audio frame may be equal to the number of audio frames included in the audio frame sequence corresponding to the non-audio frame.
It can be understood that, in the above case, the audio frame sequence corresponding to the audio frame (for example, the current frame and the preceding 4 frames) and the corresponding information of the sample face image are input into the discriminator, and the smaller the loss, the better; specifically, the 26 mouth key points inferred from the audio of the current frame and the preceding 4 frames and the 26 key points of the real mouth of the current frame are used, and the smaller the function value of the preset loss function, the better, so that the mouth produced by adversarial generation is more realistic, that is, the generation effect of the digital human video is better. Conversely, for an audio frame sequence not corresponding to the current frame (for example, 5 other audio frames), the 26 key points inferred from those 5 frames of audio and the 26 key points of the real mouth of the current frame are input into the discriminator, and the larger the function value of the preset loss function, the better, which likewise makes the mouth produced by the generator more realistic, that is, improves the generation effect of the digital human video.
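One hypothetical way to realize the key-point-based synchronization check described above is sketched below: the discriminator receives the 26 mouth key points inferred from an audio window together with the 26 real mouth key points of the current frame, and matched windows should yield a small loss while mismatched windows yield a large one. The network shape and the BCE loss are assumptions.

```python
import torch
import torch.nn as nn

class SyncDiscriminator(nn.Module):
    """Scores whether predicted mouth key points match the real mouth key points."""
    def __init__(self):
        super().__init__()
        # 26 predicted + 26 real 2D key points, flattened.
        self.net = nn.Sequential(
            nn.Linear(26 * 2 * 2, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),   # probability that the pair is "in sync"
        )

    def forward(self, pred_kpts, real_kpts):   # both (B, 26, 2)
        x = torch.cat([pred_kpts, real_kpts], dim=1).flatten(1)
        return self.net(x)

def sync_losses(disc, kpts_from_matched_audio, kpts_from_other_audio, real_kpts):
    """Smaller loss for the matched audio window, larger loss for a mismatched one."""
    bce = nn.BCELoss()
    p_match = disc(kpts_from_matched_audio, real_kpts)
    p_other = disc(kpts_from_other_audio, real_kpts)
    # Matched window: loss w.r.t. label 1 should be small.
    # Mismatched window: loss w.r.t. label 1 should be large.
    return bce(p_match, torch.ones_like(p_match)), bce(p_other, torch.ones_like(p_other))
```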
Optionally, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
In some application scenarios of the foregoing cases, the taking the sample audio as input data of a generator in a generative adversarial network to obtain a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, includes:
First, an initial generative adversarial network is acquired. The initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, where the input data of the fourth sub-model is a first hidden vector and the output data of the fourth sub-model is mouth key points.
After that, the following first training steps (including steps one to four) are performed:
Step one, inputting sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio.
And step two, inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain the predicted mouth key point corresponding to the sample audio.
And thirdly, calculating a first function value of a first preset loss function based on the predicted mouth key point corresponding to the sample audio and the mouth key point extracted from the sample face image corresponding to the sample audio.
And step four, if the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
Optionally, if the calculated first function value is greater than the first preset threshold, the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network are updated, and the first training step continues to be performed based on the initial generative adversarial network with the updated model parameters.
It can be understood that, in the above alternative implementation, whether the model parameters of the first sub-model and the model parameters of the fourth sub-model in the generative adversarial network can be used for inference is determined according to the magnitude of the first function value, and the digital human video is generated by the trained generator in the generative adversarial network, so that the generation effect of the digital human video is improved; in addition, in the stage of using the generator, the key points do not need to be obtained by a second model, so that the generation efficiency of the digital human video can be improved.
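A sketch of this first training step follows, under the assumption that the fourth sub-model is a small fully connected head over the first hidden vector and that the first preset loss function is a mean-squared error over the 26 mouth key points; the threshold and step limit are placeholders.

```python
import torch
import torch.nn as nn

class MouthKeypointHead(nn.Module):
    """Fourth sub-model: first hidden vector -> 26 predicted mouth key points (assumed form)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.fc = nn.Linear(hidden, 26 * 2)

    def forward(self, z_audio):
        return self.fc(z_audio).view(-1, 26, 2)

def first_training_step(audio_enc, kpt_head, samples, optimizer,
                        first_threshold=1e-3, max_steps=100000):
    mse = nn.MSELoss()                               # first preset loss function (assumed choice)
    for step, (sample_audio, real_mouth_kpts) in enumerate(samples):
        z_audio = audio_enc(sample_audio)            # first hidden vector
        pred_kpts = kpt_head(z_audio)                # predicted mouth key points
        loss = mse(pred_kpts, real_mouth_kpts)       # first function value
        if loss.item() <= first_threshold:
            break                                    # keep current first sub-model parameters
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step >= max_steps:
            break
    return audio_enc, kpt_head
```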
Optionally, the step of training to obtain the end-to-end model may further include performing a second training step (including the first step to the sixth step) as follows.
And the first step is to input the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio.
And the second step is to input a target area image in the sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio.
And thirdly, combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio.
And a fourth step of inputting the combined vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio.
And a fifth step of calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and the target image extracted from the sample face image corresponding to the sample audio.
And a sixth step of determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model, if the calculated second function value is less than or equal to the second preset threshold.
Optionally, if the calculated second function value is greater than the second preset threshold, the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network are updated, and the second training step continues to be performed based on the initial generative adversarial network with the updated model parameters.
It can be understood that, after the model parameters of the first sub-model and the model parameters of the fourth sub-model are fixed, whether the model parameters of the third sub-model can be used for inference is determined according to the magnitude of the second function value, and the digital human video is generated by the trained generator in the generative adversarial network, so that the generation effect of the digital human video is improved; moreover, in the stage of using the generator, the key points do not need to be obtained by a second model, so that the generation efficiency of the digital human video can be further improved.
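A corresponding sketch of the second training step is given below, assuming an L1 reconstruction loss as the second preset loss function and keeping the first sub-model parameters fixed as described; it reuses the sub-model sketches shown earlier.

```python
import torch
import torch.nn as nn

def second_training_step(audio_enc, face_enc, decoder, samples, optimizer,
                         second_threshold=0.02, max_steps=100000):
    l1 = nn.L1Loss()                                  # second preset loss function (assumed choice)
    for p in audio_enc.parameters():                  # first sub-model parameters stay fixed
        p.requires_grad_(False)
    for step, (sample_audio, target_area_image, real_target_image) in enumerate(samples):
        with torch.no_grad():
            z_audio = audio_enc(sample_audio)         # first hidden vector
        z_face = face_enc(target_area_image)          # second hidden vector
        merged = torch.cat([z_audio, z_face], dim=1)  # combined vector
        pred_image = decoder(merged)                  # predicted target image
        loss = l1(pred_image, real_target_image)      # second function value
        if loss.item() <= second_threshold:
            break                                     # keep current second/third sub-model parameters
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if step >= max_steps:
            break
    return face_enc, decoder
```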
Step 203, generating a digital human video based on the generated target image.
In this embodiment, the execution subject may generate the digital human video based on the generated respective target images.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the digital human video generation method of the present embodiment. In fig. 3, a server 310 (i.e., the execution subject described above) first acquires a target audio 301 and a target face image 304. For an audio frame 302 in the target audio 301, the server 310 inputs an audio frame sequence 303 corresponding to the audio frame 302 and a target area image 305 in the target face image 304 into a pre-trained end-to-end model 306 and generates a target image 307 corresponding to the audio frame 302, where the audio frame sequence 303 corresponding to the audio frame 302 is a sequence of consecutive audio frames in the target audio 301 that contains the audio frame 302, the target area image 305 is the area image of the target face image 304 other than the mouth area image, and the target image 307 corresponding to the audio frame 302 is used for indicating that the person indicated by the target face image utters the audio indicated by the audio frame. The server 310 then generates a digital human video 308 based on the generated target image 307.
According to the method provided by the embodiment of the disclosure, the target audio and the target face image are acquired; then, for each audio frame in the target audio, the audio frame sequence corresponding to the audio frame and the target area image in the target face image are input into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to instruct the person indicated by the target face image to send out the audio indicated by the audio frame; finally, a digital human video is generated based on the generated target images. Therefore, the end-to-end model is adopted to obtain the target images for generating the digital human video directly, so that the efficiency of generating the digital human video is improved by increasing the speed at which the target images are generated.
With further reference to fig. 4A, a flow 400 of yet another embodiment of a digital human video generation method is shown. The flow of the digital human video generation method comprises the following steps:
Step 401, acquiring target audio and target face images.
Step 402, for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector, inputting a target region image in the target face image into the second sub-model to obtain a second hidden vector, merging the first hidden vector and the second hidden vector to obtain a merged vector, and inputting the merged vector into the third sub-model to obtain a target image corresponding to the audio frame.
The input data of the first sub-model is an audio frame sequence corresponding to an audio frame, the output data of the first sub-model is a first hidden vector, the input data of the second sub-model is a target area image in the target face image, the output data of the second sub-model is a second hidden vector, the input data of the third sub-model comprises the first hidden vector and the second hidden vector, and the output data of the third sub-model comprises a target image.
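To make the data flow between the three sub-models concrete, here is a minimal PyTorch-style sketch. All module names, layer choices, hidden sizes, and the 64×64 output resolution are assumptions for illustration only; the patent does not prescribe a specific architecture.

```python
import torch
import torch.nn as nn

class EndToEndModel(nn.Module):
    """Sketch of the three-sub-model pipeline: audio sequence -> first hidden vector,
    mouth-removed face -> second hidden vector, merged vector -> target image."""

    def __init__(self, audio_dim=28, hidden_dim=512):
        super().__init__()
        # First sub-model: encodes the audio frame sequence into a hidden vector.
        self.audio_encoder = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        # Second sub-model: encodes the target area image (face without mouth).
        self.face_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Third sub-model: decodes the merged vector into a face image.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(scale_factor=8),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, audio_seq, face_region):
        _, (h, _) = self.audio_encoder(audio_seq)   # audio_seq: (B, T, audio_dim)
        lm1 = h[-1]                                  # first hidden vector
        lm2 = self.face_encoder(face_region)         # second hidden vector
        lm3 = torch.cat([lm1, lm2], dim=1)           # merged vector
        return self.decoder(lm3)                     # target image
```

In the later description of this embodiment, LM1, LM2 and LM3 correspond to `lm1`, `lm2` and `lm3` in this sketch.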
Step 403, generating a digital human video based on the generated target image.
As an example, the digital human video generation method of the present embodiment may be performed as follows.
First, the format of the data is introduced:
In the digital human video generation method, the size of the face sketch is 512×512×1, the size of the target face image is 512×512×3, and the combined size of the face sketch and the target face image is 512×512×4.
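The 512×512×4 combination is simply a channel-wise concatenation of the 1-channel face sketch and the 3-channel target face image. A small NumPy sketch with placeholder arrays:

```python
import numpy as np

face_sketch = np.zeros((512, 512, 1), dtype=np.float32)  # 1-channel face sketch
face_image = np.zeros((512, 512, 3), dtype=np.float32)   # 3-channel target face image

# Concatenate along the channel axis to obtain the 512x512x4 combined input.
combined = np.concatenate([face_sketch, face_image], axis=-1)
assert combined.shape == (512, 512, 4)
```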
The implementation process of the specific scheme is described below with reference to fig. 4B:
After the user audio (i.e., the target audio) is obtained, it is processed by an encoder (i.e., the first sub-model) to generate a sound coding vector LM1 (the middle layer, or hidden space, of the CNN/LSTM, i.e., the first hidden vector). The sound coding vector LM1 is then combined, in a channel-wise manner, with the original-picture vector LM2 (the hidden space produced by the joint encoder from the target area image) to obtain a channel synthesis vector LM3 (i.e., the merged vector, containing features of both the mouth and the face picture). LM3 is then processed by a decoder (i.e., the third sub-model, the decoding part of the GAN generation model) to obtain a synthesized digital human picture (i.e., the target image), and the pictures are finally assembled into a digital human video (a video comprising multiple frames).
In the training phase, this can be performed by the following steps:
Training is divided into two stages:
In the first stage, the sound (i.e., the sample audio) passes through a CNN and an LSTM, collectively referred to as the model LMEncoder (i.e., the first sub-model), and then through a fully connected layer (i.e., the fourth sub-model) to obtain 26 inferred key points (for example, the mouth key points may include 20 key points of the mouth and 6 key points of the chin). The first function value of the first preset loss function is computed from the 26 inferred key points and the real key points (i.e., the mouth key points extracted from the sample face image corresponding to the sample audio), and LMEncoder is trained with it.
In the second stage, after the first function value of the first preset loss function of the 26 key points is stable (for example, the calculated first function value is less than or equal to the first preset threshold value), model parameters of LMEncoder are fixed, and the training of the encoder and decoder LipGAN is started, which specifically includes the following steps:
first, video data including audio (i.e., sample audio) and pictures (i.e., sample face images corresponding to the sample audio) are prepared.
The data are processed at a frame rate of 25 frames per second: features are extracted from the audio, and face key points and the corresponding Canny edge lines are extracted from the pictures, i.e., 68 face key points are extracted from each video frame (i.e., from the sample face image corresponding to the sample audio). The audio features may be MFCCs extracted via Fourier transform, features extracted with a DeepSpeech model, or features from other algorithms (e.g., an ASR speech-recognition model).
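As one possible concrete reading of this data-preparation step, MFCC features can be extracted with librosa and Canny edge lines with OpenCV; a 68-point landmark detector (for example dlib) would supply the face key points. The sample rate, number of MFCC coefficients, and Canny thresholds below are assumptions, not values fixed by the patent.

```python
import cv2
import librosa

def extract_audio_features(wav_path, sr=16000, fps=25):
    """MFCC features aligned to a 25 fps video: hop length = sr / fps samples."""
    audio, _ = librosa.load(wav_path, sr=sr)
    hop = sr // fps
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                      # (num_frames, 13), one row per video frame

def extract_canny_lines(frame_bgr):
    """Canny edge map used as the face sketch for one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)   # (H, W) edge image
```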
Then, as shown in fig. 4B, after the voice passes through the CNN and LSTM, the voice coding vector LM1 is generated; the 26 mouth key points are then inferred through the fully connected layer, and a loss (i.e., the first function value) is calculated from the 26 inferred key points and the 26 real mouth key points of the face, so as to train LMEncoder.
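A minimal sketch of this first training stage: audio features go through a CNN and an LSTM (LMEncoder), then a fully connected head regresses the 26 key points, which are compared against the real mouth key points. The feature dimension, hidden size, and the choice of L1 loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LMEncoder(nn.Module):
    """CNN + LSTM over the audio feature sequence (first sub-model)."""
    def __init__(self, feat_dim=28, hidden_dim=512):
        super().__init__()
        self.cnn = nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(128, hidden_dim, batch_first=True)

    def forward(self, audio_feats):                 # (B, T, feat_dim)
        x = self.cnn(audio_feats.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(x)
        return h[-1]                                 # first hidden vector (B, hidden_dim)

encoder = LMEncoder()
keypoint_head = nn.Linear(512, 26 * 2)               # fully connected head: 26 (x, y) key points
criterion = nn.L1Loss()

audio_feats = torch.randn(8, 5, 28)                  # 5 audio frames of assumed 28-dim features
real_keypoints = torch.randn(8, 52)                  # 26 real mouth/chin key points (x, y)

pred = keypoint_head(encoder(audio_feats))           # 26 inferred key points
loss = criterion(pred, real_keypoints)               # first function value
loss.backward()
```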
Subsequently, after the loss (the first function value) stabilizes, the LMEncoder parameters (i.e., the first sub-model) are fixed; that is, once the LMEncoder model is trained, the encoder and decoder of LipGAN (i.e., the second and third sub-models) begin to be trained. Specifically, in the hidden layer, the hidden vector of the face picture with the mouth part removed (i.e., the original-picture vector LM2) and the hidden vector of the real human voice (i.e., the sound coding vector LM1) are merged into a 1024×1×1 vector (i.e., the channel synthesis vector LM3, containing features of both the mouth and the face picture), which is then passed through the decoder to generate a picture (i.e., the target image).
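The hand-off into the second stage, freezing LMEncoder and merging the two hidden vectors into the 1024-dimensional vector consumed by the decoder, could be sketched as follows. The sub-model modules are assumed to be those from the sketches above, and the L1 reconstruction term merely stands in for the unspecified second preset loss function; the adversarial terms are sketched separately after the discriminator description below.

```python
import torch
import torch.nn.functional as F

def stage_two_step(lm_encoder, face_encoder, decoder, audio_feats, face_region, real_image):
    """One generator-side step of the second stage (LipGAN encoder/decoder training)."""
    # Freeze the trained LMEncoder (first sub-model) before training encoder/decoder.
    for p in lm_encoder.parameters():
        p.requires_grad = False

    lm1 = lm_encoder(audio_feats)              # sound coding vector LM1
    lm2 = face_encoder(face_region)            # original-picture hidden vector LM2
    lm3 = torch.cat([lm1, lm2], dim=1)         # merged 1024-d channel synthesis vector LM3
    fake_image = decoder(lm3)                  # generated picture (target image)
    return F.l1_loss(fake_image, real_image)   # reconstruction term of the second function value
```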
It should be noted that, in both the first stage and the second stage, the mouth picture of one picture frame may be trained using either one frame or multiple frames of audio data. Specifically, when N frames of audio data are used to train the mouth picture (i.e., the 26 face key points) of the t-th picture frame, the audio data corresponding to frames t, t-1, t-2, …, t-(N-1) can be used, which improves the generation effect of the mouth picture and hence of the digital human picture. N can be larger than 1, and the larger N is, the better the mouth generation effect. For example, the final target image may be output using the current audio frame, the previous 4 audio frames, and the picture of the current frame excluding the mouth portion (i.e., the target area image), as sketched below.
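A sketch of building that N-frame audio window; how the very first frames (where t − i would be negative) are handled is not specified in the text, so clamping to frame 0 below is an assumption.

```python
def audio_window(audio_frames, t, n=5):
    """Return the audio frames t, t-1, ..., t-(n-1), padding with frame 0 at the start."""
    indices = [max(t - i, 0) for i in range(n)]
    return [audio_frames[i] for i in indices]

# Example: for t = 2 and n = 5 the window uses frames [2, 1, 0, 0, 0].
```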
In addition, a new loss function of the discriminator (i.e., the fourth sub-model) can be added in LipGAN to ensure the stability of image generation;
The current frame, the previous 4 frames (i.e., the audio frame sequence corresponding to the audio frame) and the current real picture (i.e., the target area image) are input into the discriminator, and the smaller the loss, the better. Specifically, the 26 key points inferred from the audio of the current frame and the previous 4 frames, together with the 26 real mouth key points of the current frame's face, are input into the discriminator to calculate the loss; the smaller this loss, the more realistic the mouth produced by the adversarial generator, i.e., the better the effect.
The other five frames (i.e., an audio frame sequence not corresponding to the current audio frame) and the current frame picture (i.e., the target area image) are input into the discriminator, and the larger the loss, the better. Specifically, the 26 key points inferred from 5 frames of audio that do not correspond to the current frame, together with the 26 real mouth key points of the current frame's face, are input into the discriminator to calculate the loss; the larger this loss, the more realistic the mouth produced by the adversarial generator, i.e., the better the effect.
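One way to read the two discriminator terms above: key-point pairs built from the matching audio window should be judged real, while pairs built from a non-matching window should be confidently rejected. The sketch below expresses this as a standard binary real/fake objective; the patent only states "smaller/larger is better", so the BCE formulation and the discriminator's assumed (B, 1) sigmoid output are illustrative choices, not the specified loss.

```python
import torch
import torch.nn as nn

def discriminator_loss(disc, kp_from_matching_audio, kp_from_other_audio, real_mouth_kp):
    """Binary real/fake objective over (inferred key points, real mouth key points) pairs.
    `disc` is assumed to map a concatenated key-point pair to a score in (0, 1)."""
    bce = nn.BCELoss()
    real_pair = torch.cat([kp_from_matching_audio, real_mouth_kp], dim=1)
    fake_pair = torch.cat([kp_from_other_audio, real_mouth_kp], dim=1)
    loss_real = bce(disc(real_pair), torch.ones(real_pair.size(0), 1))
    loss_fake = bce(disc(fake_pair), torch.zeros(fake_pair.size(0), 1))
    return loss_real + loss_fake
```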
In the reasoning (application) phase:
firstly, the audio of the current frame and the previous 4 frames (i.e., the audio frame sequence corresponding to the audio frame), or audio features extracted from it, is input into the model LMEncoder (i.e., the first sub-model) to obtain the hidden vector LM1 (i.e., the first hidden vector).
Then, the current picture with the mouth region removed (i.e., the target region image) is obtained and passed through the encoder to obtain the hidden vector IM2 (i.e., the second hidden vector).
Finally, the hidden vectors LM1 and IM2 are merged to obtain a merged hidden vector (i.e., the channel synthesis vector LM3, containing features of both the mouth and the face picture), which is input into the decoder (i.e., the third sub-model) to output the final picture (i.e., the target image). The digital human video is then output.
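Putting the inference steps together, with the module names from the earlier sketches and assumed tensor shapes, a minimal sketch is:

```python
import torch

@torch.no_grad()
def infer_frame(lm_encoder, face_encoder, decoder, audio_window_feats, mouth_removed_face):
    lm1 = lm_encoder(audio_window_feats)      # first hidden vector from current + previous 4 frames
    im2 = face_encoder(mouth_removed_face)    # second hidden vector from the mouth-removed picture
    merged = torch.cat([lm1, im2], dim=1)     # channel-merged hidden vector
    return decoder(merged)                    # final picture for this audio frame

# Generating the digital human video is then a loop over the audio frames,
# collecting the returned pictures and writing them out at the video frame rate.
```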
An acoustic inference model can be used to extract audio features from the audio. The input sound format may be wav (a lossless audio file format), and the frame rate may be 100, 50 or 25. The acoustic features may be MFCCs, or features extracted by models such as DeepSpeech, an ASR model, or wav2vec. The acoustic inference model may be an LSTM, BERT (Bidirectional Encoder Representations from Transformers), a Transformer, a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), etc. The 3DMM is a 3D morphable face model, a basic three-dimensional statistical model originally proposed to recover three-dimensional shape from two-dimensional face images; its authors collected three-dimensional head data of 200 faces and, by applying PCA (principal component analysis) to this data, obtained the principal components that represent face shape and texture.
In this embodiment, the specific implementation manners of the steps 401 to 403 may refer to the related descriptions of the corresponding embodiments of fig. 2, which are not described herein again. In addition, in addition to the above, the embodiments of the disclosure may further include the same or similar features and effects as those of the embodiment corresponding to fig. 2, which are not described herein.
The digital human video generation method in this embodiment generates the digital human video in an end-to-end manner: the audio is input and, combined with the hidden space of the joint encoder, the target image used to generate the digital human video is produced directly. In other words, no key points need to be obtained and no inverse normalization processing is required, so the efficiency is high. Furthermore, the audio features may be left unextracted, which further improves efficiency, while extracting audio features yields a better effect. In addition, the stability of target image generation can be maintained with the new discriminator loss function (i.e., the fourth sub-model).
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a digital human video generating apparatus, which corresponds to the above-described method embodiment, and which may include the same or corresponding features as the above-described method embodiment, and produce the same or corresponding effects as the above-described method embodiment, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the digital human video generation apparatus 500 of the present embodiment includes an acquisition unit 501, an input unit 502, and a generation unit 503. The acquisition unit 501 is configured to acquire a target audio and a target face image. The input unit 502 is configured to, for an audio frame in the target audio, input the audio frame sequence corresponding to the audio frame and the target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to instruct the person indicated by the target face image to send out the audio indicated by the audio frame. The generation unit 503 is configured to generate a digital human video based on the generated target images.
In the present embodiment, the acquisition unit 501 of the digital human video generation apparatus 500 may acquire the target audio and the target face image.
In this embodiment, the input unit 502 may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, and the target area image is an area image except for a mouth area image in the target face image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send the audio indicated by the audio frame.
In the present embodiment, the generation unit 503 may generate a digital human video based on the generated target image.
In some optional implementations of this embodiment, the end-to-end model includes a first sub-model, a second sub-model, and a third sub-model, input data of the first sub-model is an audio frame sequence corresponding to an audio frame, output data of the first sub-model is a first hidden vector, input data of the second sub-model is a target region image in the target face image, output data of the second sub-model is a second hidden vector, input data of the third sub-model includes the first hidden vector and the second hidden vector, output data of the third sub-model includes a target image, and
The generating unit is further configured to:
inputting an audio frame sequence corresponding to the audio frame into the first sub-model to obtain a first hidden vector;
inputting the target area image in the target face image into the second sub-model to obtain a second hidden vector;
combining the first hidden vector and the second hidden vector to obtain a combined vector;
and inputting the combined vector into the third sub-model to obtain a target image corresponding to the audio frame.
In some optional implementations of this embodiment, the end-to-end model is trained by:
acquiring video data;
Extracting an audio frame and a face image corresponding to the audio frame from the video data, taking an audio frame sequence corresponding to the extracted audio frame as sample audio, and taking the extracted face image as a sample face image;
and taking the sample audio as input data of a generator in the generated countermeasure network by adopting a machine learning algorithm to obtain a target image which corresponds to the sample audio and is generated by the generator, and taking the current generator as an end-to-end model if a discriminator in the generated countermeasure network determines that the target image generated by the generator meets a preset training ending condition.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, includes:
Acquiring an initial generation type countermeasure network, wherein the initial generation type countermeasure network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is a mouth key point;
the following first training step is performed:
inputting sample audio into a first sub-model included in an initial generation type countermeasure network to obtain a first hidden vector corresponding to the sample audio;
Inputting the first hidden vector corresponding to the sample audio to a fourth sub-model to obtain a predicted mouth key point corresponding to the sample audio;
calculating a first function value of a first preset loss function based on a predicted mouth key point corresponding to the sample audio and a mouth key point extracted from a sample face image corresponding to the sample audio;
If the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generation type countermeasure network as the model parameters of the first sub-model included in the end-to-end model after training.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, further includes:
If the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generation type countermeasure network, and continuously executing the first training step based on the initial generation type countermeasure network after the model parameters are updated.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, further includes:
The following second training step is performed:
inputting sample audio into a first sub-model included in an initial generation type countermeasure network to obtain a first hidden vector corresponding to the sample audio;
inputting a target area image in a sample face image corresponding to the sample audio to a second sub-model included in an initial generation type countering network to obtain a second hidden vector corresponding to the sample audio;
Combining the first hidden vector corresponding to the sample audio with the second hidden vector corresponding to the sample audio to obtain a combined vector corresponding to the sample audio;
The combined vector corresponding to the sample audio is input to a third sub-model included in an initial generation type countermeasure network, and a prediction target image corresponding to the sample audio is obtained;
Calculating a second function value of a second preset loss function based on a predicted target image corresponding to the sample audio and a target image extracted from a sample face image corresponding to the sample audio;
if the calculated second function value is less than or equal to the preset threshold, determining the model parameters of the second sub-model included in the current initial generation type countermeasure network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generation type countermeasure network as the model parameters of the third sub-model included in the trained end-to-end model.
In some optional implementations of this embodiment, the obtaining, with the sample audio as input data of a generator in a generative adversarial network, of a target image generated by the generator and corresponding to the sample audio, and, if a discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, using the current generator as an end-to-end model, further includes:
And if the calculated second function value is greater than the second preset threshold value, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generation type countermeasure network, and continuously executing the second training step based on the initial generation type countermeasure network after updating the model parameters.
In some optional implementations of this embodiment, the preset training ending condition includes at least one of:
the function value of the preset loss function calculated based on the audio frame sequence corresponding to the audio frame is smaller than or equal to a first preset value;
The function value of the preset loss function calculated based on an audio frame sequence not corresponding to the audio frame is greater than or equal to a second preset value.
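A hedged reading of these two conditions as a simple check (the two preset values are hyperparameters, and the two loss terms correspond to the matching / non-matching discriminator losses described earlier; the function name is illustrative):

```python
def training_finished(loss_matching_audio, loss_non_matching_audio,
                      first_preset_value, second_preset_value):
    """End training when the matching-audio loss is low enough and/or the
    non-matching-audio loss is high enough (the text requires at least one)."""
    return (loss_matching_audio <= first_preset_value or
            loss_non_matching_audio >= second_preset_value)
```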
In some optional implementations of this embodiment, the second sub-model is an encoder, and the third sub-model is a decoder corresponding to the encoder.
In some optional implementations of this embodiment, the audio frame sequence corresponding to the audio frame includes the audio frame, and the audio frame that is continuous with the previous preset number of frames of the audio frame in the target audio.
In the apparatus 500 provided in the foregoing embodiment of the present disclosure, the acquiring unit 501 may acquire a target audio and a target face image, then, the input unit 502 may input, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image except for a mouth area image in the target face image, the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send the audio indicated by the audio frame, and finally, the generating unit 503 may generate a digital human video based on the generated target image. Therefore, the end-to-end model is adopted to directly obtain the target image for generating the digital human video, so that the efficiency of generating the digital human video is improved by improving the speed of generating the target image.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and the electronic device 600 shown in fig. 6 includes at least one processor 601, a memory 602, and at least one network interface 604 and other user interfaces 603. The various components in the electronic device 600 are coupled together by a bus system 605. It is understood that the bus system 605 is used to enable connected communications between these components. The bus system 605 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 605 in fig. 6.
The user interface 603 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It is to be appreciated that the memory 602 in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 602 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 602 stores the following elements, executable units or data structures, or a subset or an extended set thereof: an operating system 6021 and application programs 6022.
The operating system 6021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 6022 include various applications, such as a media player, a browser, and the like, for implementing various application services. A program for implementing the method of the embodiment of the present disclosure may be included in the application programs 6022.
In the embodiment of the disclosure, the processor 601 is configured to execute the method steps provided in the method embodiments by calling a program or an instruction stored in the memory 602, specifically, a program or an instruction stored in the application program 6022, for example, including acquiring a target audio and a target face image, inputting, for an audio frame in the target audio, an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model, to generate a target image corresponding to the audio frame, where the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames including the audio frame in the target audio, the target area image is an area image except for a mouth area image in the target face image, and the target image corresponding to the audio frame is used to instruct a person indicated by the target face image to send the audio indicated by the audio frame, and generating a digital human video based on the generated target image.
The methods disclosed in the embodiments of the present disclosure may be applied to the processor 601 or implemented by the processor 601. The processor 601 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 601 or by instructions in the form of software. The processor 601 may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), an off-the-shelf field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the disclosure may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present disclosure may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software elements in a decoding processor. The software elements may be located in a random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory 602, and the processor 601 reads the information in the memory 602 and performs the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units for performing the above-described functions of the application, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be an electronic device as shown in fig. 6, and may perform all steps of the digital human video generating method shown in fig. 2, so as to achieve the technical effects of the digital human video generating method shown in fig. 2, and detailed description with reference to fig. 2 is omitted herein for brevity.
The disclosed embodiments also provide a storage medium (computer-readable storage medium). The storage medium here stores one or more programs. The storage medium may include volatile memory, such as random access memory, or nonvolatile memory, such as read only memory, flash memory, hard disk, or solid state disk, or a combination of the foregoing.
When the one or more programs in the storage medium are executable by the one or more processors, the digital human video generation method executed on the electronic device side is implemented.
The processor is used for executing a communication program stored in a memory to realize the following steps: acquiring a target audio and a target face image; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to instruct the person indicated by the target face image to send out the audio indicated by the audio frame; and generating the digital human video based on the generated target image.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
While the foregoing embodiments have been described in some detail for purposes of clarity of understanding, it will be understood that the foregoing embodiments are merely illustrative of the invention and are not intended to limit the scope of the invention, and that any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present disclosure are intended to be included within the scope of the invention.

Claims (11)

1. A method for generating a digital human video, characterized in that the method comprises: acquiring a target audio and a target face image, wherein the target face image is a frame of face image extracted from a video; for an audio frame in the target audio, inputting an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to indicate that the person indicated by the target face image emits the audio indicated by the audio frame; and generating a digital human video based on the generated target image; wherein the end-to-end model is trained by: acquiring video data; extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to an extracted audio frame as sample audio, and taking the extracted face image as a sample face image; acquiring, with a machine learning algorithm, an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points; and performing the following first training step: inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio; inputting the first hidden vector corresponding to the sample audio into the fourth sub-model to obtain predicted mouth key points corresponding to the sample audio; calculating a first function value of a first preset loss function based on the predicted mouth key points corresponding to the sample audio and the mouth key points extracted from the sample face image corresponding to the sample audio; and, if the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
2. The method according to claim 1, characterized in that the end-to-end model comprises a first sub-model, a second sub-model and a third sub-model, the input data of the first sub-model is the audio frame sequence corresponding to an audio frame, the output data of the first sub-model is a first hidden vector, the input data of the second sub-model is the target area image in the target face image, the output data of the second sub-model is a second hidden vector, the input data of the third sub-model comprises the first hidden vector and the second hidden vector, and the output data of the third sub-model comprises the target image; and the inputting of the audio frame sequence corresponding to the audio frame and the target area image in the target face image into the pre-trained end-to-end model to generate the target image corresponding to the audio frame comprises: inputting the audio frame sequence corresponding to the audio frame into the first sub-model to obtain the first hidden vector; inputting the target area image in the target face image into the second sub-model to obtain the second hidden vector; merging the first hidden vector and the second hidden vector to obtain a merged vector; and inputting the merged vector into the third sub-model to obtain the target image corresponding to the audio frame.
3. The method according to claim 2, characterized in that the taking of the sample audio as input data of the generator in the generative adversarial network to obtain the target image generated by the generator and corresponding to the sample audio and, if the discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further comprises: if the calculated first function value is greater than the first preset threshold, updating the model parameters of the first sub-model and the model parameters of the fourth sub-model included in the current initial generative adversarial network, and continuing to perform the first training step based on the initial generative adversarial network with the updated model parameters.
4. The method according to claim 2, characterized in that the taking of the sample audio as input data of the generator in the generative adversarial network to obtain the target image generated by the generator and corresponding to the sample audio and, if the discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further comprises: performing the following second training step: inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio; inputting the target area image in the sample face image corresponding to the sample audio into the second sub-model included in the initial generative adversarial network to obtain a second hidden vector corresponding to the sample audio; merging the first hidden vector corresponding to the sample audio and the second hidden vector corresponding to the sample audio to obtain a merged vector corresponding to the sample audio; inputting the merged vector corresponding to the sample audio into the third sub-model included in the initial generative adversarial network to obtain a predicted target image corresponding to the sample audio; calculating a second function value of a second preset loss function based on the predicted target image corresponding to the sample audio and the target image extracted from the sample face image corresponding to the sample audio; and, if the calculated second function value is less than or equal to a second preset threshold, determining the model parameters of the second sub-model included in the current initial generative adversarial network as the model parameters of the second sub-model included in the trained end-to-end model, and determining the model parameters of the third sub-model included in the current initial generative adversarial network as the model parameters of the third sub-model included in the trained end-to-end model.
5. The method according to claim 4, characterized in that the taking of the sample audio as input data of the generator in the generative adversarial network to obtain the target image generated by the generator and corresponding to the sample audio and, if the discriminator in the generative adversarial network determines that the target image generated by the generator meets a preset training end condition, taking the current generator as the end-to-end model, further comprises: if the calculated second function value is greater than the second preset threshold, updating the model parameters of the second sub-model and the model parameters of the third sub-model included in the current initial generative adversarial network, and continuing to perform the second training step based on the initial generative adversarial network with the updated model parameters.
6. The method according to any one of claims 3 to 5, characterized in that the preset training end condition comprises at least one of the following: a function value of a preset loss function calculated based on the audio frame sequence corresponding to the audio frame is less than or equal to a first preset value; a function value of the preset loss function calculated based on an audio frame sequence not corresponding to the audio frame is greater than or equal to a second preset value.
7. The method according to any one of claims 2 to 5, characterized in that the second sub-model is an encoder and the third sub-model is a decoder corresponding to the encoder.
8. The method according to any one of claims 1 to 5, characterized in that the audio frame sequence corresponding to the audio frame comprises the audio frame and a preset number of consecutive audio frames preceding the audio frame in the target audio.
9. A digital human video generation device, characterized in that the device comprises: an acquisition unit configured to acquire a target audio and a target face image, wherein the target face image is a frame of face image extracted from a video; an input unit configured to, for an audio frame in the target audio, input an audio frame sequence corresponding to the audio frame and a target area image in the target face image into a pre-trained end-to-end model to generate a target image corresponding to the audio frame, wherein the audio frame sequence corresponding to the audio frame is a sequence of consecutive audio frames in the target audio containing the audio frame, the target area image is the area image of the target face image excluding the mouth area image, and the target image corresponding to the audio frame is used to indicate that the person indicated by the target face image emits the audio indicated by the audio frame; and a generation unit configured to generate a digital human video based on the generated target image; wherein the end-to-end model is trained by: acquiring video data; extracting audio frames and face images corresponding to the audio frames from the video data, taking the audio frame sequence corresponding to an extracted audio frame as sample audio, and taking the extracted face image as a sample face image; acquiring, with a machine learning algorithm, an initial generative adversarial network, wherein the initial generative adversarial network comprises a first sub-model, a second sub-model, a third sub-model and a fourth sub-model, the input data of the fourth sub-model is a first hidden vector, and the output data of the fourth sub-model is mouth key points; and performing the following first training step: inputting the sample audio into the first sub-model included in the initial generative adversarial network to obtain a first hidden vector corresponding to the sample audio; inputting the first hidden vector corresponding to the sample audio into the fourth sub-model to obtain predicted mouth key points corresponding to the sample audio; calculating a first function value of a first preset loss function based on the predicted mouth key points corresponding to the sample audio and the mouth key points extracted from the sample face image corresponding to the sample audio; and, if the calculated first function value is less than or equal to a first preset threshold, determining the model parameters of the first sub-model included in the current initial generative adversarial network as the model parameters of the first sub-model included in the trained end-to-end model.
10. An electronic device, characterized by comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method according to any one of claims 1 to 8.
11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1 to 8 is implemented.
CN202111169280.5A 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium Active CN113987269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111169280.5A CN113987269B (en) 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111169280.5A CN113987269B (en) 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113987269A CN113987269A (en) 2022-01-28
CN113987269B true CN113987269B (en) 2025-02-14

Family

ID=79737725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111169280.5A Active CN113987269B (en) 2021-09-30 2021-09-30 Digital human video generation method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113987269B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278297B (en) * 2022-06-14 2023-11-28 北京达佳互联信息技术有限公司 Data processing method, device, equipment and storage medium based on drive video
CN115619897A (en) * 2022-10-14 2023-01-17 北京字跳网络技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115937375B (en) * 2023-01-05 2023-09-29 深圳市木愚科技有限公司 Digital split synthesis method, device, computer equipment and storage medium
CN118870042A (en) * 2023-04-28 2024-10-29 南京硅基智能科技有限公司 A method and device for generating live video, electronic equipment and storage medium
CN117372553B (en) * 2023-08-25 2024-05-10 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal
CN117593473B (en) * 2024-01-17 2024-06-18 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797897A (en) * 2020-06-03 2020-10-20 浙江大学 A deep learning-based audio-generated face image method
CN112330781A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model and generating human face animation
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7472063B2 (en) * 2002-12-19 2008-12-30 Intel Corporation Audio-visual feature fusion and support vector machine useful for continuous speech recognition
CN111429885B (en) * 2020-03-02 2022-05-13 北京理工大学 A method for mapping audio clips to face and mouth keypoints
CN111432233B (en) * 2020-03-20 2022-07-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797897A (en) * 2020-06-03 2020-10-20 浙江大学 A deep learning-based audio-generated face image method
CN112330781A (en) * 2020-11-24 2021-02-05 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model and generating human face animation
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics

Also Published As

Publication number Publication date
CN113987269A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN113987269B (en) Digital human video generation method, device, electronic device and storage medium
WO2021103698A1 (en) Face swapping method, device, electronic apparatus, and storage medium
CN112333179B (en) Live broadcast method, device and equipment of virtual video and readable storage medium
US10776977B2 (en) Real-time lip synchronization animation
CN109068174B (en) Video frame rate up-conversion method and system based on cyclic convolution neural network
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
KR20210124312A (en) Interactive object driving method, apparatus, device and recording medium
WO2023173890A1 (en) Real-time voice recognition method, model training method, apparatus, device, and storage medium
JP2022172173A (en) Image editing model training method and device, image editing method and device, electronic apparatus, storage medium and computer program
CN111401101A (en) Video generation system based on portrait
CN113282791B (en) Video generation method and device
CN112733616A (en) Dynamic image generation method and device, electronic equipment and storage medium
CN117115317B (en) Avatar driving and model training method, apparatus, device and storage medium
CN117636481B (en) A multimodal joint gesture generation method based on diffusion model
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN113886644A (en) Digital human video generation method and device, electronic equipment and storage medium
CN113763232B (en) Image processing method, device, equipment and computer readable storage medium
CN113689527B (en) Training method of face conversion model and face image conversion method
CN115376482B (en) Face action video generation method and device, readable medium and electronic equipment
CN118644596B (en) Face key point moving image generation method and related equipment
CN114567693A (en) Video generation method and device and electronic equipment
WO2024066549A1 (en) Data processing method and related device
CN114820891A (en) Lip shape generating method, device, equipment and medium
KR102514580B1 (en) Video transition method, apparatus and computer program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant