
CN112562720B - Lip-sync video generation method, device, equipment and storage medium - Google Patents

Lip-sync video generation method, device, equipment and storage medium Download PDF

Info

Publication number
CN112562720B
CN112562720B · CN202011372011.4A
Authority
CN
China
Prior art keywords
image
data
lip
network
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011372011.4A
Other languages
Chinese (zh)
Other versions
CN112562720A (en)
Inventor
李�权
王伦基
叶俊杰
成秋喜
胡玉针
李嘉雄
朱杰
刘华清
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc and Research Institute Of Tsinghua Pearl River Delta
Priority claimed from CN202011372011.4A
Publication of CN112562720A
Application granted
Publication of CN112562720B
Legal status: Active (granted)


Classifications

    • G10L 21/10 — Transforming speech into visible information (transformation of speech into a non-audible representation)
    • G06V 40/161 — Human faces: detection; localisation; normalisation
    • G10L 21/14 — Transforming into visible information by displaying frequency domain information
    • G10L 21/18 — Details of the transformation process
    • G10L 25/57 — Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals
    • G10L 2021/105 — Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a lip-sync video generation method, device, equipment and storage medium. After original video data are obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data. A generating network, a lip synchronous judging network and an image quality judging network are then trained from the first data and the second data, and a character lip generating model is constructed from these three networks. Finally, input sequence pictures are processed by the character lip generating model to generate lip-synchronized image data. The invention can accurately generate the lip image of a person speaking in a video and can be widely applied in the technical field of video data processing.

Description

Lip-sync video generation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of video data processing, in particular to a lip-sync video generation method, a lip-sync video generation device, lip-sync video generation equipment and a storage medium.
Background
With the increasing diversity of video content, new demands are placed on video creation, and enabling these videos to be viewed in different languages has become a critical problem that urgently needs to be solved. Examples include a series of lectures, a major news broadcast, an excellent movie, or even an entertaining animation. If such videos are translated into the desired target language, viewers in more language environments can better watch and access them. The key problem in translating a talking-face video, or in creating a new video in this way, is to correct the mouth shape so that it matches the target speech.
Some current techniques can achieve character lip generation only for a specific character, in still images or in videos whose motion and background do not change substantially from what was seen in training. In face videos of unrestricted speakers with complex, dynamic backgrounds, however, the lip motion of an arbitrary identity cannot be changed accurately, so the lip region of the person in the video is not synchronized with the new audio.
Disclosure of Invention
In view of this, embodiments of the present invention provide a lip-sync video generation method, apparatus, device and storage medium with high generation accuracy.
One aspect of the present invention provides a lip-sync video generation method, including:
acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data, wherein the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
and processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the method further comprises preprocessing the voice data and the image data in the original video data.
Specifically, the preprocessing of the voice data in the original video data comprises:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum.
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
In some embodiments, the generating network comprises a sound encoder, an image encoder and an image decoding generator;
the sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features.
In some embodiments, the objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
In some embodiments, the input sequence pictures are provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, face lip keypoint contour constraints, head contour constraints, and background constraints.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
The face detection module is used for performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
The training module is used for training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
The construction module is used for constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
And the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the apparatus further comprises a preprocessing module;
The preprocessing module is used for:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum;
and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
Another aspect of the invention also provides an electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
After the original video data are obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data. A generating network, a lip synchronous judging network and an image quality judging network are then trained according to the first data and the second data, and a character lip generating model is constructed from them. Finally, input sequence pictures are processed through the character lip generating model to generate lip-synchronized image data. The invention can therefore accurately generate the lip image of a person speaking in a video.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an overall step diagram of a lip-sync video generating method according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In view of the problems in the prior art, the invention studies character lip generation and voice matching, so that the face and lips of any speaker can be matched with any target voice, including real and synthesized voice. Moreover, since real-world video contains rapidly changing pose, scale and illumination, the generated face must also be fused seamlessly back into the original target video.
The invention first uses an end-to-end model to encode the sound and the video images, and then generates, by decoding, lip images matched with the sound. The invention also employs a strong lip synchronization discriminator, which can accurately judge the synchronization accuracy and the realism of the generated lip movement relative to the voice and is used to guide the generation of better-synchronized lips; and a high-quality image quality discriminator, which can accurately judge whether an image is real and assess its quality and is used to guide the generation of more lifelike lip images. Extensive quantitative and subjective human evaluations show that the invention greatly outperforms current methods on many benchmarks.
The embodiment of the invention provides a lip-sync video generation method, as shown in fig. 1, comprising the following steps:
S1, acquiring original video data, wherein the original video data comprise voice data and image data of people in different scenes;
In the embodiment of the invention, the voice data in the video is mixed voice data of multiple speakers and multiple languages, the image data in the video is talking-face data under various scenes, scales and illumination conditions, and the video resolution is as high as 1080p.
S2, performing character labeling on voice data in the original video data to obtain first data, wherein the first data are used for determining the position of a face corresponding to each piece of voice data in a video image;
Specifically, the embodiment of the invention divides the video, through labeling, into a plurality of short segments in which the voice matches the speaker, and stores them. The collected data are labeled by matching voice to speaker: the position of the speaker's face corresponding to each segment of voice is labeled in the video image, while the voice and video durations are kept synchronized.
S3, carrying out face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
Specifically, the embodiment of the invention performs face detection on each frame of the labeled video segment, obtains the position of the face in each frame, and extends the detected face box by 5-50 pixels toward the chin, which ensures that the face detection box covers the whole face. Each frame's face image is then cropped and stored using the adjusted face detection box, and the voice data of the video segment is stored as well.
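The face-crop step above can be sketched as follows; the fixed 30-pixel chin margin, the box format and the helper name are illustrative assumptions rather than the patented implementation:

```python
# Illustrative sketch: extend a detected face box toward the chin before cropping.
# The margin value and helper name are assumptions, not the patent's parameters.
import numpy as np

def crop_face(frame: np.ndarray, box, chin_margin: int = 30) -> np.ndarray:
    """Crop a face from `frame` after extending the box toward the chin (5-50 px)."""
    x1, y1, x2, y2 = box                     # (left, top, right, bottom) from any face detector
    h = frame.shape[0]
    y2 = min(h, y2 + chin_margin)            # extend downward so the box covers the whole chin
    return frame[y1:y2, x1:x2]
```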
S4, training to obtain a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating a figure lip image, the lip synchronous judging network is used for judging the synchronicity of the figure lip and the figure audio, and the image quality judging network is used for judging the true and false of the generated image and the quality of the generated image;
s5, constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
It should be noted that the embodiment of the present invention constructs a high-definition character lip generating model based on a conditional GAN (generative adversarial network). The overall model is divided into two parts: a high-definition character image generating network and a judging network. The generating network is mainly used to generate high-definition character lip images; its inputs are the preprocessed conditional mask, the reference frames and the audio, and its output is a high-definition character lip image frame synchronized with the audio. The judging network is used during model training; its role is to judge whether the generated character image is real and whether the lips are synchronized with the audio. After computing the difference between the generated image and the real image, and the synchronization value between the generated lips and the real lips, it feeds the loss back to the generating network to optimize the image quality and lip synchronization quality of the generating network.
S6, processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, before the training step of step S4, the method further includes preprocessing the voice data and the image data in the original video data.
Specifically, the preprocessing of the voice data in the original video data comprises:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum.
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
The embodiment of the invention preprocesses the sound and the images separately before inputting them into the conditional GAN network model. Sound preprocessing normalizes the audio data and then converts the audio waveform data into a sound spectrogram, including but not limited to a mel spectrum, a linear spectrum and the like. Image preprocessing sets the lower, lip-containing half of each frame in the video sequence to be generated to 0, so that the generating network generates the completed lip image; at the same time, the same number of reference frames as generated frames are selected to encode character feature information, which gives a better generation result. In addition, to preserve the correlation between adjacent frames of the generated video, the invention uses different numbers of input sequence frames during training, so that the generating network learns the relationship between preceding and following frames and the generated video becomes smoother and more natural; the number of frames in the generated video sequence can be chosen as 1, 3, 5, 7, 9 and so on according to the generation requirements of different video scenes and characters.
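A minimal preprocessing sketch is given below, assuming librosa for the spectrogram and NumPy arrays of shape (T, H, W, 3) for the video frames; the sample rate, mel-bin count and random reference-frame selection are illustrative assumptions, not the patent's settings:

```python
# Preprocessing sketch: waveform normalization, log-mel spectrogram, lower-half
# masking of frames, and reference-frame selection. Parameter values are assumed.
import numpy as np
import librosa

def audio_to_mel(wav_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    wav, _ = librosa.load(wav_path, sr=sr)
    wav = wav / (np.max(np.abs(wav)) + 1e-8)                       # normalize the waveform
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                                # log-mel sound spectrogram

def mask_lower_half(frames: np.ndarray) -> np.ndarray:
    """Set the lower, lip-containing half of every frame to 0 so the generator must complete it."""
    masked = frames.copy()
    masked[:, frames.shape[1] // 2:, :, :] = 0
    return masked

def pick_reference_frames(frames: np.ndarray, num: int) -> np.ndarray:
    """Select as many reference frames as generated frames to encode character identity."""
    idx = np.random.choice(len(frames), size=num, replace=len(frames) < num)
    return frames[idx]
```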
In some embodiments, the generating network comprises a sound encoder, an image encoder and an image decoding generator;
the sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features.
Specifically, the generating network of the embodiment of the present invention can be divided into a sound encoder, an image encoder and an image decoding generator. First, the preprocessed sound spectrogram is input into the sound encoder, and sound features are extracted through convolutional encoding. The preprocessed image sequence data is likewise input into the image encoder, and image features are extracted through convolutional encoding; the input image resolution includes, but is not limited to, 96x96, 128x128, 256x256, 512x512 and so on. The extracted sound and image features are then input into the image decoding generator, which finally generates a character lip image synchronized with the sound; according to different generation requirements, the generated image resolution may include, but is not limited to, 96x96, 128x128, 256x256, 512x512 and so on.
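The encoder-decoder structure can be sketched in PyTorch as below; the channel widths, the 96x96 resolution, the 6-channel stacking of masked and reference frames, and the feature-fusion scheme are illustrative placeholders and not the patented high-definition network:

```python
# Architecture sketch of a sound encoder + image encoder + image decoding generator.
# All layer sizes are assumptions chosen so the sketch runs at 96x96 input/output.
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=2):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))

class LipGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # sound encoder: log-mel spectrogram chunk -> audio feature vector
        self.sound_enc = nn.Sequential(conv_block(1, 32), conv_block(32, 64),
                                       conv_block(64, 128), nn.AdaptiveAvgPool2d(1))
        # image encoder: masked frame + reference frame stacked on channels (6 ch)
        self.image_enc = nn.Sequential(conv_block(6, 32), conv_block(32, 64),
                                       conv_block(64, 128))
        # image decoding generator: fused audio/image features -> completed lip image
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, mel, masked_and_ref):
        a = self.sound_enc(mel)                              # (B, 128, 1, 1)
        v = self.image_enc(masked_and_ref)                   # (B, 128, H/8, W/8)
        a = a.expand(-1, -1, v.size(2), v.size(3))           # broadcast the audio code spatially
        return self.decoder(torch.cat([a, v], dim=1))        # lip image synchronized with the audio
```

In this sketch a forward pass would take a (B, 1, 80, T) mel chunk and a (B, 6, 96, 96) tensor of masked-plus-reference frames and return a (B, 3, 96, 96) image.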
Specifically, the judging network can be divided into a lip synchronous judging network and an image quality judging network. During training, their role is to check the image quality and lip synchronization produced by the generating network and to output an image quality judgment value and a lip synchronization judgment value that guide the generating network toward sharper, more realistic images and more accurately synchronized lips. The lip synchronous judging network is a pre-trained network: its inputs are the audio of the current frame and the corresponding generated image frame, and its output is the degree of synchronization between each generated lip image and the corresponding audio; the discriminator's feedback value guides the network, during training, toward generating lip images that are better synchronized with the sound. The image quality judging network is trained together with the generating network: its inputs are the generated image and the real image, and its output is the probability that the image is real, which is used to judge the quality of the generated image and to guide the generating network toward more realistic images during training.
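One common way to score lip-audio synchronization, shown below as a sketch, compares embeddings of the audio chunk and the lip crop; the cosine-similarity probe and the BCE loss form are assumptions in the spirit of SyncNet-style discriminators, not the patent's exact network:

```python
# Sketch of a lip-sync scoring head over pre-computed audio and lip embeddings.
# The similarity-to-probability mapping and loss form are illustrative assumptions.
import torch
import torch.nn.functional as F

def sync_probability(audio_emb: torch.Tensor, lip_emb: torch.Tensor) -> torch.Tensor:
    """Map the cosine similarity of audio and lip embeddings to a sync score in [0, 1]."""
    sim = F.cosine_similarity(audio_emb, lip_emb, dim=1)
    return (sim + 1) / 2

def sync_loss(audio_emb: torch.Tensor, lip_emb: torch.Tensor) -> torch.Tensor:
    """L_sync-style penalty: push generated frames toward 'synchronized' under the pre-trained scorer."""
    p = sync_probability(audio_emb, lip_emb).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(p, torch.ones_like(p))
```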
In some embodiments, the objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
Specifically, the overall loss Loss in the formula is a weighted sum of the image L_1 loss, the lip audio-video synchronization loss and the image quality loss. S_w and S_g are the weight coefficients with which the lip synchronization discriminator and the image quality discriminator affect the overall loss, and the weight of each discriminator on the overall image generation can be adjusted as required. In the GAN loss, the discriminating network D iteratively maximizes the objective function, while the generating network G iteratively minimizes the image L_1 loss, the lip audio-video synchronization loss and the image quality loss, thereby ensuring that lip images with clearer detail are generated.
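The weighted sum can be written directly as a small helper; the weight values 0.3 and 0.1 below are illustrative, and since the text labels the reconstruction term L_1 while describing it as a mean-squared error, the exact pixel loss (L1 vs. MSE) is left as an assumption:

```python
# Sketch of the overall loss: Loss = (1 - S_w - S_g)*L_1 + S_w*L_sync + S_g*L_gen.
# Weight values and the choice of MSE for the pixel term are assumptions.
import torch.nn.functional as F

def total_loss(real, fake, l_sync, l_gen, s_w: float = 0.3, s_g: float = 0.1):
    l_1 = F.mse_loss(fake, real)                 # pixel reconstruction term between real and generated images
    return (1 - s_w - s_g) * l_1 + s_w * l_sync + s_g * l_gen
```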
In some embodiments, the input sequence pictures are provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, face lip keypoint contour constraints, head contour constraints, and background constraints.
Specifically, to generate realistic character lip images, the input data is a sequence of pictures with label constraints; the constraints may be variable-size edge pixel contours, face lip keypoint contour constraints, head contours, and the background. By including these constraints in the picture, finer control over the generated content is possible and a more controllable high-definition image can be generated. New input constraints can also be added for new requirements that arise in subsequent use, so that the generated content can be extended and enriched as needed.
In summary, the invention can generate a high-definition character video matched with the sound simply from the input sound and the video to be translated, and can serve as a general high-definition video translation generation framework. In particular, the invention trains an accurate lip synchronization discriminator that can be used to guide the generating network to generate accurate, natural lip movements. High-definition face images that differ in appearance and match the sound can be generated for different application fields (public news, lectures and education, film and television, and so on). The invention generates everything automatically from scratch, without requiring each video to be recorded by a real person, so production is faster and extension is richer.
Compared with the prior art, the invention provides a new lip generation and synchronization model for video characters, which can generate a lip-synchronized face video of any speaker from any voice, and which is more accurate and generalizes better than the lips generated by other current works.
The invention also provides a new lip synchronization judging model, so that lip synchronization can be accurately judged in videos of various complex environments.
The model of the invention does not depend on training with specific data; it is a speaker-independent generation model and can generate lips matched with the voice even for persons whose lip data did not appear in training.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
The face detection module is used for performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
The training module is used for training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating character lip images, the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
The construction module is used for constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
And the generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data.
In some embodiments, the apparatus further comprises a preprocessing module;
The preprocessing module is used for:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum;
and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
Another aspect of the invention also provides an electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (7)

1. A lip-synchronized video generation method, comprising:
Acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
performing character labeling on voice data in the original video data to obtain first data, wherein the first data is used for determining the position of a face corresponding to each piece of voice data in a video image;
Performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
Training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating character lip images, the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data;
Wherein the generating network comprises a sound encoder, an image encoder and an image decoding generator;
The sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features;
The objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
2. The lip sync video generating method according to claim 1, further comprising preprocessing voice data and image data in the original video data;
specifically, the preprocessing of the voice data in the original video data comprises:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum;
the preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
3. The lip-sync video generation method according to claim 1, wherein the input sequence pictures are provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, face lip keypoint contour constraints, head contour constraints, and background constraints.
4. A lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position, in the video image, of the face corresponding to each segment of voice data;
The face detection module is used for performing face detection on the labeled original video data to obtain second data, wherein the second data is used for determining the position of the face in each frame of image;
The training module is used for training a generating network, a lip synchronous judging network and an image quality judging network according to the first data and the second data; the generating network is used for generating character lip images, the lip synchronous judging network is used for judging the synchronicity between the person's lips and the person's audio, and the image quality judging network is used for judging whether the generated image is real and assessing its quality;
The construction module is used for constructing a character lip generating model according to the generating network, the lip synchronous judging network and the image quality judging network;
The generating module is used for processing the input sequence pictures through the character lip generating model to generate lip-synchronized image data;
Wherein the generating network comprises a sound encoder, an image encoder and an image decoding generator;
The sound encoder is used for extracting, through convolutional encoding, sound features in the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the sequence frames of the image data obtained by preprocessing;
the image decoding generator is used for generating a lip image of the person according to the sound features and the image features;
The objective loss function of the character lip generating model is:
Loss = (1 - S_w - S_g)·L_1 + S_w·L_sync + S_g·L_gen
where S_w is the weight of the lip synchronous judging network in the overall loss value; S_g is the weight of the image quality judging network in the overall loss value; Loss is the overall loss of the character lip generating model; L_1 is the mean squared error between the real image and the generated image; L_sync is the loss on the audio-video synchronization rate of the generated lip; and L_gen is the loss with which the image quality judging network discriminates between the real image and the generated image.
5. The lip sync video generating apparatus as defined in claim 4, further comprising a preprocessing module;
The preprocessing module is used for:
normalizing the voice data to obtain audio waveform data;
converting the audio waveform data into a sound spectrogram, wherein the spectrogram includes, but is not limited to, a mel spectrum and a linear spectrum; and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generating network generates a completed lip image;
and determining reference frames equal in number to the sequence frames, wherein the reference frames are used for encoding character feature information.
6. An electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executing the program to implement the method of any one of claims 1-3.
7. A computer readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method of any one of claims 1-3.
CN202011372011.4A 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium Active CN112562720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011372011.4A CN112562720B (en) 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011372011.4A CN112562720B (en) 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112562720A CN112562720A (en) 2021-03-26
CN112562720B true CN112562720B (en) 2024-07-12

Family

ID=75045329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011372011.4A Active CN112562720B (en) 2020-11-30 2020-11-30 Lip-sync video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112562720B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338959A (en) * 2021-04-15 2022-04-12 西安汉易汉网络科技股份有限公司 End-to-end text-to-video video synthesis method, system medium and application
CN113179449B (en) * 2021-04-22 2022-04-12 清华珠三角研究院 Method, system, device and storage medium for driving image by voice and motion
CN113192161B (en) * 2021-04-22 2022-10-18 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN113194348B (en) * 2021-04-22 2022-07-22 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
CN113362471A (en) * 2021-05-27 2021-09-07 深圳市木愚科技有限公司 Virtual teacher limb action generation method and system based on teaching semantics
CN113542624A (en) * 2021-05-28 2021-10-22 阿里巴巴新加坡控股有限公司 Method and device for generating commodity object explanation video
CN113380269B (en) * 2021-06-08 2023-01-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113242361B (en) * 2021-07-13 2021-09-24 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113628635B (en) * 2021-07-19 2023-09-15 武汉理工大学 Voice-driven speaker face video generation method based on teacher student network
WO2023035969A1 (en) * 2021-09-09 2023-03-16 马上消费金融股份有限公司 Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN113987269B (en) * 2021-09-30 2025-02-14 深圳追一科技有限公司 Digital human video generation method, device, electronic device and storage medium
CN113891079A (en) * 2021-11-11 2022-01-04 深圳市木愚科技有限公司 Automatic teaching video generation method, device, computer equipment and storage medium
CN114071204B (en) * 2021-11-16 2024-05-03 湖南快乐阳光互动娱乐传媒有限公司 Data processing method and device
CN114220172B (en) * 2021-12-16 2025-04-25 云知声智能科技股份有限公司 A method, device, electronic device and storage medium for lip movement recognition
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN114550720A (en) * 2022-03-03 2022-05-27 深圳地平线机器人科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114663962B (en) * 2022-05-19 2022-09-16 浙江大学 Lip-shaped synchronous face counterfeiting generation method and system based on image completion
CN114998489A (en) * 2022-05-26 2022-09-02 中国平安人寿保险股份有限公司 Virtual character video generation method, device, computer equipment and storage medium
CN115345968B (en) * 2022-10-19 2023-02-07 北京百度网讯科技有限公司 Virtual object driving method, deep learning network training method and device
CN115376211B (en) * 2022-10-25 2023-03-24 北京百度网讯科技有限公司 Lip driving method, lip driving model training method, device and equipment
CN115580743A (en) * 2022-12-08 2023-01-06 成都索贝数码科技股份有限公司 Method and system for driving human mouth shape in video
CN116248974A (en) * 2022-12-29 2023-06-09 南京硅基智能科技有限公司 A method and system for video language conversion
CN116433807B (en) * 2023-04-21 2024-08-23 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116188637B (en) * 2023-04-23 2023-08-15 世优(北京)科技有限公司 Data synchronization method and device
CN116741198B (en) * 2023-08-15 2023-10-20 合肥工业大学 A lip synchronization method based on multi-scale dictionary
CN117150089B (en) * 2023-10-26 2023-12-22 环球数科集团有限公司 A character art image changing system based on AIGC technology
CN119028369B (en) * 2024-07-30 2025-06-17 浙江大学金华研究院 Face video generation method based on audio-driven face dialogue generation model
CN119211659B (en) * 2024-11-26 2025-06-20 杭州秋果计划科技有限公司 Stylized digital human video generation method, electronic device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108347578B (en) * 2017-01-23 2020-05-08 腾讯科技(深圳)有限公司 Method and device for processing video image in video call
US11003995B2 (en) * 2017-05-19 2021-05-11 Huawei Technologies Co., Ltd. Semi-supervised regression with generative adversarial networks
CN107767325A (en) * 2017-09-12 2018-03-06 深圳市朗形网络科技有限公司 Video processing method and device
CN109819313B (en) * 2019-01-10 2021-01-08 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
US11017506B2 (en) * 2019-05-03 2021-05-25 Amazon Technologies, Inc. Video enhancement using a generator with filters of generative adversarial network
CN110706308B (en) * 2019-09-07 2020-09-25 创新奇智(成都)科技有限公司 GAN-based steel coil end face edge loss artificial sample generation method
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111261187B (en) * 2020-02-04 2023-02-14 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111325817B (en) * 2020-02-04 2023-07-18 清华珠三角研究院 Virtual character scene video generation method, terminal equipment and medium
CN111783603B (en) * 2020-06-24 2025-05-09 有半岛(北京)信息科技有限公司 Generative adversarial network training method, image face swapping, video face swapping method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement

Also Published As

Publication number Publication date
CN112562720A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112562720B (en) Lip-sync video generation method, device, equipment and storage medium
CN113192161B (en) Virtual human image video generation method, system, device and storage medium
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN112562721B (en) Video translation method, system, device and storage medium
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
Cao et al. Expressive speech-driven facial animation
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
Sargin et al. Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation
Zhou et al. An image-based visual speech animation system
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
CN113077537A (en) Video generation method, storage medium and equipment
JP2009533786A (en) Self-realistic talking head creation system and method
CN115761075A (en) Face image generation method, device, equipment, medium and product
EP4010899A1 (en) Audio-driven speech animation using recurrent neutral network
Liao et al. Speech2video synthesis with 3d skeleton regularization and expressive body poses
US7388586B2 (en) Method and apparatus for animation of a human speaker
CN116828129B (en) Ultra-clear 2D digital person generation method and system
CN117834935A (en) Digital person live broadcasting method and device, electronic equipment and storage medium
US20250140257A1 (en) Systems and methods for improved lip dubbing
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
CN114793300A (en) Virtual video customer service robot synthesis method and system based on generation countermeasure network
Kadam et al. A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation.
Rafiei Oskooei et al. Can One Model Fit All? An Exploration of Wav2Lip’s Lip-Syncing Generalizability Across Culturally Distinct Languages
CN119418713A (en) Speech-driven digital human construction method and device based on enhanced deformable convolution and spatiotemporal motion compensation
WO2024234089A1 (en) Improved generative machine learning architecture for audio track replacement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant