CN112562720B - Lip-sync video generation method, device, equipment and storage medium - Google Patents
Lip-sync video generation method, device, equipment and storage medium
- Publication number
- CN112562720B (application CN202011372011.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- data
- lip
- network
- generating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L21/14—Transforming into visible information by displaying frequency domain information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/18—Details of the transformation process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Abstract
The invention discloses a lip-sync video generation method, apparatus, device and storage medium. The method comprises the following steps: after original video data are obtained, character labeling is performed on the voice data in the original video data to obtain first data; face detection is performed on the labeled original video data to obtain second data; a generation network, a lip-sync discrimination network and an image quality discrimination network are then trained from the first data and the second data; a character lip generation model is constructed from the generation network, the lip-sync discrimination network and the image quality discrimination network; finally, the input picture sequence is processed by the character lip generation model to generate lip-synchronized image data. The invention can accurately generate the lip images of a person speaking in a video and can be widely applied to the technical field of video data processing.
Description
Technical Field
The invention relates to the technical field of video data processing, and in particular to a lip-sync video generation method, apparatus, device and storage medium.
Background
With the increasing diversity of video content, new demands are placed on its creation, and making these videos watchable in different languages has become a pressing problem: for example, a series of lectures, a major news broadcast, an excellent film, or an entertaining animation. If such videos are translated into the desired target language, viewers in more language environments can watch and access them. The key problem in translating a talking-face video, or in creating a new one in this way, is to correct the mouth shape so that it matches the target speech.
Some current techniques can achieve character lip generation only for a specific character seen during training, whose motion and background do not change in complex ways in the still image or video. In unconstrained speaker-face videos with complex, dynamic backgrounds, however, they cannot accurately alter the lip motion of an arbitrary identity, so the lip region of the person in the video is not synchronized with the new audio.
Disclosure of Invention
In view of this, embodiments of the present invention provide a lip-sync video generation method, apparatus, device and storage medium with high accuracy.
One aspect of the present invention provides a lip-sync video generation method, including:
Acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
performing character labeling on voice data in the original video data to obtain first data, wherein the first data is used for determining the position of a face corresponding to each piece of voice data in a video image;
Performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
training to obtain a generation network, a lip-sync discrimination network and an image quality discrimination network according to the first data and the second data, wherein the lip-sync discrimination network is used for judging the synchronicity between the character's lips and the character's audio, and the image quality discrimination network is used for judging whether a generated image is real and assessing its quality;
constructing a character lip generation model according to the generation network, the lip-sync discrimination network and the image quality discrimination network;
and processing the input picture sequence through the character lip generation model to generate lip-synchronized image data.
In some embodiments, the method further comprises preprocessing the voice data and the image data in the original video data;
specifically, the preprocessing of the voice data in the original video data includes:
Carrying out normalization processing on the voice data to obtain audio waveform data;
Converting the audio waveform data into a sound spectrogram, wherein the spectrogram comprises, but is not limited to, a mel frequency spectrum and a linear frequency spectrum;
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generation network generates a completed lip image;
and determining reference frames with the same number as the sequence frames, wherein the reference frames are used for encoding character characteristic information.
In some embodiments, the generation network comprises an audio encoder, an image encoder and an image decoding generator;
the audio encoder is used for extracting, through convolutional encoding, audio features of the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the preprocessed sequence frames of the image data;
the image decoding generator is used for generating character lip images according to the audio features and the image features.
In some embodiments, the objective loss function of the character lip generation model is:
Loss = (1 − Sw − Sg) · L1 + Sw · Lsync + Sg · Lgen
where Sw is the weight with which the lip-sync discrimination network affects the overall loss value; Sg is the weight with which the image quality discrimination network affects the overall loss value; Loss is the overall loss of the character lip generation model; L1 is the mean-square-error loss between the real image and the generated image; Lsync is the loss on the audio-video synchronization rate of the generated lip video; and Lgen is the loss of the image discrimination network in discriminating between real and generated images.
In some embodiments, the input picture sequence is provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, facial lip keypoint contour constraints, head contour constraints, and background constraints.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, including:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position of the face corresponding to each segment of voice data in the video images;
The face detection module is used for carrying out face detection on the marked original video data to obtain second data, and the second data is used for determining the position of a face in each frame of image;
The training module is used for training to obtain a generation network, a lip-sync discrimination network and an image quality discrimination network according to the first data and the second data, wherein the lip-sync discrimination network is used for judging the synchronicity between the character's lips and the character's audio, and the image quality discrimination network is used for judging whether a generated image is real and assessing its quality;
the building module is used for building a character lip generation model according to the generation network, the lip-sync discrimination network and the image quality discrimination network;
and the generating module is used for processing the input picture sequence through the character lip generation model to generate lip-synchronized image data.
In some embodiments, a preprocessing module is further included;
The preprocessing module is used for:
Carrying out normalization processing on the voice data to obtain audio waveform data;
Converting the audio waveform data into a sound spectrogram, wherein the spectrogram comprises, but is not limited to, a mel frequency spectrum and a linear frequency spectrum;
And
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generation network generates a completed lip image;
and determining reference frames with the same number as the sequence frames, wherein the reference frames are used for encoding character characteristic information.
Another aspect of the invention also provides an electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
After the original video data are obtained, character labeling is performed on the voice data in the original video data to obtain first data, and face detection is performed on the labeled original video data to obtain second data; a generation network, a lip-sync discrimination network and an image quality discrimination network are then trained from the first data and the second data; a character lip generation model is constructed from the generation network, the lip-sync discrimination network and the image quality discrimination network; finally, the input picture sequence is processed by the character lip generation model to generate lip-synchronized image data. The invention can accurately generate the lip images of a person speaking in a video.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an overall step diagram of a lip-sync video generating method according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Aiming at the problems in the prior art, the invention addresses character lip generation and voice matching, so that the facial lip shape of any speaker can be matched with any target voice, whether real or synthesized. Moreover, because real-world video contains rapidly changing pose, scale and illumination, the generated face must also be fused seamlessly into the original target video.
The invention first uses an end-to-end model to encode the audio and the video images, and then generates lip images matched with the audio through decoding. At the same time, the invention uses a strong lip-sync discriminator that accurately judges how well the generated lips are synchronized with the speech and how natural the lip motion is, guiding the generator toward better-synchronized lips, and a high-quality image discriminator that accurately judges whether an image is real and how good its quality is, guiding the generator toward more realistic lip images. Extensive quantitative and subjective human evaluations show that the invention substantially outperforms current methods on many benchmarks.
The embodiment of the invention provides a lip-sync video generation method, as shown in fig. 1, comprising the following steps:
S1, acquiring original video data, wherein the original video data comprise voice data and image data of people in different scenes;
In the embodiment of the invention, the voice data in the video is mixed multi-person, multi-language voice data, the image data in the video is talking-face data covering various scenes, scales and illumination conditions, and the video resolution is up to 1080p.
S2, performing character labeling on voice data in the original video data to obtain first data, wherein the first data are used for determining the position of a face corresponding to each piece of voice data in a video image;
Specifically, the embodiment of the invention splits the video, through labeling, into a number of short segments matched to the speaker and stores them. The voice in the collected data is matched and labeled with the speaker: the position of the speaker's face corresponding to each voice segment is labeled in the video images, while the durations of the voice and the video are kept synchronized.
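By way of a non-limiting illustration, the segment-splitting step above could be scripted roughly as follows. The annotation format (start time, end time, speaker id), the output naming, and the use of ffmpeg as the cutting tool are assumptions for the sketch, not details taken from the embodiment.

```python
import subprocess
from pathlib import Path

# Hypothetical annotation format: (start_sec, end_sec, speaker_id) per utterance.
ANNOTATIONS = [(0.0, 4.2, "spk01"), (4.2, 9.8, "spk02")]

def cut_labeled_segments(video_path: str, out_dir: str) -> None:
    """Cut the source video into per-utterance clips, one clip per labeled
    speech segment, keeping audio and video durations aligned."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, (start, end, speaker) in enumerate(ANNOTATIONS):
        clip = out / f"{speaker}_{i:04d}.mp4"
        # Re-encode (no stream copy) so the cut is frame-accurate and the
        # audio and video durations remain synchronized.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
             "-c:v", "libx264", "-c:a", "aac", str(clip)],
            check=True,
        )

# cut_labeled_segments("raw_talk.mp4", "segments/")
```

Re-encoding rather than stream-copying is chosen here only because it keeps the cuts frame-accurate, which matches the stated requirement that voice and video durations stay synchronized.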
S3, carrying out face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
Specifically, the embodiment of the invention performs face detection on each frame of the marked video segment to obtain the position of the face in each frame, and extends the detected face position toward the chin by 5-50 pixels, ensuring that the face detection box covers the whole face. The face image of each frame is then cropped with the adjusted detection box and stored, and the voice data of the video clip is stored as well.
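A minimal sketch of this face cropping, assuming OpenCV's bundled Haar cascade as a stand-in detector (the embodiment does not prescribe a specific detector) and an illustrative 30-pixel chin extension taken from the stated 5-50 pixel range:

```python
import cv2

# Stand-in detector; the patent does not prescribe a specific face detector.
_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame, chin_extend: int = 30):
    """Detect the largest face, extend the box toward the chin by
    `chin_extend` pixels (5-50 per the description), and return the crop."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])   # keep the largest face
    y2 = min(frame.shape[0], y + h + chin_extend)        # extend downward only
    return frame[y:y2, x:x + w]
```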
S4, training to obtain a generation network, a lip-sync discrimination network and an image quality discrimination network according to the first data and the second data; the generation network is used for generating character lip images, the lip-sync discrimination network is used for judging the synchronicity between the character's lips and the character's audio, and the image quality discrimination network is used for judging whether a generated image is real and assessing its quality;
S5, constructing a character lip generation model according to the generation network, the lip-sync discrimination network and the image quality discrimination network;
It should be noted that the embodiment of the present invention builds a high-definition character lip generation model based on a conditional GAN (generative adversarial network). The overall model consists of two parts: a high-definition character image generation network and discrimination networks. The generation network is mainly used for generating high-definition character lip images; its inputs are the preprocessed condition mask, the reference frames and the audio, and its output is a sequence of high-definition character lip image frames synchronized with the audio. The discrimination networks are used during model training: they judge whether the generated character images are real and whether the lips are synchronized with the audio, compute the difference between the generated and real images and the synchronization score between the generated and real lips, and feed the resulting loss back to the generation network to optimize the image quality and the lip-sync quality of the generator.
S6, processing the input picture sequence through the character lip generation model to generate lip-synchronized image data.
In some embodiments, before the training step of step S4, the method further includes: preprocessing voice data and image data in original video data;
specifically, the preprocessing of the voice data in the original video data includes:
Carrying out normalization processing on the voice data to obtain audio waveform data;
Converting the audio waveform data into a sound spectrogram, wherein the spectrogram comprises, but is not limited to, a mel frequency spectrum and a linear frequency spectrum;
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generation network generates a completed lip image;
and determining reference frames with the same number as the sequence frames, wherein the reference frames are used for encoding character characteristic information.
Before the audio and the images are input into the conditional GAN model, the embodiment of the invention preprocesses them separately. Audio preprocessing normalizes the audio data and then converts the audio waveform into a sound spectrogram, including but not limited to a mel spectrogram or a linear spectrogram. Image preprocessing sets the lower, lip-containing half of each frame in the video sequence to be generated to 0, so that the generation network produces the completed lip images, and selects the same number of reference frames as generated frames to encode character feature information, which improves the generation quality. Meanwhile, to preserve the correlation between adjacent frames of the generated video, different numbers of input sequence frames are used during training so that the generation network learns the relationship between consecutive frames, making the generated video smoother and more natural; the number of frames in the generated sequence can be chosen as 1, 3, 5, 7, 9 and so on according to the requirements of different video scenes and characters.
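These preprocessing steps can be sketched as follows; librosa is used here for the mel spectrogram, and the sample rate, mel-bin count and random reference-frame selection are illustrative assumptions rather than values given in the embodiment.

```python
import numpy as np
import librosa

def audio_to_mel(wav_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Normalize the waveform and convert it to a log-mel spectrogram.
    Sample rate and mel-bin count are illustrative, not from the patent."""
    audio, _ = librosa.load(wav_path, sr=sr)
    audio = audio / (np.max(np.abs(audio)) + 1e-8)        # peak normalization
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6)

def mask_lower_half(frames: np.ndarray) -> np.ndarray:
    """Zero out the lower (lip-containing) half of every frame in a
    (T, H, W, C) sequence so the generator must complete the lips."""
    masked = frames.copy()
    masked[:, frames.shape[1] // 2:, :, :] = 0
    return masked

def pick_reference_frames(all_frames: np.ndarray, num: int) -> np.ndarray:
    """Pick as many reference frames as generated frames to encode identity."""
    idx = np.random.choice(len(all_frames), size=num,
                           replace=len(all_frames) < num)
    return all_frames[idx]
```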
In some embodiments, the generation network comprises an audio encoder, an image encoder and an image decoding generator;
the audio encoder is used for extracting, through convolutional encoding, audio features of the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the preprocessed sequence frames of the image data;
the image decoding generator is used for generating character lip images according to the audio features and the image features.
Specifically, the generation network of the embodiment of the present invention can be divided into an audio encoder, an image encoder and an image decoding generator. The preprocessed sound spectrogram is first input into the audio encoder, which extracts audio features through convolutional encoding. The preprocessed image sequence is likewise input into the image encoder, which extracts image features through convolutional encoding; input image resolutions include, but are not limited to, 96x96, 128x128, 256x256, 512x512, etc. The extracted audio and image features are then fed to the image decoding generator, which finally generates character lip images synchronized with the audio; depending on the generation requirements, the generated images may be, but are not limited to, 96x96, 128x128, 256x256, 512x512, etc.
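A hedged PyTorch sketch of this audio-encoder / image-encoder / image-decoder layout is shown below; the layer counts, channel widths, and the choice of stacking one masked frame with one reference frame on the channel axis are assumptions, not the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class LipGenerator(nn.Module):
    """Illustrative generator: encode audio and images, fuse, then decode."""
    def __init__(self):
        super().__init__()
        # Audio encoder: convolutions over a (1, n_mels, T) spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Image encoder: masked frame + reference frame stacked on channels (3 + 3).
        self.image_enc = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: fuse the features and upsample back to image resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + 64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mel, masked_and_ref):
        a = self.audio_enc(mel)                       # (B, 64, 1, 1)
        v = self.image_enc(masked_and_ref)            # (B, 64, H/4, W/4)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast the audio feature
        return self.decoder(torch.cat([a, v], dim=1)) # lip image in [0, 1]
```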
Specifically, the discrimination networks can be divided into a lip-sync discrimination network and an image quality discrimination network. Their role is to assess, during training, the image quality and lip synchronization of the images produced by the generation network, and to provide an image quality score and a lip-sync score that guide the generation network toward sharper, more realistic images and more accurately synchronized lips. The lip-sync discrimination network is a pre-trained network: it takes the audio of the current frame and the corresponding generated image frame as input and outputs the degree of synchronization between each generated lip image and its audio; the feedback it provides during training guides and refines the generation of lip images that are better synchronized with the sound. The image quality discrimination network is trained together with the generation network: it takes generated and real images as input, outputs the probability that an image is real, and is used to judge the quality of the generated images and to guide the generation network toward more realistic images during training.
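One common way to realize such a lip-sync discriminator is to embed the audio window and the mouth-region frames separately and score their agreement, for example with a cosine similarity. The sketch below illustrates this idea only, with placeholder encoder bodies, and is not the pre-trained network described in the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipSyncScorer(nn.Module):
    """Illustrative lip-sync discriminator: embed the audio window and the
    mouth-region frames, then score their agreement with cosine similarity."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Placeholder encoders; a real model would use deeper conv stacks.
        self.audio_net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.face_net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, mel_window, mouth_frames):
        a = F.normalize(self.audio_net(mel_window), dim=1)
        v = F.normalize(self.face_net(mouth_frames), dim=1)
        sync_prob = (torch.sum(a * v, dim=1) + 1) / 2   # map cosine to [0, 1]
        return sync_prob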
In some embodiments, the objective loss function of the character lip generation model is:
Loss = (1 − Sw − Sg) · L1 + Sw · Lsync + Sg · Lgen
where Sw is the weight with which the lip-sync discrimination network affects the overall loss value; Sg is the weight with which the image quality discrimination network affects the overall loss value; Loss is the overall loss of the character lip generation model; L1 is the mean-square-error loss between the real image and the generated image; Lsync is the loss on the audio-video synchronization rate of the generated lip video; and Lgen is the loss of the image discrimination network in discriminating between real and generated images.
Specifically, the overall loss in the formula is a weighted sum of the image L1 loss, the lip-video/audio synchronization loss and the image quality loss. Sw and Sg are the weights with which the lip-sync discriminator and the image quality discriminator, respectively, affect the overall loss, and these weights can be adjusted as needed. In the GAN loss, the discrimination network D iteratively maximizes the objective function while the generation network G iteratively minimizes the image L1 loss, the lip-video/audio synchronization loss and the image quality loss, which ensures that lip images with clearer details are generated.
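The weighted combination can be written directly from the formula; in the sketch below the weight values and the exact form of the per-term losses (a mean-square reconstruction term, as the text defines L1, and negative-log terms fed by the two discriminators) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(real, fake, sync_prob, realism_prob,
               sw: float = 0.03, sg: float = 0.07):
    """Loss = (1 - Sw - Sg) * L1 + Sw * Lsync + Sg * Lgen, per the formula.
    Weight values and per-term loss forms are illustrative choices."""
    l1 = F.mse_loss(fake, real)                      # reconstruction term (the text defines L1 as mean-square error)
    l_sync = -torch.log(sync_prob + 1e-8).mean()     # penalize out-of-sync lips
    l_gen = -torch.log(realism_prob + 1e-8).mean()   # adversarial term from the quality discriminator
    return (1 - sw - sg) * l1 + sw * l_sync + sg * l_gen
```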
In some embodiments, the input picture sequence is provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, facial lip keypoint contour constraints, head contour constraints, and background constraints.
Specifically, in order to generate realistic character lip images, the input data is a picture sequence with label constraints, and the constraints may be variable-size edge pixel contours, facial lip keypoint contours, head contours and a background. By including these constraints in the pictures, the generated content can be controlled at a finer level and more controllable high-definition images can be produced. New input constraints can also be added as new requirements arise in subsequent use, so that the generated content can be extended and enriched on demand.
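As an illustrative assumption of how such label constraints might be supplied to the generator, the constraint maps could be stacked into extra input channels alongside the image, as sketched below; the specific maps and their encoding are not prescribed by the embodiment.

```python
import numpy as np

def build_condition_mask(edge_map, lip_keypoint_map, head_contour_map, background_map):
    """Stack the label-constraint maps described above into one multi-channel
    condition input for the generator. Each argument is assumed to be an
    (H, W) binary or float map; this encoding is an illustrative choice."""
    channels = [edge_map, lip_keypoint_map, head_contour_map, background_map]
    mask = np.stack([np.asarray(c, dtype=np.float32) for c in channels], axis=-1)
    return mask  # shape (H, W, 4), concatenated with the image channels downstream
```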
In summary, the invention can generate high-definition character video matched with the sound from only the input audio and the video to be translated, and can serve as a general high-definition video translation generation framework. In particular, the invention trains an accurate lip-sync discriminator that can be used to guide the generation network to produce accurate, natural lip movements, and it can generate high-definition face images that differ in appearance and match the sound for different application fields (news, lectures and education, film and television, and the like). The invention produces videos entirely through intelligent generation from scratch, without requiring a real person to record each video, and therefore offers faster production and richer means of extension.
Compared with the prior art, the invention provides a novel lip generation and synchronization model for video characters, which can generate a lip-synchronized face video of any speaker from any voice; the generated lips are more accurate and generalize better than those produced by other current works.
The invention also provides a new lip-sync discrimination model to accurately judge lip synchronization in videos of various complex environments.
The model of the invention does not depend on training data of a specific person; it is a speaker-independent generation model that can generate lips matched with the voice even for persons whose lip data never appeared in training.
Another aspect of the present invention also provides a lip-synchronized video generating apparatus, including:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position of the face corresponding to each segment of voice data in the video images;
The face detection module is used for carrying out face detection on the marked original video data to obtain second data, and the second data is used for determining the position of a face in each frame of image;
The training module is used for training to obtain a generation network, a lip-sync discrimination network and an image quality discrimination network according to the first data and the second data; the generation network is used for generating character lip images, and the lip-sync discrimination network is used for judging the synchronicity between the character's lips and the character's audio;
the building module is used for building a character lip generation model according to the generation network, the lip-sync discrimination network and the image quality discrimination network;
and the generating module is used for processing the input picture sequence through the character lip generation model to generate lip-synchronized image data.
In some embodiments, a preprocessing module is further included;
The preprocessing module is used for:
Carrying out normalization processing on the voice data to obtain audio waveform data;
Converting the audio waveform data into a sound spectrogram, wherein the spectrogram comprises, but is not limited to, a mel frequency spectrum and a linear frequency spectrum;
And
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generation network generates a completed lip image;
and determining reference frames with the same number as the sequence frames, wherein the reference frames are used for encoding character characteristic information.
Another aspect of the invention also provides an electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transport the program for use by, or in connection with, the instruction execution system, apparatus or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.
Claims (7)
1. A lip-synchronized video generation method, comprising:
Acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
performing character labeling on voice data in the original video data to obtain first data, wherein the first data is used for determining the position of a face corresponding to each piece of voice data in a video image;
Performing face detection on the marked original video data to obtain second data, wherein the second data is used for determining the position of a face in each frame of image;
training to obtain a generation network, a lip-sync discrimination network and an image quality discrimination network according to the first data and the second data, wherein the generation network is used for generating character lip images, the lip-sync discrimination network is used for judging the synchronicity between the character's lips and the character's audio, and the image quality discrimination network is used for judging whether a generated image is real and assessing its quality;
constructing a character lip generation model according to the generation network, the lip-sync discrimination network and the image quality discrimination network;
processing the input picture sequence through the character lip generation model to generate lip-synchronized image data;
wherein the generation network comprises an audio encoder, an image encoder and an image decoding generator;
the audio encoder is used for extracting, through convolutional encoding, audio features of the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the preprocessed sequence frames of the image data;
the image decoding generator is used for generating character lip images according to the audio features and the image features;
the target loss function of the character lip generation model is:
Loss = (1 − Sw − Sg) · L1 + Sw · Lsync + Sg · Lgen
where Sw is the weight with which the lip-sync discrimination network affects the overall loss value; Sg is the weight with which the image quality discrimination network affects the overall loss value; Loss is the overall loss of the character lip generation model; L1 is the mean-square-error loss between the real image and the generated image; Lsync is the loss on the audio-video synchronization rate of the generated lip video; and Lgen is the loss of the image discrimination network in discriminating between real and generated images.
2. The lip sync video generating method according to claim 1, further comprising preprocessing voice data and image data in the original video data;
specifically, the preprocessing of the voice data in the original video data includes:
Carrying out normalization processing on the voice data to obtain audio waveform data;
Converting the audio waveform data into a sound spectrogram, wherein the spectrogram comprises, but is not limited to, a mel frequency spectrum and a linear frequency spectrum;
The preprocessing of the image data in the original video data comprises:
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generation network generates a completed lip image;
and determining reference frames with the same number as the sequence frames, wherein the reference frames are used for encoding character characteristic information.
3. The lip sync video generation method according to claim 1, wherein the input picture sequence is provided with label constraints;
the label constraints include variable-size edge pixel contour constraints, facial lip keypoint contour constraints, head contour constraints, and background constraints.
4. A lip-synchronized video generating apparatus, comprising:
The acquisition module is used for acquiring original video data, wherein the original video data comprises voice data and image data of people in different scenes;
The voice labeling module is used for performing character labeling on the voice data in the original video data to obtain first data, wherein the first data is used for determining the position of the face corresponding to each segment of voice data in the video images;
The face detection module is used for carrying out face detection on the marked original video data to obtain second data, and the second data is used for determining the position of a face in each frame of image;
The training module is used for training to obtain a generation network, a lip-sync discrimination network and an image quality discrimination network according to the first data and the second data, wherein the generation network is used for generating character lip images, the lip-sync discrimination network is used for judging the synchronicity between the character's lips and the character's audio, and the image quality discrimination network is used for judging whether a generated image is real and assessing its quality;
the building module is used for building a character lip generation model according to the generation network, the lip-sync discrimination network and the image quality discrimination network;
the generation module is used for processing the input picture sequence through the character lip generation model to generate lip-synchronized image data;
wherein the generation network comprises an audio encoder, an image encoder and an image decoding generator;
the audio encoder is used for extracting, through convolutional encoding, audio features of the first data and the second data from the sound spectrogram obtained by preprocessing;
the image encoder is used for extracting, through convolutional encoding, image features from the preprocessed sequence frames of the image data;
the image decoding generator is used for generating character lip images according to the audio features and the image features;
the target loss function of the character lip generation model is as follows:
Loss = (1 − Sw − Sg) · L1 + Sw · Lsync + Sg · Lgen
where Sw is the weight with which the lip-sync discrimination network affects the overall loss value; Sg is the weight with which the image quality discrimination network affects the overall loss value; Loss is the overall loss of the character lip generation model; L1 is the mean-square-error loss between the real image and the generated image; Lsync is the loss on the audio-video synchronization rate of the generated lip video; and Lgen is the loss of the image discrimination network in discriminating between real and generated images.
5. The lip sync video generating apparatus as defined in claim 4, further comprising a preprocessing module;
The preprocessing module is used for:
Carrying out normalization processing on the voice data to obtain audio waveform data;
Converting the audio waveform data into a sound spectrogram, wherein the spectrogram comprises, but is not limited to, a mel frequency spectrum and a linear frequency spectrum; and
setting the pixels of the lower, lip-containing half of each frame in the sequence frames of the image data to 0, so that the generation network generates a completed lip image;
and determining reference frames with the same number as the sequence frames, wherein the reference frames are used for encoding character characteristic information.
6. An electronic device comprising a processor and a memory;
The memory is used for storing programs;
the processor executing the program to implement the method of any one of claims 1-3.
7. A computer readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011372011.4A CN112562720B (en) | 2020-11-30 | 2020-11-30 | Lip-sync video generation method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112562720A CN112562720A (en) | 2021-03-26 |
CN112562720B true CN112562720B (en) | 2024-07-12 |
Family
ID=75045329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011372011.4A Active CN112562720B (en) | 2020-11-30 | 2020-11-30 | Lip-sync video generation method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562720B (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114338959A (en) * | 2021-04-15 | 2022-04-12 | 西安汉易汉网络科技股份有限公司 | End-to-end text-to-video video synthesis method, system medium and application |
CN113179449B (en) * | 2021-04-22 | 2022-04-12 | 清华珠三角研究院 | Method, system, device and storage medium for driving image by voice and motion |
CN113192161B (en) * | 2021-04-22 | 2022-10-18 | 清华珠三角研究院 | Virtual human image video generation method, system, device and storage medium |
CN113194348B (en) * | 2021-04-22 | 2022-07-22 | 清华珠三角研究院 | Virtual human lecture video generation method, system, device and storage medium |
CN113362471A (en) * | 2021-05-27 | 2021-09-07 | 深圳市木愚科技有限公司 | Virtual teacher limb action generation method and system based on teaching semantics |
CN113542624A (en) * | 2021-05-28 | 2021-10-22 | 阿里巴巴新加坡控股有限公司 | Method and device for generating commodity object explanation video |
CN113380269B (en) * | 2021-06-08 | 2023-01-10 | 北京百度网讯科技有限公司 | Video image generation method, apparatus, device, medium, and computer program product |
CN113242361B (en) * | 2021-07-13 | 2021-09-24 | 腾讯科技(深圳)有限公司 | Video processing method and device and computer readable storage medium |
CN113628635B (en) * | 2021-07-19 | 2023-09-15 | 武汉理工大学 | Voice-driven speaker face video generation method based on teacher student network |
WO2023035969A1 (en) * | 2021-09-09 | 2023-03-16 | 马上消费金融股份有限公司 | Speech and image synchronization measurement method and apparatus, and model training method and apparatus |
CN113987269B (en) * | 2021-09-30 | 2025-02-14 | 深圳追一科技有限公司 | Digital human video generation method, device, electronic device and storage medium |
CN113891079A (en) * | 2021-11-11 | 2022-01-04 | 深圳市木愚科技有限公司 | Automatic teaching video generation method, device, computer equipment and storage medium |
CN114071204B (en) * | 2021-11-16 | 2024-05-03 | 湖南快乐阳光互动娱乐传媒有限公司 | Data processing method and device |
CN114220172B (en) * | 2021-12-16 | 2025-04-25 | 云知声智能科技股份有限公司 | A method, device, electronic device and storage medium for lip movement recognition |
CN114419702B (en) * | 2021-12-31 | 2023-12-01 | 南京硅基智能科技有限公司 | Digital person generation model, training method of model, and digital person generation method |
CN114550720A (en) * | 2022-03-03 | 2022-05-27 | 深圳地平线机器人科技有限公司 | Voice interaction method and device, electronic equipment and storage medium |
CN114663962B (en) * | 2022-05-19 | 2022-09-16 | 浙江大学 | Lip-shaped synchronous face counterfeiting generation method and system based on image completion |
CN114998489A (en) * | 2022-05-26 | 2022-09-02 | 中国平安人寿保险股份有限公司 | Virtual character video generation method, device, computer equipment and storage medium |
CN115345968B (en) * | 2022-10-19 | 2023-02-07 | 北京百度网讯科技有限公司 | Virtual object driving method, deep learning network training method and device |
CN115376211B (en) * | 2022-10-25 | 2023-03-24 | 北京百度网讯科技有限公司 | Lip driving method, lip driving model training method, device and equipment |
CN115580743A (en) * | 2022-12-08 | 2023-01-06 | 成都索贝数码科技股份有限公司 | Method and system for driving human mouth shape in video |
CN116248974A (en) * | 2022-12-29 | 2023-06-09 | 南京硅基智能科技有限公司 | A method and system for video language conversion |
CN116433807B (en) * | 2023-04-21 | 2024-08-23 | 北京百度网讯科技有限公司 | Animation synthesis method and device, and training method and device for animation synthesis model |
CN116188637B (en) * | 2023-04-23 | 2023-08-15 | 世优(北京)科技有限公司 | Data synchronization method and device |
CN116741198B (en) * | 2023-08-15 | 2023-10-20 | 合肥工业大学 | A lip synchronization method based on multi-scale dictionary |
CN117150089B (en) * | 2023-10-26 | 2023-12-22 | 环球数科集团有限公司 | A character art image changing system based on AIGC technology |
CN119028369B (en) * | 2024-07-30 | 2025-06-17 | 浙江大学金华研究院 | Face video generation method based on audio-driven face dialogue generation model |
CN119211659B (en) * | 2024-11-26 | 2025-06-20 | 杭州秋果计划科技有限公司 | Stylized digital human video generation method, electronic device and storage medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108347578B (en) * | 2017-01-23 | 2020-05-08 | 腾讯科技(深圳)有限公司 | Method and device for processing video image in video call |
US11003995B2 (en) * | 2017-05-19 | 2021-05-11 | Huawei Technologies Co., Ltd. | Semi-supervised regression with generative adversarial networks |
CN107767325A (en) * | 2017-09-12 | 2018-03-06 | 深圳市朗形网络科技有限公司 | Video processing method and device |
CN109819313B (en) * | 2019-01-10 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Video processing method, device and storage medium |
US11017506B2 (en) * | 2019-05-03 | 2021-05-25 | Amazon Technologies, Inc. | Video enhancement using a generator with filters of generative adversarial network |
CN110706308B (en) * | 2019-09-07 | 2020-09-25 | 创新奇智(成都)科技有限公司 | GAN-based steel coil end face edge loss artificial sample generation method |
CN110610534B (en) * | 2019-09-19 | 2023-04-07 | 电子科技大学 | Automatic mouth shape animation generation method based on Actor-Critic algorithm |
CN111261187B (en) * | 2020-02-04 | 2023-02-14 | 清华珠三角研究院 | Method, system, device and storage medium for converting voice into lip shape |
CN111325817B (en) * | 2020-02-04 | 2023-07-18 | 清华珠三角研究院 | Virtual character scene video generation method, terminal equipment and medium |
CN111783603B (en) * | 2020-06-24 | 2025-05-09 | 有半岛(北京)信息科技有限公司 | Generative adversarial network training method, image face swapping, video face swapping method and device |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111370020A (en) * | 2020-02-04 | 2020-07-03 | 清华珠三角研究院 | Method, system, device and storage medium for converting voice into lip shape |
CN111783566A (en) * | 2020-06-15 | 2020-10-16 | 神思电子技术股份有限公司 | Video synthesis method based on lip language synchronization and expression adaptation effect enhancement |
Also Published As
Publication number | Publication date |
---|---|
CN112562720A (en) | 2021-03-26 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |