CN114793300A

CN114793300A - Virtual video customer service robot synthesis method and system based on generation countermeasure network

Info

Publication number: CN114793300A
Application number: CN202110097183.3A
Authority: CN
Inventors: 张轩宇; 王逸超; 刘昱麟; 朱鹏飞
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2022-07-26

Abstract

The invention relates to the technical field of human face video synthesis, and discloses a virtual video customer service robot synthesis method and system based on a generation countermeasure network. The method and the system for synthesizing the virtual video customer service robot based on the generated countermeasure network have the innovativeness that two schemes for synthesizing the virtual video customer service robot are provided, so that a user can select the virtual video customer service robot independently according to the requirements; the synthesis scheme can enable a user to realize the synthesis of various languages, the arbitrary selection of customer service images and the application of various scenes, and the emotion of a speaker is integrated into the video synthesis process, so that the voice synthesis method has good reality; a set of system based on a Web end is integrated, and a user is supported to directly log in a website, upload audio and video materials, synthesize the audio and video materials on line and produce the audio and video materials in batches quickly.

Description

A synthetic method and system of virtual video customer service robot based on generative adversarial network

技术领域technical field

本发明涉及人脸视频合成技术领域，具体为一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统。The invention relates to the technical field of face video synthesis, in particular to a virtual video customer service robot synthesis method and system based on a generative confrontation network.

背景技术Background technique

人脸视频合成是计算机视觉中一个新兴的、具有挑战性的问题，基于该技术的虚拟视频机器人正在获得越来越多的关注。虚拟视频客服机器人包括唇形生成、表情生成、语音合成等模块，被期待能真实模仿出人说话时的唇动、声音和面部表情。Face video synthesis is an emerging and challenging problem in computer vision, and virtual video robots based on this technology are gaining more and more attention. The virtual video customer service robot includes modules such as lip shape generation, expression generation, and speech synthesis. It is expected to truly imitate the lip movements, voices and facial expressions of people when they speak.

受深度学习在计算机视觉领域成功应用的启发，基于深度学习的人脸视频合成取得了优异的性能和良好的视觉效果。目前，人脸视频合成领域提出了一些具有重要意义的基准数据集，如GRID [1] , TIMIT [2] 和LRW [3]等。这些数据集提供了大量的音频视频数据对，大力推动了人脸视频合成领域的发展。基于上述数据集，涌现出大量的优秀算法，如ObamaNet [4]、LipGAN [5]、ExprGAN [6]、Wav2Lip [7]等。以LipGAN为例，其通过生成对抗网络中生成器的编码解码结构提取音频视频特征，并用鉴别器将生成的视频与真实的视频进行比较，实现了端到端的训练，在静态的图像和动态的视频上均取得了较好的表现。这些算法对于推动人脸视频合成的发展起到了重要作用。近年来，基于人脸视频合成技术，百度、搜狐、科大讯飞等公司都设计出相应的虚拟视频机器人用来完成新闻播报、客服答疑等简单工作，促进了强人工智能的落地与发展。Inspired by the successful application of deep learning in the field of computer vision, deep learning-based face video synthesis has achieved excellent performance and good visual effects. At present, some important benchmark datasets have been proposed in the field of face video synthesis, such as GRID [1], TIMIT [2] and LRW [3]. These datasets provide a large number of audio-video data pairs, which greatly promote the development of the field of face-to-video synthesis. Based on the above datasets, a large number of excellent algorithms have emerged, such as ObamaNet [4], LipGAN [5], ExprGAN [6], Wav2Lip [7], etc. Taking LipGAN as an example, it extracts audio and video features through the encoder-decoding structure of the generator in the generative adversarial network, and uses the discriminator to compare the generated video with the real video to achieve end-to-end training. The video has achieved good performance. These algorithms have played an important role in promoting the development of face video synthesis. In recent years, based on face video synthesis technology, companies such as Baidu, Sohu, and iFLYTEK have designed corresponding virtual video robots to complete simple tasks such as news broadcasts and customer service questions, which has promoted the implementation and development of strong artificial intelligence.

现有技术存在以下缺陷与不足：The prior art has the following defects and deficiencies:

然而，现存的虚拟视频客服合成方法和系统大多无法实现真实可靠的、从文字到视频的一体式合成。具体体现在：无法很好地实现唇形与声音的对齐，无法根据用户需求切换说话者的语言，无法根据所表达词句的情感生成相应的面部表情和语音语调。这些系统虽然具备了视频客服的初级功能，但是无法更好地接近真人说话的习惯，人工处理的痕迹较为明显。However, most of the existing virtual video customer service synthesis methods and systems cannot achieve a true and reliable integrated synthesis from text to video. Specifically, it is not possible to align the lip shape with the voice well, the language of the speaker cannot be switched according to the user's needs, and the corresponding facial expressions and voice intonation cannot be generated according to the emotion of the expressed words and sentences. Although these systems have the primary functions of video customer service, they cannot better approach the habit of talking with real people, and the traces of manual processing are more obvious.

发明内容SUMMARY OF THE INVENTION

针对现有技术的不足，本发明提出了一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统，其创新性在于提出了两种合成虚拟视频客服机器人的方案，可供用户根据需求自主选择；合成方案可以让用户实现各种语言的合成，客服形象的任意选择，多种场景的应用，并且将说话者的情感融入到视频合成的过程中，具有良好的真实性；集成了一套基于Web端的系统，支持用户直接登陆网站，上传音视频材料，在线合成，批量快速生产。In view of the deficiencies of the prior art, the present invention proposes a method and system for synthesizing a virtual video customer service robot based on a generative adversarial network. ; The synthesis scheme allows users to realize the synthesis of various languages, arbitrary selection of customer service images, application of various scenarios, and integrate the speaker's emotion into the process of video synthesis, which has good authenticity; The system on the web side supports users to log in directly to the website, upload audio and video materials, synthesize online, and quickly produce in batches.

为实现上述的一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统目的，本发明提供如下技术方案：一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统，包括唇形生成器模块、表情生成器模块、文本情感分析模块、文本语音合成模块。In order to realize the above-mentioned synthetic method and system purpose of virtual video customer service robot based on generative adversarial network, the present invention provides the following technical solutions: a synthetic method and system of virtual video customer service robot based on generative adversarial network, including a lip generator module , Expression Generator Module, Text Sentiment Analysis Module, Text-to-Speech Synthesis Module.

所述一种基于生成对抗网络的虚拟视频客服机器人合成方法包括以下步骤：The method for synthesizing a virtual video customer service robot based on a generative adversarial network includes the following steps:

步骤一：收集1000段时长在15秒的中央电视台新闻联播视频作为相应的中文语料-视频数据集。在该数据集上训练Wav2Lip、First Order Motion Model模型，使其更加符合汉语发音的特征，作为唇形生成器。Step 1: Collect 1000 CCTV news broadcast videos with a duration of 15 seconds as the corresponding Chinese corpus-video dataset. The Wav2Lip and First Order Motion Model models are trained on this dataset to make them more in line with the characteristics of Chinese pronunciation as a lip generator.

步骤二：在Oulu-CASIA NIR&VIS面部表情数据集上训练ExprGAN模型作为表情生成器，训练双向LSTM模型作为文本情感分析模块，调用百度TTS接口合成带有感情的语音。Step 2: Train the ExprGAN model on the Oulu-CASIA NIR&VIS facial expression dataset as an expression generator, train a bidirectional LSTM model as a text sentiment analysis module, and call the Baidu TTS interface to synthesize speech with emotion.

步骤三：将上述四个模块集成，基于Web端开发。利用VUE框架搭建前端，利用Python的flask、django包封装接口、搭建后端，利用nginx进行反向代理，集成出具有两种方案的虚拟视频客服机器人合成网站和平台。Step 3: Integrate the above four modules and develop based on the web side. Use the VUE framework to build the front-end, use Python's flask and django packages to encapsulate the interface, build the back-end, use nginx for reverse proxy, and integrate a virtual video customer service robot synthesis website and platform with two schemes.

步骤四：用户根据自身的需求选择对应的两种合成方案。Step 4: The user selects two corresponding synthesis schemes according to their own needs.

步骤五：登录网站，提交上述原始材料，即可合成出虚拟客服的面部视频。Step 5: Log in to the website and submit the above-mentioned original materials to synthesize the facial video of the virtual customer service.

其中，步骤一所述方案为迁移合成，更加适用于对唇形对齐要求高的场景，能够清晰真实的人脸视频；Among them, the solution described in step 1 is migration synthesis, which is more suitable for scenes with high requirements for lip alignment, and can provide clear and real face videos;

其中，步骤二所述方案为文本合成，更加适用于大规模的商业级应用场景，能够根据文字直接合成出真实的唇形、表情、声音，合成视频具有良好的时序稳定性，合成迅速，效果逼真。Among them, the solution described in step 2 is text synthesis, which is more suitable for large-scale commercial-level application scenarios, and can directly synthesize real lips, expressions, and voices according to the text. The synthesized video has good timing stability, rapid synthesis, and good effect. lifelike.

进一步地，若用户选择步骤一所述方案，需要向平台服务器提供一段预先朗读过相应文字的源视频和视频客服的形象图片；Further, if the user chooses the solution described in step 1, it is necessary to provide the platform server with a source video that has read the corresponding text aloud in advance and an image picture of the video customer service;

进一步地，若用户选择步骤二所述方案，需要向平台服务器提供代表虚拟客服形象的任意视频和客服即将朗读的文字。Further, if the user chooses the solution described in step 2, the platform server needs to provide any video representing the virtual customer service image and the text that the customer service will read aloud.

其中，所述Wav2Lip模型具体为对连续帧的视频和音频进行特征提取，引入合成损失，通过生成对抗网络合成具有良好平滑性的唇动视频。所述First Order Motion Model模型具体为无需使用任何标签或先验信息进行图片的动画处理。即通过在一组描绘面部特征的视频上进行训练，就可以将模型用于唇形的迁移。Among them, the Wav2Lip model specifically performs feature extraction on video and audio of consecutive frames, introduces synthesis loss, and synthesizes lip motion video with good smoothness through generative adversarial network. The First Order Motion Model model specifically does not need to use any labels or prior information to perform animation processing of pictures. That is, by training on a set of videos depicting facial features, the model can be used for lip shape transfer.

其中，所述ExprGAN模型具体是具有可控制表情强度的表情编辑算法，可以将面部图像更改为具有多种样式的目标表情，表情强度也可以连续控制。所述双向LSTM模型具体为使用双向LSTM模型来分析文本情感，用于更好地处理程度词和捕捉双向的语义依赖。The ExprGAN model is specifically an expression editing algorithm with controllable expression intensity, which can change facial images to target expressions with various styles, and the expression intensity can also be continuously controlled. The bidirectional LSTM model specifically uses the bidirectional LSTM model to analyze text sentiment, so as to better process degree words and capture bidirectional semantic dependencies.

进一步地，若用户选择步骤一所述方案，则直接通过训练好的First OrderMotion模型生成具有精准唇动、自然的面部表情的视频；Further, if the user selects the scheme described in step 1, then directly generate a video with precise lip movement and natural facial expressions through the trained First OrderMotion model;

进一步地，若用户选择步骤二所述方案，则将文本输入情感分析模块，分析出相应情感；通过调用TTS生成相应语音语调的音频；将视频输入唇形生成器，与TTS生成的音频共同合成带有唇动的视频；将上述视频输入表情生成器，根据分析出的情感，调整面部表情，得到结果。Further, if the user selects the scheme described in step 2, then the text is input into the sentiment analysis module to analyze the corresponding emotion; the audio of the corresponding voice intonation is generated by calling TTS; the video is input into the lip generator, and the audio generated by the TTS is jointly synthesized. Video with lip movement; input the above video into the expression generator, adjust the facial expression according to the analyzed emotion, and get the result.

所述一种基于生成对抗网络的虚拟视频客服机器人合成系统，所述装置包括：云服务器、存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现所述的集成方案和方法步骤。The virtual video customer service robot synthesis system based on generative adversarial network, the device includes: a cloud server, a memory, a processor, and a computer program stored in the memory and running on the processor, the processor executing the The described integration scheme and method steps are implemented when the described program is executed.

与现有技术相比，本发明提供了一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统，具备以下有益效果：Compared with the prior art, the present invention provides a virtual video customer service robot synthesis method and system based on a generative confrontation network, which has the following beneficial effects:

1、本一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统，提出了两种合成虚拟视频客服机器人的方案，即步骤一所述方案为预先录制一段朗读相应文字的视频，并将该面部特征迁移到虚拟客服的形象图片上；步骤二所述方案为输入文字和虚拟客服的形象视频，直接合成带有情感和真实唇动的虚拟客服视频；1. This method and system for synthesizing virtual video customer service robots based on generative adversarial networks proposes two solutions for synthesizing virtual video customer service robots, that is, the solution described in step 1 is to pre-record a video of reading the corresponding text aloud, and combine the The facial features are migrated to the image picture of the virtual customer service; the solution described in step 2 is to input text and the image video of the virtual customer service, and directly synthesize the virtual customer service video with emotions and real lip movements;

2、本一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统，提出的合成方案可以让用户实现各种语言的合成，客服形象的任意选择，多种场景的应用，并且将说话者的情感融入到视频合成的过程中，具有更好的真实性和良好的拓展性；2. This method and system for synthesizing virtual video customer service robots based on generative adversarial networks, the proposed synthesis scheme allows users to realize the synthesis of various languages, the arbitrary selection of customer service images, the application of various scenarios, and the integration of the speaker's Emotion is integrated into the process of video synthesis, which has better authenticity and good scalability;

3、本一种基于生成对抗网络的虚拟视频客服机器人合成方法和系统，集成了一套基于Web端的系统，支持用户直接登陆网站，上传音视频材料，在线合成，批量快速生产。3. This method and system for synthesizing virtual video customer service robots based on generative adversarial networks integrates a web-based system, which supports users to directly log in to the website, upload audio and video materials, synthesize online, and quickly produce in batches.

附图说明Description of drawings

图1为本发明基于生成对抗网络的虚拟视频客服机器人合成方法流程图；Fig. 1 is the flow chart of the synthetic method of virtual video customer service robot based on generative adversarial network of the present invention;

图2为本发明步骤二所述方案 Wav2Lip整体网络结构示意图。FIG. 2 is a schematic diagram of the overall network structure of the solution Wav2Lip described in step 2 of the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

请参阅图1-2，一种基于生成对抗网络的虚拟视频客服机器人合成系统，包括唇形生成器模块、表情生成器模块、文本情感分析模块、文本语音合成模块。Please refer to Figure 1-2, a virtual video customer service robot synthesis system based on generative adversarial network, including a lip generator module, an expression generator module, a text sentiment analysis module, and a text-to-speech synthesis module.

实施例1：Example 1:

本发明实施例提供了一种基于生成对抗网络的虚拟视频客服机器人合成方法，该方法包括以下步骤：An embodiment of the present invention provides a method for synthesizing a virtual video customer service robot based on a generative adversarial network, the method comprising the following steps:

101：使用you-get工具收集1000段不同人物的中央电视台新闻联播视频作为相应的中文语料-视频数据集，并按照LRS2数据集的格式整理。101: Use the you-get tool to collect 1,000 CCTV news broadcast videos of different characters as the corresponding Chinese corpus-video dataset, and organize them in the format of the LRS2 dataset.

进一步地，用ffmpeg工具从视频中提取音频，并通过python库librosa将音频文件转换为梅尔块供网络读取，将视频裁剪为分辨率256*256，时长为15秒的MP4格式文件完成数据集的预处理。Further, extract the audio from the video with the ffmpeg tool, and convert the audio file into mel blocks for network reading through the python library librosa, and crop the video into MP4 format files with a resolution of 256*256 and a duration of 15 seconds to complete the data. Preprocessing of the set.

102：在收集的中文数据集上训练Wav2Lip网络模型。该模型能通过面部解码器、音频解码器提取声音与唇形之间的映射关系，生成出合成唇形，并通过预训练好的唇形合成鉴别器和与生成器联合训练的视觉效果鉴别器不断修正合成效果，作为步骤二所述方案的唇形生成器。102: Train the Wav2Lip network model on the collected Chinese dataset. The model can extract the mapping relationship between sound and lip shape through the face decoder and audio decoder, generate a synthetic lip shape, and pass the pre-trained lip shape synthesis discriminator and the visual effect discriminator jointly trained with the generator. Constantly modifying the synthetic effect as a lip generator for the scheme described in step 2.

具体实现时，在原网络的预训练模型上进行训练，使得网络能够在保持原有性能的基础上兼顾中文发音的特点，提升唇形合成效果。In the specific implementation, training is performed on the pre-training model of the original network, so that the network can take into account the characteristics of Chinese pronunciation on the basis of maintaining the original performance, and improve the effect of lip synthesis.

103：训练First Order Motion、ExprGAN、Bi-LSTM三个模型分别作为上述方案1的唇形合成器、方案2的表情生成器和文本情感分析模块。103: Train the three models of First Order Motion, ExprGAN, and Bi-LSTM as the lip synthesizer of the above scheme 1, the expression generator of the scheme 2, and the text sentiment analysis module respectively.

104：调用百度语音合成API接口并集成上述训练好的两种方案的各个模型。用vue框架搭建前端，用python的flask、django包封装模型接口，搭建后端，基于Web端搭建网站。104: Invoke the Baidu speech synthesis API interface and integrate the models of the two trained schemes above. Use the vue framework to build the front end, use python's flask and django packages to encapsulate the model interface, build the back end, and build a website based on the web side.

105：用户可根据自身需求选择合成步骤一所述方案和步骤二所述方案。若选择步骤一所述方案，需要准备预先朗读好的视频和虚拟客服的形象图片；若选择步骤二所述方案，需要准备虚拟客服的形象视频（或图像）和需要机器人朗读的文字。105 : The user can choose to synthesize the scheme described in step 1 and the scheme described in step 2 according to his own needs. If you choose the solution in step 1, you need to prepare a pre-read video and an image of the virtual customer service; if you choose the solution in step 2, you need to prepare the image video (or image) of the virtual customer service and the text to be read by the robot.

106：用户登录网站，提交上述材料，即可获得合成结果。106: The user logs in to the website and submits the above materials, and the synthesis result can be obtained.

综上所述，本发明提出了一种基于生成对抗网络的虚拟视频客服机器人合成方法，其创新性在于提出了两种合成虚拟视频客服机器人的方案，可供用户根据需求自主选择；合成方案可以让用户实现不同语言的合成，客服形象的任意选择，多种场景的应用，并且将说话者的情感融入到视频合成的过程中，具有良好的真实性；集成了一套基于Web端的系统，支持用户直接登陆网站，上传音视频材料，快速批量合成。To sum up, the present invention proposes a method for synthesizing virtual video customer service robots based on generative adversarial networks. It allows users to realize the synthesis of different languages, arbitrary selection of customer service images, application of various scenarios, and integrates the speaker's emotions into the process of video synthesis, which has good authenticity; a set of web-based systems are integrated to support Users directly log in to the website, upload audio and video materials, and quickly synthesize them in batches.

实施例2：Example 2:

下面结合具体的实例、计算公式对实施例1中的方案进行进一步地介绍，详见下文描述：The scheme in Embodiment 1 is further introduced below in conjunction with specific examples and calculation formulas, and is described in detail below:

一、数据准备1. Data preparation

本发明使用you-get工具收集1000段不同人物的中央电视台新闻联播视频作为相应的中文语料-视频数据集，并按照LRS2数据集的格式整理。进一步地，用ffmpeg、librosa工具对数据集进行预处理。The invention uses the you-get tool to collect 1,000 CCTV news broadcast videos of different characters as the corresponding Chinese corpus-video data set, and organizes them according to the format of the LRS2 data set. Further, the dataset is preprocessed with ffmpeg and librosa tools.

该数据集由对应的音频和视频组成。视频部分共包含5个不同男主播和5个不同女主播的播音内容，帧率为25 fps，分辨率均被裁剪为256*256，时长为25秒，格式为MP4；音频部分是从视频中提取出的梅尔块，用于网络直接获取声音信息。The dataset consists of corresponding audio and video. The video part contains the broadcast content of 5 different male anchors and 5 different female anchors, the frame rate is 25 fps, the resolution is all cropped to 256*256, the duration is 25 seconds, and the format is MP4; the audio part is from the video The extracted Mel block is used for the network to obtain sound information directly.

二、模型的训练2. Model training

本发明共包含四个模块分别是：唇形生成模块、表情生成模块、文本情感分析模块和语音合成模块，具体如下。The present invention includes four modules: a lip shape generation module, an expression generation module, a text emotion analysis module and a speech synthesis module, the details are as follows.

（1）唇形生成模块：(1) Lip shape generation module:

步骤一所述方案中的唇形生成模块采用First Order Motion模型，无需使用任何标签或先验信息进行图片的动画处理。该模型通过在一组描绘面部特征的视频上进行训练，之后就可以将模型用于唇形的迁移。具体实现是使用生成对抗网络的方法将外观信息与运动信息分离。为了支持模型对复杂运动的鲁棒性，该模型提取源视频中的面部关键点和局部仿射变换，生成器网络对目标物体的运动进行建模，即从源图像中提取静态外观信息与驱动视频中获得的运动信息组合，得到合成视频。The lip shape generation module in the solution described in step 1 adopts the First Order Motion model, and does not need to use any labels or prior information for animation processing of pictures. The model was trained on a set of videos depicting facial features, and the model was then used for lip shape transfer. The specific implementation is to use a generative adversarial network approach to separate appearance information from motion information. To support the robustness of the model to complex motion, the model extracts facial keypoints and local affine transformations in the source video, and the generator network models the motion of the target object, i.e. extracts static appearance information from the source image and drives The motion information obtained in the video is combined to obtain a composite video.

步骤二所述方案中的唇形生成模块采用Wav2Lip模型，该模型由一个生成器和两个鉴别器组成。生成器可分为面部信息编码器、语音信息编码器、面部信息解码器。面部信息编码器由一系列跳跃连接的残差卷积块组成。其将一组随机视频帧R的唇部信息遮掩作为先验姿态P，与R按通道数级联作为编码器输入。该编码器提取输入中的唇形信息作为面部特征图供后续网络的解码重建；语音编码器由一系列二维卷积块编码，提取输入梅尔块S中的语音信息，之后与面部特征图级联；面部信息解码器对上述两个编码器编码出的特征进行解码，通过一系列的上采样和反卷积操作重建出与音频相匹配的唇形视频，具体的唇形重建L1损失如下：The lip generation module in the solution described in step 2 adopts the Wav2Lip model, which consists of a generator and two discriminators. The generator can be divided into face information encoder, voice information encoder, and face information decoder. The facial information encoder consists of a series of skip-connected residual convolutional blocks. It takes the lip information masking of a set of random video frames R as the prior pose P, which is cascaded with R according to the number of channels as the encoder input. The encoder extracts the lip shape information in the input as a facial feature map for the subsequent decoding and reconstruction of the network; the speech encoder is encoded by a series of two-dimensional convolution blocks, extracts the speech information in the input Mel block S, and then combines the facial feature map with the facial feature map. Cascading; the facial information decoder decodes the features encoded by the above two encoders, and reconstructs a lip video that matches the audio through a series of upsampling and deconvolution operations. The specific lip reconstruction L1 loss is as follows :

其中，Lg为生成器重建出的唇形，LG为真实的图像，N为输入图像的数量Among them, Lg is the lip shape reconstructed by the generator, LG is the real image, and N is the number of input images

唇形同步鉴别器用来惩罚与音频不同步的唇形生成。将生成的视频帧按时间维度级联输入预训练好的唇形同步鉴别器时，鉴别器将对生成面部的下半部分进行鉴别，最小化同步损失，具体如下：The lip-sync discriminator is used to penalize lip generation that is out of sync with the audio. When the generated video frames are cascaded into the pre-trained lip-sync discriminator according to the time dimension, the discriminator will discriminate the lower half of the generated face to minimize the synchronization loss, as follows:

该唇形同步生成器的权重在GAN网络的训练过程中保持不变，其对于判断唇形音频是否同步具有91%的准确率，能够很好地约束生成器的训练。The weight of the lip sync generator remains unchanged during the training of the GAN network, and it has a 91% accuracy rate for judging whether the lip audio is synchronized, which can well constrain the training of the generator.

视觉效果鉴别器与生成器网络联合训练，该鉴别器用来约束失真的面部生成。鉴别器D由一系列卷积块组成。每个块包含一个卷积层和一个Leaky ReLU激活层。在鉴别训练过程中，网络使得损失Ldisc最小化，具体如下：The visual effects discriminator is jointly trained with the generator network, which is used to constrain distorted face generation. The discriminator D consists of a series of convolutional blocks. Each block contains a convolutional layer and a Leaky ReLU activation layer. During the discriminative training process, the network minimizes the loss Ldisc as follows:

最终，网络的总损失为Ultimately, the total loss of the network is

其中，sw、sg为预先设定的参数，Lrecon为上文所提的重建损失，Esync为上文所提的同步损失，Lgen为生成器损失。Among them, sw and sg are preset parameters, Lrecon is the reconstruction loss mentioned above, Esync is the synchronization loss mentioned above, and Lgen is the generator loss.

（2）表情生成模块：(2) Expression generation module:

表情生成模块由ExprGAN模型构成。ExprGAN是可控制表情强度的表情编辑算法，可以将面部图像更改为具有多种样式的目标表情，表情强度也可以连续控制。ExprGAN的生成器由编码器和解码器组成，编码器的输入是面部图像，解码器的输出是对重建的图像；ExprGAN的鉴别器用来约束表情的强度和真实性。整个网络可分为三个阶段：控制器学习阶段，图像重建阶段和图像细化阶段。经过三个阶段的生成即可得到具有指定表情的面部视频。The expression generation module consists of the ExprGAN model. ExprGAN is an expression editing algorithm with controllable expression intensity, which can change facial images to target expressions with various styles, and expression intensity can also be continuously controlled. The generator of ExprGAN consists of an encoder and a decoder, the input of the encoder is the face image, and the output of the decoder is the reconstructed image; the discriminator of ExprGAN is used to constrain the strength and authenticity of the expression. The whole network can be divided into three stages: controller learning stage, image reconstruction stage and image refinement stage. After three stages of generation, facial videos with specified expressions can be obtained.

（3）文本情感分析模块：(3) Text sentiment analysis module:

文本情感分析即分析句子情感倾向，本发明使用双向LSTM模型来分析文本情感，可以更好的处理程度词和捕捉双向的语义依赖。双向LSTM模型是由前向的LSTM和后向的LSTM两个模型合成得到的。LSTM模型由t时刻的输入词，细胞状态，临时细胞状态，隐层状态，遗忘门，记忆门，输出门组成。其计算过程可以概括为，门控细胞状态进行新信息的遗忘和记忆，使得对后续时刻计算有用的信息得以传递，而无用的信息被丢弃；其中前一步的隐层状态和新的输入参与了每一步的运算，决定了每一步的遗忘与记忆的内容。将正向和反向的LSTM合成，即将所得到的两个LSTM的隐层状态的输出结果拼接，就可以得到所需要的情感倾向的判断。本发明系统中将情感分为中立、开心、愤怒、伤心、惊喜、恐惧六种情感。Text sentiment analysis is to analyze the sentiment tendency of sentences. The present invention uses a two-way LSTM model to analyze text sentiment, which can better process degree words and capture two-way semantic dependencies. The bidirectional LSTM model is synthesized from the forward LSTM and the backward LSTM. The LSTM model consists of the input word at time t, the cell state, the temporary cell state, the hidden layer state, the forgetting gate, the memory gate, and the output gate. The calculation process can be summarized as: gating the cell state to forget and memorize new information, so that the information useful for subsequent calculations can be transmitted, and the useless information is discarded; the hidden layer state and new input of the previous step are involved. The operation of each step determines the content of forgetting and memory in each step. By synthesizing the forward and reverse LSTMs, that is, by splicing the output results of the hidden layer states of the two LSTMs, the required judgment of emotional tendencies can be obtained. In the system of the present invention, emotions are divided into six kinds of emotions: neutral, happy, angry, sad, surprise, and fear.

（4）语音合成模块：(4) Speech synthesis module:

本发明在该模块调用百度的TTS接口。该技术可以较好地完成中文语音合成，韵律处理能够自然地处理文本的断句，多音字等问题，效果较为逼真，能很好地为整个系统服务。The present invention calls Baidu's TTS interface in this module. This technology can well complete Chinese speech synthesis, and prosody processing can naturally deal with text segmentation, polyphonic words and other problems, the effect is more realistic, and can serve the whole system well.

三、模型的集成3. Model integration

四个模块的集成方法为若用户选择步骤一所述方案，则直接通过训练好的FirstOrder Motion模型生成具有精准唇动、自然的面部表情的视频；若用户选择步骤二所述方案，则将文本输入情感分析模块，分析出相应情感；通过调用TTS生成相应语音语调的音频；将视频输入唇形生成器，与TTS生成的音频共同合成音画同步的唇动视频；将上述视频输入表情生成器，根据分析出的情感，调整面部表情，得到结果。The integration method of the four modules is that if the user selects the solution described in step 1, the trained FirstOrder Motion model will directly generate a video with precise lip movements and natural facial expressions; if the user chooses the solution described in step 2, the text will be Input the emotion analysis module to analyze the corresponding emotions; generate the audio of the corresponding voice intonation by calling TTS; input the video into the lip generator, and synthesize the lip motion video synchronized with the audio and picture together with the audio generated by the TTS; input the above video into the expression generator , according to the analyzed emotion, adjust the facial expression to get the result.

本模型实施例具有以下三个关键创造点：This model embodiment has the following three key creation points:

一、提出了两种合成虚拟视频客服机器人的方案；1. Two schemes for synthesizing virtual video customer service robots are proposed;

技术效果：步骤一所述方案为迁移合成，合成效果逼真，适用于对视频真实要求性较高的场景；步骤二所述方案为文本合成，可快速根据文字一站式合成出真实的唇形、表情、声音，合成视频具有良好的时序稳定性，更加适用于大规模的商业级应用场景。Technical effect: The solution described in step 1 is migration synthesis, and the synthesis effect is realistic, which is suitable for scenes with high requirements for video authenticity; the solution described in step 2 is text synthesis, which can quickly synthesize real lips according to the text in one stop. , expressions, voices, and synthetic videos have good timing stability and are more suitable for large-scale commercial-level application scenarios.

二、提出了让合成的中文视频具有精准唇动、自然表情的方法；2. Propose a method to make the synthesized Chinese video have precise lip movements and natural expressions;

技术效果：训练的模型具有优异的性能，唇形同步误差LSE-D由原来的10.33降到了6.39，唇形同步置信度LSE-C由3.199上升到了7.789，视觉质量由3.91提升到4.12。同时模型由单纯的合成唇形融入了表情的合成。Technical effect: The trained model has excellent performance, the lip-sync error LSE-D is reduced from 10.33 to 6.39, the lip-sync confidence LSE-C is increased from 3.199 to 7.789, and the visual quality is improved from 3.91 to 4.12. At the same time, the model is composed of simple synthetic lips and expressions.

三、集成了合成虚拟视频客服机器人的系统。3. A system integrating a synthetic virtual video customer service robot.

技术效果：将四个模块集成为一套系统，并搭建网站，能够很好的实现两种方案的一站式合成。Technical effect: Integrating the four modules into a system and building a website can well realize the one-stop synthesis of the two schemes.

综上所述，本方法通过四个模块、两种方案实现了虚拟视频客服机器人的合成，能够精准地驱动唇形、自然地合成表情与语音，视觉效果良好。同时，集成的系统能够让用户快速批量的生产虚拟视频客服机器人。To sum up, this method realizes the synthesis of virtual video customer service robots through four modules and two schemes, which can accurately drive the lip shape, naturally synthesize expressions and voices, and has a good visual effect. At the same time, the integrated system enables users to quickly mass-produce virtual video customer service robots.

实施例3：Example 3:

本发明实施例不仅可以用在虚拟视频客服的生成中，也可以用在如下应用场景。The embodiments of the present invention can be used not only in the generation of virtual video customer service, but also in the following application scenarios.

如让历史人物、静态图画完成唱歌、说节日祝福等特定动作，如提前导入问题语料库，即可将虚拟视频客服机器人的系统应用到校园迎新机器人、心理咨询机器人等，能够使学生与机器人实现面对面的真实交流，实现更好的人机交互。For example, let historical figures and static pictures complete specific actions such as singing and saying holiday wishes, such as importing the question corpus in advance, the system of virtual video customer service robot can be applied to campus welcome robots, psychological counseling robots, etc., enabling students and robots to achieve face-to-face Real communication, to achieve better human-computer interaction.

实施例4：Example 4:

一种基于生成对抗网络的虚拟视频客服机器人合成系统，该系统包括：网站域名、云服务器、存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，该处理器执行程序时实施实施例1和2中的方法步骤。A virtual video customer service robot synthesis system based on generative adversarial network, the system includes: a website domain name, a cloud server, a memory, a processor and a computer program stored in the memory and running on the processor, when the processor executes the program The method steps in Examples 1 and 2 were carried out.

需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。It should be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any relationship between these entities or operations. any such actual relationship or sequence exists. Moreover, the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device that includes a list of elements includes not only those elements, but also includes not explicitly listed or other elements inherent to such a process, method, article or apparatus.

尽管已经示出和描述了本发明的实施例，对于本领域的普通技术人员而言，可以理解在不脱离本发明的原理和精神的情况下可以对这些实施例进行多种变化、修改、替换和变型，本发明的范围由所附权利要求及其等同物限定。Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, and substitutions can be made in these embodiments without departing from the principle and spirit of the invention and modifications, the scope of the present invention is defined by the appended claims and their equivalents.

Claims

1. a virtual video customer service robot synthesis method and system based on generative confrontation network, it is characterized in that: described a kind of virtual video customer service robot synthesis system based on generative confrontation network, comprises lip shape generator module, expression generator module, Text sentiment analysis module, text-to-speech synthesis module.

2. a kind of virtual video customer service robot synthesis method and system based on generative confrontation network according to claim 1, is characterized in that: described a kind of virtual video customer service robot synthesis method based on generative confrontation network comprises the following steps:

Step 1: Collect 1000 CCTV news broadcast videos with a duration of 15 seconds as the corresponding Chinese corpus-video data set, and train the Wav2Lip and First Order Motion Model models on this data set to make it more in line with the characteristics of Chinese pronunciation. lip generator;

Step 2: Train the ExprGAN model as an expression generator on the Oulu-CASIA NIR&VIS facial expression dataset, train the bidirectional LSTM model as a text sentiment analysis module, and call the Baidu TTS interface to synthesize speech with emotion;

Step 3: Integrate the above four modules, develop based on the web, use the VUE framework to build the front-end, use Python's flask and django packages to encapsulate the interface, build the back-end, use nginx for reverse proxy, and integrate a virtual system with two solutions. Video customer service robot synthesis website and platform;

Step 4: The user selects the corresponding two synthesis schemes according to their own needs;

Step 5: Log in to the website and submit the above-mentioned original materials to synthesize the facial video of the virtual customer service.

3. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 1, is characterized in that: the scheme described in step 1 is migration synthesis, and is more suitable for lip-shaped alignment requirements high Scene, can clear and real face video.

4. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2, it is characterized in that: the scheme described in step 2 is text synthesis, which is more suitable for large-scale commercial application scenarios , which can directly synthesize real lips, expressions, and sounds according to the text. The synthetic video has good timing stability, rapid synthesis, and realistic effects.

5. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 1, is characterized in that: if the user selects the scheme described in step 1, it is necessary to provide a section of pre-reading corresponding to the platform server. The source video of the text and the image picture of the video customer service.

6. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 2, it is characterized in that: if the user selects the scheme described in step 2, it is necessary to provide the platform server with the image representing the virtual customer service image. Any video and text that the agent will read aloud.

7. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 1, is characterized in that: described Wav2Lip model specifically is to carry out feature extraction to the video and audio of continuous frame, introduce synthetic Loss, synthesizing a lip motion video with good smoothness through a generative adversarial network, the First OrderMotion Model model specifically does not need to use any labels or prior information to animate pictures, that is, by performing on a set of videos depicting facial features. After training, the model can be used for lip transfer.

8. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 2, it is characterized in that: described ExprGAN model specifically has the expression editing algorithm of controllable expression intensity, can be facial expression editing algorithm. The image is changed to a target expression with multiple styles, and the expression intensity can also be continuously controlled. The bidirectional LSTM model specifically uses a bidirectional LSTM model to analyze text sentiment, which is used to better process degree words and capture bidirectional semantic dependencies.

9. a kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 1, is characterized in that: if the user selects the scheme described in step 1, then directly generates by trained First OrderMotion model Videos with precise lip movements, natural facial expressions.

10. A kind of virtual video customer service robot synthesis method and system based on generative adversarial network according to claim 2 step 2, it is characterized in that: if the user selects the scheme described in step 2, then the text is input into the sentiment analysis module, and the result is analyzed. Corresponding emotion, generate the audio of the corresponding voice tone by calling TTS, input the video into the lip generator, synthesize the video with lip movement together with the audio generated by TTS, input the above video into the expression generator, and adjust the emotion according to the analyzed emotion. Facial expressions, get results.

11. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: the virtual video customer service robot synthesizing system based on a generative adversarial network, the device comprises: A cloud server, a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implements the integration scheme and method steps when the processor executes the program.