CN110543846A - A method of frontalizing multi-pose face images based on generative adversarial networks - Google Patents
- Publication number
- CN110543846A (application CN201910806159.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- face
- face image
- input
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 34
- 238000012549 training Methods 0.000 claims abstract description 38
- 230000036544 posture Effects 0.000 claims abstract description 12
- 238000012360 testing method Methods 0.000 claims abstract description 8
- 238000012937 correction Methods 0.000 claims description 14
- 230000000694 effects Effects 0.000 claims description 12
- 238000005286 illumination Methods 0.000 claims description 12
- 238000013527 convolutional neural network Methods 0.000 claims description 9
- 238000013528 artificial neural network Methods 0.000 claims description 8
- 210000002569 neuron Anatomy 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000004936 stimulating effect Effects 0.000 claims description 3
- 230000002194 synthesizing effect Effects 0.000 claims description 3
- 238000005457 optimization Methods 0.000 claims 1
- 230000009286 beneficial effect Effects 0.000 abstract description 2
- 238000011161 development Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 59
- 238000005516 engineering process Methods 0.000 description 6
- 238000001514 detection method Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 230000004913 activation Effects 0.000 description 1
- 230000002411 adverse Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a method for frontalizing multi-pose face images based on a generative adversarial network. In the training stage, face images of various poses are first collected as a data set; then multiple groups of frontal and non-frontal face images of the same person are input, and the generative network and the discriminative network are trained alternately under newly designed loss functions until the values of the loss functions converge stably. In the test stage after training, the invention can rectify input face images of arbitrary pose into frontal face images. The rectified images are not only sharp but also retain the identity features of the original faces, so they can be used for face recognition. The invention effectively mitigates the negative impact of pose on face recognition and promotes the practical application of face recognition under unconstrained conditions.
Description
Technical Field
The present invention relates to the technical field of image processing, and in particular to a method for frontalizing multi-pose face images based on generative adversarial networks.
Background Art
At present, face recognition technology is widely used in many fields such as access control and security, social networking, and finance. In the vast majority of practical scenarios, however, face recognition only works efficiently under strictly controlled conditions: the subject must stand in sufficient and uniform lighting, maintain a neutral expression, and adopt a standard pose in front of the image acquisition device. In many practical applications, such as suspect tracking, these conditions are hard to satisfy, which sharply degrades the performance of many face recognition systems and hinders their adoption in those fields. Among the adverse factors affecting recognition performance, the pose of the face in the photograph is the most important. Solving the pose problem would be a major step toward applying face recognition in unconstrained environments.
One way to deal with the pose problem is to frontalize the input profile image, that is, to rectify a profile image into a frontal image of the same person and then identify the person from the synthesized frontal image. Most existing frontalization methods for multi-pose face images cannot handle face images whose deflection angle exceeds 60°: the faces they synthesize are severely deformed and lose the subject's identity features, which makes subsequent face recognition difficult. The frontalization methods that currently perform best are essentially based on generative adversarial networks.
Compared with other frontalization methods based on generative adversarial networks, the present invention adopts a different network structure and different loss functions. Even when the input face is deflected by more than 60°, the model can synthesize a realistic frontal image and retain more identity information, which greatly improves the efficiency of subsequent face recognition.
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of existing multi-pose face image frontalization methods. It proposes a frontalization method based on a generative adversarial network that solves the poor performance of comparable methods when the input face is deflected by more than 60°, improves the fidelity of the synthesized face, and retains more of the identity information of the face in the input image.
To achieve the above purpose, the technical solution provided by the present invention is a method for frontalizing multi-pose face images based on a generative adversarial network, comprising the following steps:
1) Collect face images of various poses as the training set and test set. It must be ensured that for every input face image Ia of arbitrary pose, a non-synthesized frontal image Ig of the same person can be found in the data set.
2) In the training stage, input a face image Ia of arbitrary pose from the training set into the generator G to obtain the corrected code X2 and the synthesized frontal image If; input the non-synthesized frontal image Ig into the generator G to obtain the frontal code X3.
3) Input the synthesized frontal image If or the non-synthesized frontal image Ig into the discriminative network D, which judges whether the input face image is synthesized or not; then input If or Ig into the face identity feature extractor F, which extracts the identity features of the face image.
4) Feed the discrimination result of step 3), the extracted identity features, the synthesized frontal image If, the non-synthesized frontal image Ig, the corrected code X2 and the frontal code X3 into the pre-designed loss functions, and train the generator G and the discriminative network D alternately until training ends.
5) In the test stage, input a face image Ia of arbitrary pose into the trained generator G to obtain a synthesized frontal image If; the effect is verified by directly observing the quality of the synthesized image If.
In step 1), the face images of all poses come from the Multi_Pie data set, which contains more than 750,000 images covering 337 people under 20 illumination conditions and 15 poses. The illumination of the images changes from dark to bright as the illumination label runs from 01 to 20, with label 07 being the standard illumination condition. All face images in the data set are labeled Ia; for each image Ia, the image of the same person with a face deflection angle of 0° and illumination label 07 is found in the data set and labeled Ig.
In step 2), the generator G consists of a pose estimator P, an encoder En, an encoding residual network R and a decoder De. The pose estimator P uses the PnP algorithm to estimate the face pose and from it obtains the deflection angle of the face in the yaw direction; it is implemented with the function cv2.solvePnP() of the open-source opencv library. The encoder En is a convolutional neural network, the encoding residual network R is a two-layer fully connected neural network, and the decoder De is a deconvolutional neural network.
The generator G synthesizes an image as follows: a face image Ia of arbitrary pose is input into G; the encoder En converts it into an initial code X1; the encoding residual network R estimates the encoding residual R(X1) from X1; the pose estimator P computes the precise yaw deflection angle γ of the face in the input image; γ is fed into a function Y to obtain the residual weight Y(γ); the initial code and the weighted residual are fused into the corrected code X2, where X2 = X1 + Y(γ)×R(X1); finally, X2 is input into the decoder De, which generates the frontal image If by deconvolution.
In step 3), the discriminative network D is a binary classifier based on a convolutional neural network that judges whether the input image comes from the generator G or from the original image data. The face identity feature extractor F uses the open-source Light-CNN-29, a lightweight convolutional neural network 29 layers deep with about 12 million parameters.
In step 4), the goal of the loss functions is to minimize the difference between the synthesized frontal image If and the non-synthesized frontal image Ig, so that If retains more of the identity information of the input face image. In addition to the pixel loss, identity loss, symmetry loss and adversarial loss commonly used by methods of this type, the loss functions used in step 4) include a newly devised encoding loss. The first goal of the encoding loss is to bring the codes that the encoder En produces for the input face image Ia and for the non-synthesized frontal image Ig closer together: current face frontalization methods work better the smaller the input deflection angle, so the closer the code of the input face image is to the code of the same face at a 0° deflection angle, the better the synthesized frontal face. The first part of the encoding loss is therefore:
Lcode1 = Σ(i=1..N) |En(Ia)i + R(En(Ia))i − En(Ig)i|
where N is the dimensionality of each code; En is the encoder and En(Ia)i is the value of the i-th dimension of the initial code En(Ia); R is the encoding residual network and R(En(Ia))i is the value of the i-th dimension of the encoding residual R(En(Ia)); En(Ia)i + R(En(Ia))i equals the i-th dimension of the corrected code; and En(Ig)i is the i-th dimension of the frontal code X3 obtained by passing the non-synthesized frontal image through the encoder. The first part of the encoding loss is thus the Manhattan distance between the corrected code and the frontal code.
The second goal of the encoding loss is to keep the corrected codes of different people distinguishable. A fully connected layer C is built on top of the corrected code En(Ia) + R(En(Ia)); the number of neurons in C equals the number of people M in the training set, and the second part of the encoding loss is a cross-entropy loss:
Lcode2 = −Σ(i=1..M) yi·log C(En(Ia) + R(En(Ia)))i
where M is the number of people in the training set; yi is the i-th dimension of the one-hot vector y, which indicates which person in the training set the input face image Ia belongs to (if Ia belongs to the j-th person, the j-th dimension of y is 1, all other dimensions are 0, and y has dimensionality M); En(Ia) + R(En(Ia)) is the corrected code; and C(En(Ia) + R(En(Ia)))i is the i-th dimension of the feature vector obtained by passing the corrected code through the fully connected layer C.
The complete encoding loss is therefore:
Lcode = Lcode1 + λ·Lcode2
where λ is a constant weight with value 0.1.
In step 4), training the generator G and the discriminative network D alternately lets the two optimize and improve each other through their adversarial game. In the initial stage, the face images generated by G are blurred and D can easily judge where an input image comes from, which pushes G to generate sharper images and improves its quality. In later stages, the images generated by G are sharper and close to the original image data, which pushes D to make more precise judgments and improves its discriminative ability.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The invention makes full use of the rotation-angle information of the input face image and defines a corresponding encoding loss, which helps synthesize higher-quality frontal images.
2. Even when the deflection angle of the input face exceeds 60°, the invention can generate a sharp and realistic frontal image without deformation.
3. The frontal image synthesized by the invention retains the identity information of the input face image, which helps reduce the adverse effect of pose variation on face recognition and facilitates subsequent face identification.
4. From the perspective of real application scenarios, the invention is expected to advance fields such as suspect tracking: rectifying a profile image of a target person into a frontal image improves the efficiency of the related work.
Brief Description of the Drawings
Figure 1 is a flowchart of the method of the present invention.
Figure 2 is a flowchart of image synthesis by the generator.
Figure 3 shows the neural network structure of the encoder.
Figure 4 shows the neural network structure of the decoder.
Figure 5 shows the structure of the discriminative network.
Figure 6 shows synthesis results of the present invention.
Detailed Description of the Embodiments
To describe the present invention more concretely, its technical solution is explained in detail below with reference to the steps, the drawings and a specific embodiment.
As shown in Figure 1, the method for frontalizing multi-pose face images based on a generative adversarial network provided by this embodiment comprises the following steps:
1) Collect face images of various poses as the training set and test set. It must be ensured that for every input face image Ia of arbitrary pose, a non-synthesized frontal image Ig of the same face with a deflection angle of 0° can be found in the data set.
The images come from the Multi_Pie face data set, which contains more than 750,000 images covering 337 people under 20 illumination conditions and 15 poses. The illumination of the images changes from dark to bright as the illumination label runs from 01 to 20, with label 07 being the standard illumination condition. All face images in the data set are labeled Ia; for each image Ia, the image of the same person with a face deflection angle of 0° and illumination label 07 is found and labeled Ig. Before use, the data set was preprocessed with face detection and face cropping. The 13 poses with deflection angles within 90°, under all illumination conditions, are selected as the data set; the images of the first 200 people form the training set and the images of the remaining 137 people form the test set. All images are then normalized and resized: normalization divides every pixel value by 255.0 so that all pixels fall in [0, 1], and resizing uses bilinear interpolation to bring every image to 128×128×3.
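A minimal Python sketch of this preprocessing, assuming an OpenCV-style image array as input; the function name is illustrative, not part of the patent:

```python
import cv2
import numpy as np

def preprocess(image):
    """Resize a cropped face image to 128x128 with bilinear interpolation
    and normalize every pixel value into [0, 1], as described above."""
    resized = cv2.resize(image, (128, 128), interpolation=cv2.INTER_LINEAR)
    return resized.astype(np.float32) / 255.0
```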
2) In the training stage, input a face image Ia of arbitrary pose from the training set into the generator G to obtain the corrected code X2 and the synthesized frontal image If. Input the non-synthesized frontal image Ig into the generator G to obtain the frontal code X3.
The function of the generator G is to convert the input face image Ia into a synthesized frontal image If. It consists of a pose estimator P, an encoder En, an encoding residual network R and a decoder De. As shown in Figure 2, G synthesizes an image as follows: a face image Ia of arbitrary pose is input into G; the encoder En converts it into an initial code X1; the encoding residual network R estimates the encoding residual R(X1) from X1; the pose estimator P computes the precise yaw deflection angle γ of the face in the input image; γ is fed into a function Y to obtain the residual weight Y(γ); the initial code and the weighted residual are fused into the corrected code X2, where X2 = X1 + Y(γ)×R(X1); finally, X2 is input into the decoder De, which generates the frontal image If by deconvolution.
The pose estimator P computes a fairly precise yaw deflection angle γ of the input face, from which the weight of the encoding residual is derived. P estimates the face pose with the PnP algorithm, implemented directly with the cv2.solvePnP() function of the opencv library. The parameters this function needs include the 2D feature points of the face image and the 3D position corresponding to each feature point. The 2D feature points are obtained directly with the facial landmark detection algorithm provided by the open-source dlib library. The 3D positions are those of the landmarks on an average face model; they are fixed and are provided with the documentation of the 2D landmark detection algorithm in the dlib library.
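A sketch of this yaw estimation following the common six-landmark head-pose recipe; the particular 3D model coordinates, the approximate camera intrinsics, and the use of cv2.RQDecomp3x3() to read off the Euler angles are assumptions of this sketch, not details fixed by the patent:

```python
import cv2
import dlib
import numpy as np

# Nominal 3D positions of six landmarks on an average face model; these
# particular values follow a widely used OpenCV head-pose tutorial.
MODEL_3D = np.array([
    (0.0, 0.0, 0.0),           # nose tip,           dlib landmark 30
    (0.0, -330.0, -65.0),      # chin,               dlib landmark 8
    (-225.0, 170.0, -135.0),   # left eye corner,    dlib landmark 36
    (225.0, 170.0, -135.0),    # right eye corner,   dlib landmark 45
    (-150.0, -150.0, -125.0),  # left mouth corner,  dlib landmark 48
    (150.0, -150.0, -125.0),   # right mouth corner, dlib landmark 54
], dtype=np.float64)
LANDMARK_IDS = [30, 8, 36, 45, 48, 54]

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def estimate_yaw(gray):
    """Return the yaw angle (degrees) of the first face found in a grayscale image."""
    rect = detector(gray, 1)[0]
    shape = predictor(gray, rect)
    pts_2d = np.array([(shape.part(i).x, shape.part(i).y) for i in LANDMARK_IDS],
                      dtype=np.float64)
    h, w = gray.shape[:2]
    # Approximate intrinsics: focal length ~ image width, principal point at center.
    cam = np.array([[w, 0, w / 2.0],
                    [0, w, h / 2.0],
                    [0, 0, 1.0]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(MODEL_3D, pts_2d, cam, np.zeros(4))
    rmat, _ = cv2.Rodrigues(rvec)
    angles = cv2.RQDecomp3x3(rmat)[0]  # (pitch, yaw, roll) in degrees
    return angles[1]
```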
The encoder En is a convolutional neural network whose structure is shown in Figure 3; its function is to convert the input face Ia into the initial code X1. The input image Ia has dimensions 128×128×3, the activation function of the last encoder layer is maxout, and the output initial code X1 has dimensionality 256.
The encoding residual network R is a two-layer fully connected neural network with 256 neurons in each layer. Let X3 be the code that the encoder produces for the non-synthesized frontal image Ig; the function of R is to estimate the encoding residual R(X1) between the initial code X1 and the frontal code X3. The residual is multiplied by its weight and fused with the initial code X1 to give the corrected code X2, where X2 = X1 + Y(γ)×R(X1).
The decoder De is a deconvolutional neural network whose structure is shown in Figure 4. Its function is to decode the corrected code X2 and obtain the synthesized frontal image If through deconvolution. For current face frontalization methods based on generative adversarial networks, the smaller the deflection angle of the input face, the higher the quality of the synthesized image and the more identity information it can retain. Since the corrected code X2 is closer to the frontal code X3 than the initial code X1 is, feeding X2 instead of X1 into the decoder yields higher-quality face images.
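A minimal sketch of this forward pass, assuming the En/R/De networks (whose layer configurations are given in Figures 3-4) and the weighting function Y, whose exact form the document does not specify, are available as callables:

```python
import tensorflow as tf

def generator_forward(face, gamma, encoder, residual_net, decoder, weight_fn):
    """Forward pass of the generator described above:
    X2 = X1 + Y(gamma) * R(X1), then decode X2 into a frontal image."""
    x1 = encoder(face)              # initial code X1, shape (batch, 256)
    r = residual_net(x1)            # estimated encoding residual R(X1)
    x2 = x1 + weight_fn(gamma) * r  # corrected code X2
    i_f = decoder(x2)               # synthesized frontal image If
    return i_f, x2
```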
3) Input the synthesized frontal image If or the non-synthesized frontal image Ig into the discriminative network D, which judges whether the input face image is synthesized or not. Then input If or Ig into the face identity feature extractor F to obtain the identity features of the input face image.
The discriminative network D is a binary classifier based on a convolutional neural network; its structure is shown in Figure 5. Its function is to judge whether the input image comes from the generator G or from the original image data. D finally outputs a single value representing the likelihood that the input image comes from G: the larger the value, the more likely the input image is generated.
The face identity feature extractor F uses the open-source Light-CNN-29, a lightweight convolutional neural network 29 layers deep with about 12 million parameters. It extracts the identity features of the input face image; the final extracted identity feature has dimensionality 256.
4) Feed the discrimination result of step 3), the extracted identity features, the synthesized frontal image If, the non-synthesized frontal image Ig, the corrected code X2 and the frontal code X3 into the pre-designed loss functions, and train the generator G and the discriminative network D alternately until training ends.
In addition to the pixel loss, identity loss, symmetry loss and adversarial loss commonly used by methods of this type, the loss function for training the generator includes a newly devised encoding loss.
The first is the pixel loss, which measures the pixel-level gap between the synthesized frontal image If and the non-synthesized frontal image Ig:
Lpixel = (1/(W×H)) Σ(x=1..W) Σ(y=1..H) |If(x,y) − Ig(x,y)|
where W and H are the width and height of the image, and If(x,y) and Ig(x,y) are the pixel values of If and Ig at coordinates (x, y).
Next is the symmetry loss. Since human faces are symmetric, the synthesized frontal image If should be as close as possible to its horizontally flipped version Isym:
Lsym = (1/(W×H)) Σ(x=1..W) Σ(y=1..H) |If(x,y) − Isym(x,y)|
where W and H are the width and height of the image, and If(x,y) and Isym(x,y) are the pixel values of If and of its horizontally flipped image Isym at coordinates (x, y).
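A sketch of these two L1 terms in TensorFlow, assuming images as float tensors in NHWC layout (reduce_mean also averages over the batch and channels, which only changes the constant factor):

```python
import tensorflow as tf

def pixel_loss(i_f, i_g):
    """Mean absolute pixel difference between the synthesized frontal
    image If and the ground-truth frontal image Ig."""
    return tf.reduce_mean(tf.abs(i_f - i_g))

def symmetry_loss(i_f):
    """Mean absolute difference between If and its horizontal mirror Isym."""
    return tf.reduce_mean(tf.abs(i_f - tf.image.flip_left_right(i_f)))
```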
Then comes the identity loss. The face identity feature extractor F efficiently computes the identity features of a frontal image: feeding If and Ig into F gives their identity features F(If) and F(Ig). To ensure that the synthesized frontal image If contains the identity information of the non-synthesized frontal image Ig, the Manhattan distance between the identity feature maps of the last two layers of F must be minimized:
Lid = Σ(i=1..2) (1/(Wi×Hi)) Σ(x=1..Wi) Σ(y=1..Hi) |F(If)i(x,y) − F(Ig)i(x,y)|
where Wi and Hi are the width and height of the identity feature map of the i-th layer from the end, F is the face identity feature extractor, and F(If)i(x,y) and F(Ig)i(x,y) are the values at coordinates (x, y) of the identity feature maps of If and Ig in the i-th layer from the end.
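A sketch under the assumption that the extractor is wrapped to return the feature maps of its last two layers as a list:

```python
import tensorflow as tf

def identity_loss(extractor, i_f, i_g):
    """Sum of mean absolute differences between the identity feature maps
    of If and Ig taken from the last two layers of the extractor F
    (Light-CNN-29 in this embodiment)."""
    feats_f = extractor(i_f)  # e.g. [penultimate feature map, final 256-d feature]
    feats_g = extractor(i_g)
    return tf.add_n([tf.reduce_mean(tf.abs(a - b))
                     for a, b in zip(feats_f, feats_g)])
```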
Finally there is the adversarial loss, whose goal is to let the synthesized images confuse the discriminative network D, making them closer to real images and more photorealistic:
Ladv = (1/N) Σ(n=1..N) log D(G(Ia(n)))
where N is the size of the current training batch, G and D are the generator and the discriminative network, and Ia and G(Ia) are the input face image and the frontal image synthesized by the generator. The value of D(G(Ia)) reflects the likelihood that the synthesized image G(Ia) is judged synthetic by the discriminative network D. Minimizing Ladv pushes the synthesized frontal images G(Ia) to pass the discriminative network's test, improving their realism.
The first goal of the encoding loss is to bring the codes that the encoder En produces for the input face image Ia and for the non-synthesized frontal image Ig closer together: current face frontalization methods work better the smaller the input deflection angle, so the closer the code of the input face image is to the code of the same face at a 0° deflection angle, the better the synthesized frontal face. The first part of the encoding loss is therefore:
Lcode1 = Σ(i=1..N) |En(Ia)i + R(En(Ia))i − En(Ig)i|
where N is the dimensionality of each code; En is the encoder and En(Ia)i is the value of the i-th dimension of the initial code En(Ia); R is the encoding residual network and R(En(Ia))i is the value of the i-th dimension of the encoding residual R(En(Ia)); En(Ia)i + R(En(Ia))i equals the i-th dimension of the corrected code; and En(Ig)i is the i-th dimension of the frontal code X3 obtained by passing the non-synthesized frontal image through the encoder. The first part of the encoding loss is thus the Manhattan distance between the corrected code and the frontal code.
The second goal of the encoding loss is to keep the corrected codes of different people distinguishable. A fully connected layer C is built on top of the corrected code En(Ia) + R(En(Ia)); the number of neurons in C equals the number of people M in the training set, and the second part of the encoding loss is a cross-entropy loss:
Lcode2 = −Σ(i=1..M) yi·log C(En(Ia) + R(En(Ia)))i
where M is the number of people in the training set; yi is the i-th dimension of the one-hot vector y, which indicates which person in the training set the input face image Ia belongs to (if Ia belongs to the j-th person, the j-th dimension of y is 1, all other dimensions are 0, and y has dimensionality M); En(Ia) + R(En(Ia)) is the corrected code; and C(En(Ia) + R(En(Ia)))i is the i-th dimension of the feature vector obtained by passing the corrected code through the fully connected layer C.
The complete encoding loss is therefore:
Lcode = Lcode1 + λ·Lcode2
where λ is a constant weight with value 0.1.
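A sketch of the combined encoding loss, assuming the fully connected layer C outputs pre-softmax logits of size M (whether C applies a softmax internally is not stated in the document):

```python
import tensorflow as tf

def encoding_loss(x2, x3, logits, y_onehot, lam=0.1):
    """Lcode = Lcode1 + 0.1 * Lcode2: the Manhattan distance between the
    corrected code X2 and the frontal code X3, plus the cross-entropy of
    layer C's output against the one-hot identity label y."""
    l_code1 = tf.reduce_sum(tf.abs(x2 - x3), axis=-1)
    l_code2 = tf.nn.softmax_cross_entropy_with_logits(labels=y_onehot,
                                                      logits=logits)
    return tf.reduce_mean(l_code1 + lam * l_code2)
```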
In summary, the loss function for training the generator is:
Ltotal = Lpixel + λ1·Lsym + λ2·Lid + λ3·Ladv + λ4·Lcode
where Lpixel, Lsym, Lid, Ladv and Lcode are the pixel loss, symmetry loss, identity loss, adversarial loss and encoding loss, and λ1, λ2, λ3 and λ4 are the weights of the different loss terms.
Following the parameter settings of comparable methods and extensive experimental experience, the weights λ1, λ2, λ3 and λ4 are set to 0.2, 0.003, 0.001 and 0.002 respectively. While training the generator, the discriminative network D must also be trained; the goal is to let D distinguish whether an input frontal image comes from the generator G or from the original data set. The loss function for training the discriminative network is:
Ladv2 = −(1/N) Σ(n=1..N) [log D(G(Ia(n))) + log(1 − D(Ig(n)))]
where N is the size of the current training batch, G and D are the generator and the discriminative network, and Ia is the input image. G(Ia) and Ig are the synthesized and non-synthesized frontal images. The values of D(G(Ia)) and D(Ig) reflect the likelihood that the synthesized frontal image G(Ia) and the non-synthesized frontal image Ig, respectively, are judged synthetic by the discriminative network D. Minimizing Ladv2 lets the discriminative network accurately reflect how likely an input image is to be synthetic.
Training the generator G and the discriminative network D alternately lets the two optimize and improve each other through their adversarial game. In the initial stage, the face images generated by G are blurred and D can easily judge where an input image comes from, which pushes G to generate sharper images and improves its quality. In later stages, the images generated by G are sharper and close to the original image data, which pushes D to make more precise judgments and improves its discriminative ability.
After the loss functions of the generator G and the discriminative network D are designed, the parameters of the generative adversarial network are optimized with the Adam method, with the learning rate set to 0.0002 and the batch size set to 12. The discriminative network D is trained once after every training step of the generator G. As training proceeds, the quality of the images produced by the generator keeps improving and the discriminative network's ability to judge input images keeps strengthening, until training completes. The deep learning framework used in this experiment is Tensorflow, the graphics card is a 1080ti, and training stops after 20,000 batches.
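A sketch of one alternating update in TensorFlow 2 with the settings above, reusing the component and loss sketches defined earlier; the interface of `gen` (returning (If, X2) and exposing an `encode` method for the frontal code X3), the `classifier` standing in for layer C, and the small epsilon inside the logs are assumptions of this sketch:

```python
import tensorflow as tf

opt_g = tf.keras.optimizers.Adam(learning_rate=2e-4)
opt_d = tf.keras.optimizers.Adam(learning_rate=2e-4)
EPS = 1e-8  # numerical guard inside the logs

def train_step(gen, disc, extractor, classifier, i_a, gamma, i_g, y):
    """One alternating update: generator first, then discriminator."""
    with tf.GradientTape() as tape:
        i_f, x2 = gen(i_a, gamma)
        # disc outputs the likelihood that its input is synthetic.
        l_adv = tf.reduce_mean(tf.math.log(disc(i_f) + EPS))
        l_total = (pixel_loss(i_f, i_g)
                   + 0.2 * symmetry_loss(i_f)
                   + 0.003 * identity_loss(extractor, i_f, i_g)
                   + 0.001 * l_adv
                   + 0.002 * encoding_loss(x2, gen.encode(i_g),
                                           classifier(x2), y))
    grads = tape.gradient(l_total, gen.trainable_variables)
    opt_g.apply_gradients(zip(grads, gen.trainable_variables))

    with tf.GradientTape() as tape:
        i_f, _ = gen(i_a, gamma)
        l_d = -tf.reduce_mean(tf.math.log(disc(i_f) + EPS)
                              + tf.math.log(1.0 - disc(i_g) + EPS))
    grads = tape.gradient(l_d, disc.trainable_variables)
    opt_d.apply_gradients(zip(grads, disc.trainable_variables))
```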
5) In the test stage, for an input face image Ia of arbitrary pose, the trained generator G can synthesize a frontal image If of the same face; the effect of the invention can then be verified by directly observing the quality of If. Figure 6 shows generated results: in each row, the first image is an input face with a deflection angle exceeding 45°, the second is the frontal image synthesized by the generator, and the third is the non-synthesized frontal image of the same person from the data set. As the figure shows, the invention can synthesize frontal images for faces deflected by more than 45° while preserving the identity information of the original face.
The embodiment described above is only a preferred embodiment of the present invention and does not limit its scope of implementation; any change made according to the shape and principle of the present invention shall fall within its scope of protection.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910806159.5A CN110543846B (en) | 2019-08-29 | 2019-08-29 | Multi-pose face image obverse method based on generation countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910806159.5A CN110543846B (en) | 2019-08-29 | 2019-08-29 | Multi-pose face image obverse method based on generation countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110543846A true CN110543846A (en) | 2019-12-06 |
CN110543846B CN110543846B (en) | 2021-12-17 |
Family
ID=68710718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910806159.5A Expired - Fee Related CN110543846B (en) | 2019-08-29 | 2019-08-29 | Multi-pose face image obverse method based on generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110543846B (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144240A (en) * | 2019-12-12 | 2020-05-12 | 深圳数联天下智能科技有限公司 | Image processing method and related equipment |
CN111428667A (en) * | 2020-03-31 | 2020-07-17 | 天津中科智能识别产业技术研究院有限公司 | Human face image correcting method for generating confrontation network based on decoupling expression learning |
CN111783603A (en) * | 2020-06-24 | 2020-10-16 | 有半岛(北京)信息科技有限公司 | Generative confrontation network training method, image face swapping, video face swapping method and device |
CN111856962A (en) * | 2020-08-13 | 2020-10-30 | 郑州智利信信息技术有限公司 | Intelligent home control system based on cloud computing |
CN111931484A (en) * | 2020-07-31 | 2020-11-13 | 于梦丽 | Data transmission method based on big data |
CN111985995A (en) * | 2020-08-14 | 2020-11-24 | 足购科技(杭州)有限公司 | WeChat applet-based shoe virtual fitting method and device |
CN112164002A (en) * | 2020-09-10 | 2021-01-01 | 深圳前海微众银行股份有限公司 | Training method, device, electronic device and storage medium for face correction model |
CN112818850A (en) * | 2021-02-01 | 2021-05-18 | 华南理工大学 | Cross-posture face recognition method based on progressive neural network and attention mechanism |
CN113140015A (en) * | 2021-04-13 | 2021-07-20 | 杭州欣禾圣世科技有限公司 | Multi-view face synthesis method and system based on generation countermeasure network |
CN113361489A (en) * | 2021-07-09 | 2021-09-07 | 重庆理工大学 | Decoupling representation-based face orthogonalization model construction method and training method |
CN113469269A (en) * | 2021-07-16 | 2021-10-01 | 上海电力大学 | Residual convolution self-coding wind-solar-charged scene generation method based on multi-channel fusion |
CN113553895A (en) * | 2021-03-19 | 2021-10-26 | 武汉大学深圳研究院 | Multi-pose face recognition method based on face orthogonalization |
CN114049250A (en) * | 2022-01-13 | 2022-02-15 | 广州卓腾科技有限公司 | Method, device and medium for correcting face pose of certificate photo |
CN114120393A (en) * | 2021-10-22 | 2022-03-01 | 广西中科曙光云计算有限公司 | Pedestrian road violation detection method and device based on countermeasure network principle |
WO2022087941A1 (en) * | 2020-10-29 | 2022-05-05 | 京东方科技集团股份有限公司 | Face reconstruction model training method and apparatus, face reconstruction method and apparatus, and electronic device and readable storage medium |
CN114693788A (en) * | 2022-03-24 | 2022-07-01 | 北京工业大学 | Front human body image generation method based on visual angle transformation |
CN115147904A (en) * | 2022-07-15 | 2022-10-04 | 北京建筑大学 | Face recognition method, device, processing device and storage medium |
US11810397B2 (en) | 2020-08-18 | 2023-11-07 | Samsung Electronics Co., Ltd. | Method and apparatus with facial image generating |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239766A (en) * | 2017-06-08 | 2017-10-10 | 深圳市唯特视科技有限公司 | A kind of utilization resists network and the significantly face of three-dimensional configuration model ajusts method |
US20180268202A1 (en) * | 2017-03-15 | 2018-09-20 | Nec Laboratories America, Inc. | Video surveillance system based on larger pose face frontalization |
CN109815928A (en) * | 2019-01-31 | 2019-05-28 | 中国电子进出口有限公司 | A kind of face image synthesis method and apparatus based on confrontation study |
- 2019-08-29 CN CN201910806159.5A patent/CN110543846B/en not_active Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180268202A1 (en) * | 2017-03-15 | 2018-09-20 | Nec Laboratories America, Inc. | Video surveillance system based on larger pose face frontalization |
CN107239766A (en) * | 2017-06-08 | 2017-10-10 | 深圳市唯特视科技有限公司 | A kind of utilization resists network and the significantly face of three-dimensional configuration model ajusts method |
CN109815928A (en) * | 2019-01-31 | 2019-05-28 | 中国电子进出口有限公司 | A kind of face image synthesis method and apparatus based on confrontation study |
Non-Patent Citations (3)
Title |
---|
JIAXIN MA ET AL.: "Multi-poses Face Frontalization based on Pose Weighted GAN", 《2019 IEEE 3RD INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC)》 * |
SUFANG ZHANG ET AL.: "Pose-Weighted Gan for Photorealistic Face Frontalization", 《2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)》 * |
钱一琛: "基于生成对抗的人脸正面化生成", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144240A (en) * | 2019-12-12 | 2020-05-12 | 深圳数联天下智能科技有限公司 | Image processing method and related equipment |
CN111428667A (en) * | 2020-03-31 | 2020-07-17 | 天津中科智能识别产业技术研究院有限公司 | Human face image correcting method for generating confrontation network based on decoupling expression learning |
CN111783603A (en) * | 2020-06-24 | 2020-10-16 | 有半岛(北京)信息科技有限公司 | Generative confrontation network training method, image face swapping, video face swapping method and device |
CN111931484B (en) * | 2020-07-31 | 2022-02-25 | 贵州多彩宝互联网服务有限公司 | Data transmission method based on big data |
CN111931484A (en) * | 2020-07-31 | 2020-11-13 | 于梦丽 | Data transmission method based on big data |
CN111856962A (en) * | 2020-08-13 | 2020-10-30 | 郑州智利信信息技术有限公司 | Intelligent home control system based on cloud computing |
CN111985995A (en) * | 2020-08-14 | 2020-11-24 | 足购科技(杭州)有限公司 | WeChat applet-based shoe virtual fitting method and device |
US11810397B2 (en) | 2020-08-18 | 2023-11-07 | Samsung Electronics Co., Ltd. | Method and apparatus with facial image generating |
WO2022052530A1 (en) * | 2020-09-10 | 2022-03-17 | 深圳前海微众银行股份有限公司 | Method and apparatus for training face correction model, electronic device, and storage medium |
CN112164002A (en) * | 2020-09-10 | 2021-01-01 | 深圳前海微众银行股份有限公司 | Training method, device, electronic device and storage medium for face correction model |
CN112164002B (en) * | 2020-09-10 | 2024-02-09 | 深圳前海微众银行股份有限公司 | Training method and device of face correction model, electronic equipment and storage medium |
WO2022087941A1 (en) * | 2020-10-29 | 2022-05-05 | 京东方科技集团股份有限公司 | Face reconstruction model training method and apparatus, face reconstruction method and apparatus, and electronic device and readable storage medium |
CN112818850A (en) * | 2021-02-01 | 2021-05-18 | 华南理工大学 | Cross-posture face recognition method based on progressive neural network and attention mechanism |
CN113553895A (en) * | 2021-03-19 | 2021-10-26 | 武汉大学深圳研究院 | Multi-pose face recognition method based on face orthogonalization |
CN113140015A (en) * | 2021-04-13 | 2021-07-20 | 杭州欣禾圣世科技有限公司 | Multi-view face synthesis method and system based on generation countermeasure network |
CN113361489A (en) * | 2021-07-09 | 2021-09-07 | 重庆理工大学 | Decoupling representation-based face orthogonalization model construction method and training method |
CN113469269A (en) * | 2021-07-16 | 2021-10-01 | 上海电力大学 | Residual convolution self-coding wind-solar-charged scene generation method based on multi-channel fusion |
CN114120393A (en) * | 2021-10-22 | 2022-03-01 | 广西中科曙光云计算有限公司 | Pedestrian road violation detection method and device based on countermeasure network principle |
CN114049250A (en) * | 2022-01-13 | 2022-02-15 | 广州卓腾科技有限公司 | Method, device and medium for correcting face pose of certificate photo |
CN114693788A (en) * | 2022-03-24 | 2022-07-01 | 北京工业大学 | Front human body image generation method based on visual angle transformation |
CN115147904A (en) * | 2022-07-15 | 2022-10-04 | 北京建筑大学 | Face recognition method, device, processing device and storage medium |
CN115147904B (en) * | 2022-07-15 | 2025-02-07 | 北京建筑大学 | Face recognition method, device, processing equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110543846B (en) | 2021-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110543846A (en) | A method of frontalizing multi-pose face images based on generative adversarial networks | |
Wang et al. | Action recognition based on joint trajectory maps with convolutional neural networks | |
CN112418095B (en) | A method and system for facial expression recognition combined with attention mechanism | |
CN112418074B (en) | Coupled posture face recognition method based on self-attention | |
CN108537743B (en) | A Facial Image Enhancement Method Based on Generative Adversarial Networks | |
CN109543606B (en) | A face recognition method with attention mechanism | |
CN111652827B (en) | Front face synthesis method and system based on generation countermeasure network | |
CN107368831B (en) | English words and digit recognition method in a kind of natural scene image | |
CN108416266B (en) | A Fast Video Behavior Recognition Method Using Optical Flow to Extract Moving Objects | |
CN103824089B (en) | Cascade regression-based face 3D pose recognition method | |
CN104850825B (en) | A kind of facial image face value calculating method based on convolutional neural networks | |
Lee et al. | Accurate and robust face recognition from RGB-D images with a deep learning approach. | |
CN110363116B (en) | Irregular face correction method, system and medium based on GLD-GAN | |
CN104700076B (en) | Facial image virtual sample generation method | |
CN110738161A (en) | A face image correction method based on improved generative adversarial network | |
CN107423678A (en) | A kind of training method and face identification method of the convolutional neural networks for extracting feature | |
CN106951840A (en) | A Face Feature Point Detection Method | |
CN108334848A (en) | A kind of small face identification method based on generation confrontation network | |
CN113963032A (en) | A Siamese Network Structure Target Tracking Method Fusion Target Re-identification | |
CN113343878A (en) | High-fidelity face privacy protection method and system based on generation countermeasure network | |
CN106951826B (en) | Face detection method and device | |
CN109360179A (en) | Image fusion method, device and readable storage medium | |
CN107886558A (en) | A kind of human face expression cartoon driving method based on RealSense | |
CN110781962A (en) | Target detection method based on lightweight convolutional neural network | |
CN114862716B (en) | Image enhancement method, device, equipment and storage medium for face image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | |
Granted publication date: 20211217 |