CN112949565A

CN112949565A - Single-sample partially-shielded face recognition method and system based on attention mechanism

Info

Publication number: CN112949565A
Application number: CN202110320104.0A
Authority: CN
Inventors: 钟福金; 侯梦军; 王润生
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Jiangsu Chunshumuyun Software Technology Co ltd
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-06-11
Anticipated expiration: 2041-03-25
Also published as: CN112949565B

Abstract

The invention belongs to the field of single-sample partially occluded face recognition, and in particular relates to a single-sample partially occluded face recognition method and system based on an attention mechanism. The method comprises: acquiring a partially occluded test face image and a gallery set of unoccluded single-sample face images, and preprocessing them; inputting the preprocessed data into a differential network composed of ResNet-34 branches and extracting a shallow feature map through a convolutional layer; using a spatial attention module to adjust the weights of the shallow feature map according to spatial position information, and multiplying those weights with the feature map output by the last convolutional layer to highlight its local detail features; subtracting the feature maps of the occluded image and the unoccluded image after their local detail features have been highlighted; taking the absolute value of the difference and using it, through a channel attention module, to recalibrate the original feature maps channel by channel; and sending the calibrated feature map to a fully connected layer to output the classification result, thereby completing identification.

Description

A single-sample partial occlusion face recognition method and system based on attention mechanism

Technical Field

The invention belongs to the field of single-sample partially occluded face recognition, and in particular relates to a single-sample partially occluded face recognition method and system based on an attention mechanism.

Background

Face recognition is a biometric technology that identifies people from their facial feature information. Images or video containing faces are captured with a camera, and the captured faces undergo a series of processing steps in order to recognize the identities of different people. After years of research, face recognition achieves good results under controlled conditions, but face recognition under unconstrained conditions still faces many challenges. In some special application scenarios, such as ID card management systems, criminal investigation and law enforcement systems, passport verification, and identity checks at registration checkpoints, only one face image per person (the ID photo) can be collected as a training sample. Test samples, meanwhile, may be affected by drastic facial variations such as changes in illumination, pose, and expression, and by occlusion from external objects. These effects can make the intra-class differences of facial features larger than the inter-class differences. This gives rise to the single-sample partially occluded face recognition problem: designing a face recognition method that uses only a single sample per person to identify unknown face images.

In real life, face occlusion falls into three categories: 1) occlusion by external objects (such as sunglasses, scarves, and hats); 2) extreme lighting (such as shadows); 3) self-occlusion caused by pose changes (such as a profile view). The occlusion discussed below is mainly occlusion by external objects.

Although occlusion has been studied extensively in face recognition, some problems remain: 1) Current methods still cannot completely remove the influence of occlusion; if it could be removed completely, recognition results would be better. 2) When faced with the single-sample problem, some mature face recognition algorithms cannot extract intra-class variation information from a single training sample, so their recognition performance degrades considerably.

Most current research focuses on improving the accuracy of recognition systems and overlooks problems with the face database itself. For example, because samples are hard to collect or system storage is limited, the database may hold only one sample image per person. In this case most traditional methods, such as PCA and LDA, suffer degraded performance or fail to work at all. With a deep learning approach, a face database that stores only one image per person lacks the large number of samples needed to learn rich intra-class variation information, and under this premise occlusion is not handled well. In summary, single-sample partially occluded face recognition is unavoidable in real application scenarios, and face recognition under the single-sample constraint still faces great challenges.

Summary of the Invention

In view of the problems noted above, namely the lack of a large number of samples from which to learn rich intra-class variation information and the loss of facial feature information caused by occlusion, which reduces recognition accuracy, the present invention proposes a single-sample partially occluded face recognition method and system based on an attention mechanism. The method comprises:

inputting a set of face image pairs with class labels as the source domain dataset, each pair comprising a clean frontal face and a partially occluded face of the same identity, and preprocessing the face image pair dataset;

inputting the preprocessed face images into a differential network composed of two ResNet-34 branches, and extracting a shallow feature map through a convolutional layer;

inputting the shallow feature map into a residual network composed of four residual module groups cascaded in sequence, and extracting the global features of the face image;

embedding a spatial attention module between the second-layer and fourth-layer residual modules to adjust the weights of the pixels in the unoccluded regions of the shallow feature map, and outputting a spatial position weight feature map;

multiplying the spatial position weight feature map with the feature map output by the fourth-layer residual module, so that local detail features from the lower layer are obtained through cross-layer information fusion;

taking the absolute value of the difference between the feature maps of the occluded image and the unoccluded image, after their local detail features have been highlighted, as the input of the channel attention module;

calibrating, in the channel attention module and according to the input absolute value, the feature maps of the occluded and unoccluded images channel by channel, and sending the calibrated feature maps to the fully connected layer to output the classification result;

iteratively training the network by jointly optimizing the cross-entropy loss function and the contrastive loss arising from the difference between the two images of each face pair;

ending the iterative training once the network loss has stabilized after multiple rounds of training, which yields the trained network model;

inputting the single-sample face gallery set of the target domain and the partially occluded test face images into the trained network model, which computes the cosine distance between face image features and outputs the class to which each occluded face belongs.
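The final matching step can be illustrated with a minimal sketch. It assumes each gallery identity is represented by one feature vector already produced by the trained network; the function names, labels, and feature values are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(probe_feature, gallery):
    """Return the gallery label whose feature has the smallest cosine distance.

    `gallery` maps an identity label to the single feature vector stored for it.
    """
    # Cosine distance is 1 - cosine similarity.
    return min(
        gallery,
        key=lambda label: 1.0 - cosine_similarity(probe_feature, gallery[label]),
    )

# Toy features (hypothetical values, not taken from the patent).
gallery = {"id_A": [1.0, 0.0, 0.2], "id_B": [0.1, 1.0, 0.0]}
probe = [0.9, 0.1, 0.3]
print(identify(probe, gallery))  # id_A
```

Because every identity has exactly one gallery sample, recognition reduces to a nearest-neighbour search over the gallery features.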

Further, preprocessing the face image pair dataset includes cropping each face image to 128×128 and normalizing the pixels of the cropped image, expressed as:

X_pix = (X_pix - 128)/128;

where X_pix is the pixel value of the face image.
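As a small worked example of the normalization formula, the sketch assumes 8-bit pixel values in [0, 255] stored as nested lists; the function name is hypothetical.

```python
def normalize_pixels(image):
    """Apply X_pix = (X_pix - 128) / 128 to every pixel of a 2-D image."""
    return [[(p - 128) / 128 for p in row] for row in image]

tiny = [[0, 128], [255, 64]]
print(normalize_pixels(tiny))  # [[-1.0, 0.0], [0.9921875, -0.5]]
```

Under this assumption the normalized values fall in the range [-1, 1), centred on zero.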

Further, extracting the shallow feature map through a convolutional layer includes: inputting the 3-channel face image into a convolutional layer with a 3×3 kernel, 64 channels, and a stride of 1 for feature extraction, outputting a 128×128 feature map with 64 channels, and passing that feature map through a max pooling layer to obtain the shallow feature map of the face image.

Further, among the four residual module groups cascaded in sequence, the groups contain 3, 4, 6, and 3 residual modules respectively, in cascade order, and the feature maps they output are 64×64, 32×32, 16×16, and 8×8 in size.
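The stated feature map sizes can be checked with the usual output-size formula for convolution and pooling. The padding of 1, the 3×3 stride-2 max pooling, and the per-group strides (1, 2, 2, 2) are assumptions borrowed from the standard ResNet-34 layout; the text itself states only the kernel size, the stride of the first convolution, and the resulting sizes.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 128
size = conv_out(size, kernel=3, stride=1, padding=1)  # first convolution (padding assumed)
stem = size
size = conv_out(size, kernel=3, stride=2, padding=1)  # max pooling (3x3, stride 2 assumed)

group_sizes = []
for blocks, stride in zip([3, 4, 6, 3], [1, 2, 2, 2]):
    for b in range(blocks):
        # Only the first module of a group changes the resolution (assumed).
        size = conv_out(size, kernel=3, stride=stride if b == 0 else 1, padding=1)
    group_sizes.append(size)

print(stem, group_sizes)  # 128 [64, 32, 16, 8]
```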

Further, the process of obtaining local detail features from the lower layer includes:

embedding a spatial attention module between the second-layer residual module and the fourth-layer residual module;

applying global average pooling and global max pooling to the input h′×w′×c′ three-dimensional tensor to obtain a weight vector feature map;

processing the weight vector feature map with a convolutional layer having a 7×7 kernel, a padding of 3, and 1 output channel, followed by a Sigmoid nonlinear activation layer;

upsampling the 8×8 feature map output by the fourth-layer residual module group to a 32×32 feature map by bilinear interpolation;

multiplying the upsampled feature map by the processed weight vector feature map, and then downsampling to output an 8×8 feature map, thereby obtaining local detail features from the lower layer.
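The spatial weight map in the steps above can be sketched in plain Python. The sketch assumes the pooling runs across the channel axis at each spatial position, that the convolution has no bias, and that features are stored as channel-first nested lists; the function name and kernel values are hypothetical.

```python
import math

def spatial_attention(features, kernel):
    """features: c x h x w nested lists; kernel: 2 x 7 x 7 weights.

    Returns an h x w weight map with values in (0, 1).
    """
    c, h, w = len(features), len(features[0]), len(features[0][0])
    # Pool across the channel axis at every spatial position.
    avg_map = [[sum(features[k][i][j] for k in range(c)) / c for j in range(w)]
               for i in range(h)]
    max_map = [[max(features[k][i][j] for k in range(c)) for j in range(w)]
               for i in range(h)]
    pooled = [avg_map, max_map]  # 2 x h x w
    pad = len(kernel[0]) // 2    # 3 for a 7x7 kernel
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for ch in range(2):
                for di in range(-pad, pad + 1):
                    for dj in range(-pad, pad + 1):
                        y, x = i + di, j + dj
                        if 0 <= y < h and 0 <= x < w:  # zero padding outside
                            acc += kernel[ch][di + pad][dj + pad] * pooled[ch][y][x]
            out[i][j] = 1.0 / (1.0 + math.exp(-acc))  # Sigmoid
    return out

# Toy input: 4 channels of 8x8 values and a constant kernel (hypothetical).
features = [[[float((i + j + k) % 5) for j in range(8)] for i in range(8)]
            for k in range(4)]
kernel = [[[0.01] * 7 for _ in range(7)] for _ in range(2)]
weights = spatial_attention(features, kernel)
print(len(weights), len(weights[0]))  # 8 8
```

The padding of 3 keeps the weight map the same size as the input, so it can be multiplied position by position with a feature map of matching resolution.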

Further, the channel-by-channel calibration that the channel attention module performs on the feature maps of the occluded and unoccluded images, according to the absolute value of the input feature map difference, includes:

subtracting the feature maps of the occluded and unoccluded images after their local detail features have been highlighted, and taking the absolute value of the difference as the input of the channel attention module;

applying global average pooling and global max pooling in the channel attention module to obtain two 1×1×C channel descriptors;

feeding the two channel descriptors separately into a shallow neural network whose first layer has C/r neurons with ReLU activation and whose second layer has C neurons;

adding the two outputs of the shallow neural network together and applying a Sigmoid activation function to obtain the weight coefficient of each channel;

applying the channel weights, once obtained, to the original features by channel-wise multiplication, which completes the recalibration of the original features in the channel dimension;

where C is the number of channels and r is the dimension reduction factor.
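The channel attention steps can be sketched as follows. The sketch assumes a bias-free two-layer network applied to both descriptors and channel-first nested lists; the function names, the toy sizes (C = 4, r = 2), and all weight values are hypothetical.

```python
import math

def channel_attention(diff, w1, w2):
    """diff: C x H x W absolute feature difference.

    w1 has shape (C/r) x C and w2 has shape C x (C/r). Returns C weights in (0, 1).
    """
    avg_desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in diff]
    max_desc = [max(map(max, ch)) for ch in diff]

    def mlp(v):
        hidden = [max(0.0, sum(wi * vi for wi, vi in zip(row, v))) for row in w1]  # ReLU
        return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w2]

    summed = [a + m for a, m in zip(mlp(avg_desc), mlp(max_desc))]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]  # Sigmoid

def recalibrate(features, weights):
    """Scale every channel of a C x H x W feature map by its weight."""
    return [[[weights[k] * v for v in row] for row in ch]
            for k, ch in enumerate(features)]

C, r = 4, 2
occluded = [[[float(k + i + j) for j in range(2)] for i in range(2)] for k in range(C)]
clean = [[[float(k) for j in range(2)] for i in range(2)] for k in range(C)]
diff = [[[abs(a - b) for a, b in zip(ro, rc)] for ro, rc in zip(co, cc)]
        for co, cc in zip(occluded, clean)]
w1 = [[0.1] * C for _ in range(C // r)]  # hypothetical weights
w2 = [[0.2] * (C // r) for _ in range(C)]
weights = channel_attention(diff, w1, w2)
recalibrated = recalibrate(occluded, weights)
print(len(weights), len(recalibrated))  # 4 4
```

Channels whose weight is pushed toward 0 contribute little to the final representation, which is how channels dominated by the occluded region can be suppressed.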

Further, the cross-entropy loss function is expressed as:

The quantities in this formula are: the cross-entropy loss function; the identity classification probability of sample yⁱ in the network; the occluded face image in the i-th face image pair; F, the fully connected layer after the last convolutional layer in the differential network; the feature of the partially occluded face after the channel attention mask operation; μ(·), the weight feature map output by the channel attention module, a mask with values in [0, 1]; f(·), the feature finally output by the convolutional layers; and n, the total number of samples in the training set of face images.

Further, the contrastive loss function is expressed as:

The quantities in this formula are: the contrastive loss between the two images caused by the occluded region; μ(·), the weight feature map output by the channel attention module, a mask with values in [0, 1]; the occluded face image in the i-th face image pair; xⁱ, the unoccluded face image in the i-th face image pair; f(·), the feature finally output by the convolutional layers; and n, the total number of samples in the training set of face images.
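A rough sketch of how the two loss terms might be combined is given here. Several choices are assumptions rather than statements of the patent's formulas: the contrastive term uses an L1 distance between the masked features of the two images, both terms are averaged over the n samples, and the terms are added with a weighting factor `lam`. All names and values are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true identity under a softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def contrastive_l1(mask, feat_occluded, feat_clean):
    """L1 distance between masked features of the pair (norm is an assumption)."""
    return sum(abs(m * a - m * b) for m, a, b in zip(mask, feat_occluded, feat_clean))

def joint_loss(batch, lam=1.0):
    """Average classification loss plus weighted average contrastive loss."""
    n = len(batch)
    cls = sum(cross_entropy(s["logits"], s["label"]) for s in batch) / n
    diff = sum(contrastive_l1(s["mask"], s["f_occ"], s["f_clean"]) for s in batch) / n
    return cls + lam * diff

sample = {
    "logits": [0.0, 0.0],    # output of the fully connected layer F (toy values)
    "label": 0,              # identity label
    "mask": [1.0, 0.0],      # channel attention mask with values in [0, 1]
    "f_occ": [1.0, 5.0],     # f(.) for the occluded image
    "f_clean": [0.5, 9.0],   # f(.) for the unoccluded image
}
print(joint_loss([sample], lam=2.0))
```

In this toy pair the second channel differs strongly between the two images but is masked out, so only the first channel contributes to the contrastive term.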

The present invention also proposes a single-sample partially occluded face recognition system based on an attention mechanism, comprising an image acquisition module, a data preprocessing module, a neural network module, and an output module, wherein:

the image acquisition module is used to input the dataset and acquire face image information or the face image to be tested;

the data preprocessing module is used to normalize the pixels of the face images and to perform data augmentation on the source domain dataset, expanding the source domain training set by randomly adding occluders;

the neural network module is used to build and train a differential neural network formed by two identical ResNet-34 branches with embedded attention mechanisms;

the output module is used to output the final identity class of the face image to be tested: the model trained on the source domain is transferred to the target domain dataset, the single-sample face gallery set and the partially occluded test face images are fed into the model, and the identity of each partially occluded test face image is determined.

Beneficial technical effects of the present invention:

(1) The present invention is fast and accurate, and can accurately determine the identity of any input face image with partial occlusion.

(2) The present invention proposes a novel feature extraction architecture that uses a differential network while also fusing high-level and low-level information. Connecting global information with local information fuses information across layers and strengthens the feature representation capability of the network, yielding a more discriminative representation and high-accuracy single-sample partially occluded face recognition.

(3) The present invention embeds spatial attention and channel attention modules in the differential network. The spatial attention module guides the model toward where the meaningful features are, that is, toward the features of unoccluded regions. The channel attention module models the correlations between channels to recalibrate the original features in the channel dimension, suppressing channels that respond strongly to occluded regions and overcoming the defects of existing single-sample partially occluded face recognition methods.

Description of the Drawings

FIG. 1 is a schematic diagram of the process of a single-sample partially occluded face recognition method based on an attention mechanism according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the spatial attention module according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the channel attention module according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the training flow according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a single-sample partially occluded face recognition network based on an attention mechanism according to an embodiment of the present invention;

FIG. 6 is a diagram of an application result of an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

The present invention proposes a single-sample partially occluded face recognition method based on an attention mechanism, which specifically includes the following steps:

In one embodiment, the present invention uses a trained model. The dataset used in the training phase is the CASIA_WebFace face dataset, which serves as the source domain and contains 494,414 face images of 10,575 people collected from the Internet; images of the same identity share the same numeric label. To address the drop in recognition accuracy caused by occluded regions, the CASIA_WebFace dataset is augmented by randomly adding black blocks or real occluders such as masks and leaves, and the model is trained on the augmented dataset. After training, the model is transferred to the target domain datasets. To keep the split between training and test sets fair, the target domain single-sample face gallery set and the partially occluded test face image set are divided in the same way in all comparative experiments of this embodiment. The target domain consists of three single-sample face datasets: AR, Extended Yale B, and CAS-PEAL-R1.
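The black-block form of this augmentation can be sketched as follows. The sketch assumes a grayscale image stored as nested lists and a rectangular block of fixed size at a uniformly random position; the block size, the function name, and the fixed random seed are hypothetical, and pasting real occluders such as masks or leaves is not shown.

```python
import random

def add_black_block(image, block_h, block_w, rng):
    """Return a copy of a 2-D image with one randomly placed black block."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - block_h + 1)
    left = rng.randrange(w - block_w + 1)
    occluded = [row[:] for row in image]  # leave the clean image untouched
    for i in range(top, top + block_h):
        for j in range(left, left + block_w):
            occluded[i][j] = 0
    return occluded

rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [[255] * 8 for _ in range(8)]
occluded = add_black_block(clean, 3, 3, rng)
pair = (clean, occluded)  # one training pair: clean face and occluded face
print(sum(row.count(0) for row in occluded))  # 9
```

Each augmented image is kept together with its clean original, which gives the network the image pairs it needs for the difference branch.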

The source domain and target domain datasets are preprocessed: all images are uniformly cropped to 128×128, and the pixels of the processed face images are normalized according to:

X_pix = (X_pix - 128)/128;

where X_pix is the pixel value of the input face image, specifically, the pixel value of the face image fed into the differential network.

The preprocessed sample images are fed into the neural network in turn, and the network is trained by minimizing the loss function through backpropagation. Compared with traditional single-sample partially occluded face recognition algorithms, the present invention adopts ResNet-34 to reduce model size and improve model accuracy. ResNet-34 adds a shortcut (skip connection) branch around the original convolutional layers to form the basic residual module. The original mapping H(x) is expressed as H(x) = F(x) + x, where F(x) is the residual mapping and x is the input signal. The residual module structure turns the convolutional layers' task of learning H(x) into learning F(x), which is simpler. This structure reduces the amount of computation and effectively addresses the degradation problem caused by excessive network depth.
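The relation H(x) = F(x) + x can be illustrated with a toy sketch in which `residual_fn` stands in for the convolutional layers that compute F; the function names and values are hypothetical.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

# If the layers learn F(x) = 0, the block passes its input through unchanged.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))  # [1.0, 2.0]
```

This is one way to see why learning F is the easier task: the identity mapping, which a deep stack of plain layers can struggle to reproduce, corresponds to simply driving F toward zero.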

The face image is input into the ResNet-34 network, and a convolutional layer performs shallow feature extraction to produce the input feature map for the subsequent residual network. Specifically, the 3-channel input first passes through a convolutional layer with a 3×3 kernel, 64 channels, and a stride of 1; the output feature map is 128×128 with 64 channels. It then passes through a max pooling layer, and the resulting feature map is the input to the subsequent residual modules.

After the shallow feature map is extracted, four groups of residual modules connected in sequence form a residual network, which serves as the global branch and extracts the global features of the face image.

It should be understood that the core improvements of the present invention lie in two parts: the attention mechanism and cross-layer information fusion. The attention mechanism is divided into a spatial attention mechanism and a channel attention mechanism, both embedded mainly in ResNet-34. So that a pair of face images can be input at the same time, ResNet-34 is modified to act as a differential network: a convolutional layer first extracts the shallow feature map, and after the max pooling layer four groups of residual modules connected in sequence form a residual network, which serves as the global branch and extracts the global features of the face image. In addition, a spatial attention module is embedded between the second and fourth layers of the residual network and the feature maps output by the two are connected, which is the cross-layer information fusion; the absolute value of the difference between the fourth-layer feature maps of the differential network is then used as the input of the channel attention module. Unless otherwise emphasized, the residual network in the present invention mainly refers to the differential network formed, on the basis of ResNet-34, by multiple groups of residual modules and attention mechanism modules. This division is made only to highlight the improvements of the present invention, and those skilled in the art can interpret it in light of the overall embodiments and the accompanying drawings.

Further, the differential network is composed of residual modules and attention mechanism modules.

The process of building the differential network includes the following steps:

The shallow feature map output by the first convolutional layer is input into two ResNet-34 network branches. Each branch consists of 4 groups of residual modules connected in series, whose input channel counts are 64, 128, 256, and 512 respectively. Each residual module in turn consists of convolution, Batch Normalization (BN), and Rectified Linear Unit (ReLU) operations, which together act on the mapping of global features; the corresponding output channel counts are 64, 128, 256, and 512;

A spatial attention module is embedded between the second-layer and fourth-layer residual modules of each branch of the differential network. The spatial attention module is inserted into the residual network to guide the model toward where the meaningful features are. Specifically, global average pooling and global max pooling are applied to the h′×w′×c′ three-dimensional tensor output by the convolutional layers of the second-layer residual module. Unlike ordinary spatial pooling, the pooling here operates along the channel dimension: at each spatial position all input channels are pooled into 2 real numbers, so an input of shape (h′×w′×c′) yields two (h′×w′×1) weight vectors. The two (h′×w′×1) weight vectors are then concatenated along the channel dimension into an (h′×w′×2) weight vector feature map, where h′ and w′ are the length and width of the input feature map and c′ is the number of channels;

A 7×7 convolution kernel with a padding of 3 is applied, compressing the number of channels to 1, followed by a Sigmoid nonlinear activation, which yields a new (h′×w′×1) weight vector feature map;

The 8×8 feature map output by the fourth-layer residual module is upsampled to 32×32 by bilinear interpolation and multiplied, at the channel level, by the (h′×w′×1) weight vector feature map; the result is downsampled to an 8×8 feature map to obtain the rescaled new features;

At the end of the network, the rescaled new feature maps of the two branches are subtracted, and the absolute value of the difference is used as the input of the channel attention module.

In one embodiment, the output of the second-layer residual module after its convolution operations is a 32×32×128 three-dimensional tensor, the three dimensions being the length, width, and number of channels of the input feature map. The tensor first passes through global max pooling and global average pooling along the channel dimension: at each spatial position the maximum and the mean over all channels are taken, so each pooling collapses the channels into a single channel while the length and width stay unchanged. The 32×32×128 input feature map thus becomes a 32×32×1 feature map after each pooling. The two (32×32×1) weight vectors are then concatenated along the channel dimension into a (32×32×2) weight vector feature map.

A 7×7 convolution kernel with a padding of 3 is applied, compressing the number of channels to 1, followed by a Sigmoid nonlinear activation, which yields a new (32×32×1) weight vector feature map;

The 8×8 feature map output by the fourth-layer residual module is upsampled to 32×32 by bilinear interpolation and multiplied, at the channel level, by the (32×32×1) weight vector feature map, which fuses in the features from the lower layer; the result is downsampled to an 8×8 feature map to obtain the rescaled new features;
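The upsample, multiply, and downsample sequence can be sketched for a single channel. Two details are assumptions, since the text does not specify them: the bilinear interpolation uses the half-pixel convention, and the downsampling is 4×4 average pooling. The function names and toy values are hypothetical.

```python
def bilinear_upsample(grid, factor):
    """Bilinearly upsample a 2-D grid by an integer factor (half-pixel convention)."""
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(h * factor):
        # Map the output pixel centre back to input coordinates and clamp.
        y = min(max((i + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(w * factor):
            x = min(max((j + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def average_pool(grid, factor):
    """Downsample by averaging non-overlapping factor x factor windows (assumed)."""
    h, w = len(grid) // factor, len(grid[0]) // factor
    return [[sum(grid[i * factor + a][j * factor + b]
                 for a in range(factor) for b in range(factor)) / factor ** 2
             for j in range(w)] for i in range(h)]

def fuse(deep_channel, weight_map, factor=4):
    """Upsample one 8x8 deep channel, weight it with the 32x32 map, pool back to 8x8."""
    up = bilinear_upsample(deep_channel, factor)
    scaled = [[u * wgt for u, wgt in zip(ur, wr)] for ur, wr in zip(up, weight_map)]
    return average_pool(scaled, factor)

deep = [[2.0] * 8 for _ in range(8)]           # one channel of the fourth-layer output
weight_map = [[0.5] * 32 for _ in range(32)]   # spatial attention weights
fused = fuse(deep, weight_map)
print(len(fused), len(fused[0]))  # 8 8
```

The same 32×32 weight map would be applied to every channel of the fourth-layer output, so positions that the spatial attention marks as unoccluded are emphasized across all channels.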

The channel attention module adjusts the weight of each channel as follows:

At the two branches at the end of the network, the rescaled new feature maps are subtracted, and the absolute value of the feature difference is used as the input of the channel attention module. To aggregate spatial features, the channel attention module applies global average pooling and global max pooling to this difference feature map, obtaining two 1×1×512 channel descriptors; using two different poolings makes the extracted high-level features richer;

The two channel descriptors are each fed into a shallow neural network whose first layer has 512/16 neurons with a ReLU activation function and whose second layer has 512 neurons; this two-layer network is shared between the two descriptors;

The two resulting feature maps are merged by addition, and a Sigmoid activation function is used to obtain the weight coefficient of each channel;

The features recalibrated by the channel attention module serve as the final face feature representation, and the classification result is output by the fully connected layer.
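The channel attention computation described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation of the invention: the function names are hypothetical, the random matrices `w1` and `w2` stand in for the learned weights of the shared two-layer network, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention_weights(diff, w1, w2):
    """diff: (H, W, C) absolute feature difference.
    w1: (C, C // r) and w2: (C // r, C) are the shared two-layer MLP weights."""
    avg_desc = diff.mean(axis=(0, 1))      # global average pooling -> (C,)
    max_desc = diff.max(axis=(0, 1))       # global max pooling     -> (C,)

    def mlp(v):
        # first layer with ReLU, second layer restores C outputs
        return np.maximum(v @ w1, 0.0) @ w2

    # the same MLP processes both descriptors; outputs are added, then squashed
    return sigmoid(mlp(avg_desc) + mlp(max_desc))

rng = np.random.default_rng(0)
C, r = 512, 16                              # 512 channels, first layer of 512/16 neurons
f_occ = rng.standard_normal((8, 8, C))      # occluded-branch features
f_clean = rng.standard_normal((8, 8, C))    # unoccluded-branch features
w1 = 0.05 * rng.standard_normal((C, C // r))
w2 = 0.05 * rng.standard_normal((C // r, C))

diff = np.abs(f_occ - f_clean)
channel_w = channel_attention_weights(diff, w1, w2)
recal_occ = f_occ * channel_w               # channel-wise recalibration
recal_clean = f_clean * channel_w
print(channel_w.shape, recal_occ.shape)     # (512,) (8, 8, 512)
```

Because the weights are derived from the difference between the two branches, the same weight vector is applied to both feature maps.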

The network is trained on a joint objective consisting of the cross-entropy loss function and the contrastive loss caused by the difference within each face image pair; the loss function is minimized through backpropagation, the entire differential network is jointly optimized, and the network is trained iteratively.

Further, the cross-entropy loss function is expressed as follows:

[formula not reproduced in the source text]

The contrastive loss is expressed as follows:

[formula not reproduced in the source text]

In these formulas, one term is the cross-entropy loss function; one term denotes the identity classification probability of sample yⁱ in the network; one symbol denotes the occluded face image in the i-th face image pair, and xⁱ denotes the unoccluded face image in the i-th face image pair; F is the fully connected layer after the last convolutional layer in the differential network; and the remaining term is the contrastive loss between the two images caused by the presence of the occluded region.
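Because the exact loss formulas are not available in this text, the following is only an illustrative sketch of a joint objective of this general kind. It assumes a softmax cross-entropy for the classification term and a squared Euclidean distance between the features of the occluded and unoccluded images for the pairwise term; both forms, the weighting factor `lam`, and all function names are assumptions rather than the formulas of the invention.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (numerically stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def pair_distance_loss(feat_occ, feat_clean):
    """Assumed pairwise term: squared Euclidean distance between the two features."""
    return sum((a - b) ** 2 for a, b in zip(feat_occ, feat_clean))

def joint_loss(logits, label, feat_occ, feat_clean, lam=1.0):
    """Classification term plus a weighted pairwise term (lam is hypothetical)."""
    return cross_entropy(logits, label) + lam * pair_distance_loss(feat_occ, feat_clean)

# Toy usage: three identities, two-dimensional features.
loss = joint_loss([2.0, 0.5, -1.0], 0, [0.2, 0.8], [0.1, 0.9], lam=0.5)
print(loss)
```

Minimizing the sum pushes the network both to classify the identity correctly and to produce similar features for the two images of a pair.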

The Adam optimizer is used for training. After multiple rounds of training the network stabilizes, the iterative process ends, and the trained network model is obtained. The training process, shown in Figure 4, is as follows:

After the image dataset is acquired, the face images are preprocessed;

The differential network model based on the attention mechanism, that is, the network model constructed by the present invention, is built;

The network is trained on the dataset over multiple iterations;

The loss between the network output and the true identity label of the corresponding face image is computed, and this is repeated until the loss stabilizes.

At this point, training ends and the trained network model is output.
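The stopping rule in the training procedure above ("until the loss stabilizes") can be sketched as a simple loop. This is an assumption-laden illustration: `step_fn` is a hypothetical callback that runs one training epoch (including the optimizer update, which is Adam in the text) and returns the mean loss, and the tolerance and patience values are arbitrary.

```python
def train_until_stable(step_fn, max_epochs=100, tol=1e-3, patience=3):
    """Run epochs until the loss changes by less than `tol`
    for `patience` consecutive epochs, or `max_epochs` is reached."""
    history = []
    stable = 0
    for _ in range(max_epochs):
        loss = step_fn()
        if history and abs(history[-1] - loss) < tol:
            stable += 1
        else:
            stable = 0
        history.append(loss)
        if stable >= patience:
            break
    return history

# Toy stand-in for a real epoch: the loss halves every call.
state = {"loss": 1.0}

def toy_epoch():
    state["loss"] *= 0.5
    return state["loss"]

history = train_until_stable(toy_epoch)
print(len(history), history[-1])
```

In a real setting `step_fn` would iterate over the source-domain face image pairs and apply the optimizer; the loop itself only decides when training ends.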

The trained network model is shown in Figure 5.

When the trained neural network model is used, the target-domain single-sample face Gallery set and the partially occluded test face image are input into the trained network model, and the model outputs the category to which the occluded face belongs, computed from the cosine distance between face image features.
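The matching step can be illustrated with a short sketch that assigns a probe feature to the gallery identity at the smallest cosine distance. The feature vectors, identity names, and function names below are hypothetical; in practice the features would come from the trained network.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def identify(probe_feat, gallery):
    """gallery maps each identity to its single enrolled feature vector."""
    return min(gallery, key=lambda ident: cosine_distance(probe_feat, gallery[ident]))

gallery = {"id_a": [1.0, 0.0, 0.0], "id_b": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]          # feature of a partially occluded test face
print(identify(probe, gallery))  # id_a
```

Since each identity has only one gallery image, recognition reduces to a nearest-neighbor search over one feature vector per identity.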

A neural network module, used to build and train a differential neural network formed by two identical residual networks with embedded attention mechanisms;

Further, the neural network module includes ResNet-34. When the data output by the data preprocessing module is input into the neural network module, a convolutional layer extracts a shallow feature map, which is fed into a residual network composed of four residual module groups cascaded in sequence to extract the global features of the face image. A spatial attention module is embedded between the second-layer residual module and the fourth-layer residual module; it adjusts the weights of the pixels in the unoccluded region of the shallow feature map and outputs a spatial position weight feature map. The spatial position weight feature map is combined with the feature map output by the fourth-layer residual module, and this fusion of cross-layer information captures the local detail features from the lower layer;
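As a quick consistency check on the backbone described above, the feature map shapes can be tabulated with a small sketch. It assumes a 128×128 input, a stride-1 first convolution with 64 channels, residual groups of 3, 4, 6, and 3 modules that each halve the spatial size, and the usual ResNet-34 channel widths of 64, 128, 256, and 512; the function name is hypothetical.

```python
def stage_shapes(input_size=128, stem_channels=64, blocks=(3, 4, 6, 3)):
    """Return (name, height, width, channels) for the stem and each residual group."""
    shapes = [("stem conv", input_size, input_size, stem_channels)]
    size, channels = input_size, stem_channels
    for idx, n_blocks in enumerate(blocks, start=1):
        size //= 2                 # each group halves the spatial size
        if idx > 1:
            channels *= 2          # assumed ResNet-34 widths: 64, 128, 256, 512
        shapes.append((f"group {idx} ({n_blocks} modules)", size, size, channels))
    return shapes

for name, h, w, c in stage_shapes():
    print(f"{name}: {h}x{w}x{c}")
```

Under these assumptions the second group yields a 32×32×128 map and the fourth an 8×8×512 map, which are the two tensors that the spatial attention module connects.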

The channel attention module adjusts the weight of each channel as follows:

Global average pooling and global max pooling are applied to the subtracted values received by the channel attention module to obtain two channel descriptors;

The two channel descriptors are each fed into a shallow neural network to extract features, and the extracted features are merged by addition;

A Sigmoid activation function is applied to the merged features to obtain the weight coefficient of each channel;

Once the channel weights are obtained, they are multiplied channel by channel onto the feature maps of the occluded image and the unoccluded image in which the local detail features have been highlighted, completing the recalibration of the original features in the channel dimension;

The features recalibrated by the channel attention module serve as the final face feature representation, and the classification result is output by the fully connected layer.

The network is trained iteratively by jointly optimizing the cross-entropy loss function and the contrastive loss caused by the difference within each face image pair. The training process is described in detail in the method section and is not repeated here.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A single-sample partial occlusion face recognition method based on an attention mechanism, characterized in that it specifically comprises the following steps:

Input a set of face image pairs with category labels as the source domain dataset, where each face image pair includes a clean frontal face and a partially occluded face of the same identity, and preprocess the face image pair dataset;

Input the preprocessed face image into the differential network composed of two ResNet-34, and extract the shallow feature map through a convolutional layer;

Input the above shallow feature map into a residual network composed of four residual module groups cascaded in sequence to extract the global features of the face image;

Embed a spatial attention module between the second layer and the fourth layer residual module, adjust the weight of the pixels in the unoccluded area of the shallow feature map, and output a spatial position weight feature map;

The spatial position weight feature map and the feature map output by the fourth-layer residual module are connected by multiplication, and the local detail features from the lower layer are obtained through the fusion of cross-layer information;

The absolute value of the difference between the feature maps of the occluded image and the unoccluded image, after the local detail features have been highlighted, is used as the input of the channel attention module;

According to the input absolute value, the channel attention module calibrates, channel by channel, the feature maps of the occluded image and the unoccluded image in which the local detail features have been highlighted, and the calibrated feature maps are sent to the fully connected layer to output the classification result;

The network is iteratively trained by jointly optimizing the cross-entropy loss function and the contrastive loss caused by the difference within each face image pair;

After multiple rounds of training, the network loss stabilizes, the iterative training process ends, and the trained network model is obtained;

The single-sample face gallery set of the target domain and the partially occluded test face image are input into the trained network model, and the model outputs the category of the occluded face, computed from the cosine distance between face image features.

2. The single-sample partial occlusion face recognition method based on the attention mechanism according to claim 1, wherein the preprocessing of the face image pair dataset comprises cropping each face image to a size of 128×128 and performing a pixel normalization operation on the cropped face image, expressed as:

X_pix = (X_pix - 128)/128;

where X_pix is a pixel value of the face image.

3. The single-sample partial occlusion face recognition method based on the attention mechanism according to claim 1, characterized in that extracting the shallow feature map through a convolutional layer comprises: inputting the face image, which has 3 channels, into a convolutional layer with a 3×3 convolution kernel, 64 channels, and a stride of 1 for feature extraction, and outputting a feature map of size 128×128 with 64 channels; this convolutional layer is used to obtain the shallow feature map of the face image.

4. The single-sample partial occlusion face recognition method based on an attention mechanism according to claim 1, wherein, of the four residual module groups cascaded in sequence, the groups contain 3, 4, 6, and 3 residual modules respectively, and the feature maps output by the residual module groups are of size 64×64, 32×32, 16×16, and 8×8 in cascade order.

5. The single-sample partial occlusion face recognition method based on an attention mechanism according to claim 4, wherein the process of obtaining the local detail features from the lower layer comprises:

A spatial attention module is embedded between the second-layer residual module and the fourth-layer residual module;

Global average pooling and global max pooling are applied to the input h′×w′×c′ three-dimensional tensor to obtain the weight vector feature map;

The weight vector feature map is processed by a convolutional layer with a 7×7 convolution kernel, a padding size of 3, and 1 output channel, followed by a Sigmoid nonlinear activation layer;

The 8×8 feature map output by the fourth-layer residual module group is upsampled to a 32×32 feature map through bilinear interpolation;

The upsampled feature map is multiplied by the processed weight vector feature map, and then the 8×8 feature map is output by downsampling to obtain the local detail features from the lower layers.

6. The single-sample partial occlusion face recognition method based on the attention mechanism according to claim 1, characterized in that the channel attention module calibrating, channel by channel and according to the input absolute value, the feature maps of the occluded image and the unoccluded image in which the local detail features have been highlighted comprises:

Perform a subtraction operation on the feature map of the occluded image after highlighting the local detail features and the feature map of the unoccluded image, and use the absolute value of the difference obtained by the subtraction as the input of the channel attention module;

The channel attention module adopts global average pooling and global maximum pooling operations to obtain two 1×1×C channel descriptions;

The two channel descriptions are respectively sent to a shallow neural network. In the shallow neural network, the number of neurons in the first layer is C/r, the activation function is ReLU, and the number of neurons in the second layer is C;

Add and merge the two feature maps extracted by the shallow neural network, and use the Sigmoid activation function to obtain the weight coefficients of each channel;

After the weight of the feature channel is obtained, the channel-by-channel weighting is applied to the features of the occluded image and the unoccluded image after highlighting the local detail features, and the original feature re-calibration in the channel dimension is completed;

where C is the number of channels and r is the dimension reduction factor.

7. The single-sample partial occlusion face recognition method based on an attention mechanism according to claim 1, wherein the cross-entropy loss function is expressed as:

[formula not reproduced in the source text]

where the loss term is the cross-entropy loss function;

8. The single-sample partial occlusion face recognition method based on an attention mechanism according to claim 1, wherein the contrastive loss function is expressed as:

[formula not reproduced in the source text]

9. A single-sample partial occlusion face recognition system based on an attention mechanism, characterized in that it includes an image acquisition module, a data preprocessing module, a neural network module and an output module, wherein:

The image acquisition module is used to input the data set and obtain the face image information or the face image to be tested;

The data preprocessing module is used to normalize the pixels of the face image, and at the same time, perform data enhancement in the source domain data set, and expand the source domain training set by randomly adding occluders;

The neural network module is used to build and train a differential neural network consisting of two identical ResNet-34 networks with embedded attention mechanisms;

The output module is used to output the final identity category of the face image to be tested: the model trained on the source domain is transferred to the target domain dataset, the single-sample face gallery set and the partially occluded test face image are fed into the model, and the model determines the identity of the partially occluded test face image.