
CN117854160A - Face liveness detection method and system based on artificial multimodality and fine-grained patches - Google Patents

Face liveness detection method and system based on artificial multimodality and fine-grained patches

Info

Publication number
CN117854160A
CN117854160A
Authority
CN
China
Prior art keywords
face
modality
image sequence
valid
module
Prior art date
Legal status
Pending
Application number
CN202311758536.5A
Other languages
Chinese (zh)
Inventor
林峰
马哲
傅凯强
任奎
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202311758536.5A
Publication of CN117854160A
Legal status: Pending


Classifications

    • G06V 40/45: Detection of the body part being alive (under G06V 40/40, Spoof detection, e.g. liveness detection)
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/56: Extraction of image or video features relating to colour
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 40/168: Feature extraction; Face representation
    • G06V 40/172: Classification, e.g. identification


Abstract

The invention provides a face liveness detection method and system based on artificial multimodality and fine-grained patches. Patches cropped from the facial image are used to recognize local features, improving detection accuracy. A classification loss with asymmetric margins and a self-supervised similarity loss regularize the feature embedding space, improving the stability and reliability of detection.

Description

A method and system for face liveness detection based on artificial multimodality and fine-grained patches

Technical Field

The present invention belongs to the field of face recognition, and in particular relates to a face liveness detection method and system based on artificial multimodality and fine-grained patches.

Background

In recent years, face recognition has been widely deployed in daily life, especially for smartphone login, access control and mobile payment. However, with the popularity of social media, the risk of personal face photos being leaked has grown: attackers can easily produce photos, videos or 3D masks of a target victim, and such attacks seriously threaten the reliability and security of face recognition systems. As face recognition matures further, face liveness detection has therefore become a key link in ensuring the security and reliability of these systems and has significant application value.

At present, the closest prior art is deep-learning-based face liveness detection, which applies deep learning to face images to determine whether the face is live.

Although existing face liveness detection techniques guarantee security to a certain extent, they have many limitations. They usually adopt a responsive approach that captures image information in a single modality and requires the user to cooperate by completing specified random actions such as nodding, shaking the head or opening the mouth; multiple consecutive frames are then input and dynamic spatiotemporal features are extracted for detection. This approach is unsuitable for non-cooperative deployments that require rapid decisions. Moreover, an attacker who pre-records a video of the requested action or wears a fake head cover can still pass liveness detection. Non-cooperative face liveness detection methods therefore have greater practical value and enable faster, more effective detection. In addition, single-modality methods cannot exploit data from other modalities during training, so complementary information across modalities is discarded, degrading detection performance.

In addition, with the spread of consumer-grade sensors such as depth cameras, new liveness detection methods have emerged that use multimodal information such as depth and infrared images to extract complementary features and fuse them across modalities to improve accuracy. However, multimodal face liveness detection has the following limitations, which restrict its use in some application scenarios: 1) multimodal face images are hard to obtain, since each modality requires a corresponding sensor and the high cost of near-infrared and depth sensors makes the data difficult to acquire; and 2) multimodal sensors are hard to integrate and to deploy widely on mobile terminals.

Summary of the Invention

The purpose of the present invention is to address these deficiencies in the prior art.

The technical solution adopted by the present invention is as follows:

A face liveness detection method based on artificial multimodality and fine-grained patches, comprising:

collecting a color image sequence of the target area;

performing facial keypoint detection on the color image sequence to obtain an initial face box;

aligning and cropping the collected color image sequence based on the obtained initial face box and facial keypoint information to obtain a valid face image sequence;

extracting several modalities from the valid face image sequence, including one or more of a moiré modality, a depth map modality, a dense optical flow modality and a temporal pooling modality;

inputting the valid face image sequence and the several modalities into a pre-trained face liveness detection model to obtain a detection result. The face liveness detection model comprises several artificial modality feature extraction modules, a CBAM attention module, a cropping module, a fusion feature extraction module and a fully connected layer. The artificial modality feature extraction modules correspond one-to-one to the valid face image sequence and the several modalities and extract features from each of them. The CBAM attention module fuses the features output by the artificial modality feature extraction modules. The cropping module randomly horizontally flips and randomly rotates the fused features output by the CBAM attention module and then crops them into multiple face patches of fixed size. The fusion feature extraction module extracts features from the face patches, and the fully connected layer outputs the detection result based on those features.

The face liveness detection model is trained on a collected dataset by minimizing a loss function. Each sample of the dataset contains the valid face image sequence and the several modalities obtained from a color image sequence of one of N patch types, together with the corresponding true label. The N patch types consist of k live (real face) classes and N-k spoof classes. The loss function includes an asymmetric margin-based classification loss and a self-supervised similarity loss. A live class corresponds to real faces captured by the acquisition device under particular parameters, such as a particular capture resolution; a spoof class corresponds to faces presented through a particular spoofing medium and captured under particular parameters.

Further, a pre-trained color image face detection model is used to detect face candidate boxes and facial keypoints in the color image sequence, yielding the initial face box and facial keypoint information; the color image face detection model is built on the SCRFD face detection algorithm.

Further, the moiré modality is extracted from the valid face image sequence using a pre-trained MoireDet network.

Further, the depth map modality is extracted from the valid face image sequence using a pre-trained PRNet network.

Further, the dense optical flow modality is extracted as follows:

A pyramid-based optical flow algorithm computes, from the first and last frames of the valid face image sequence, the motion vector of each pixel between the two frames; the dense optical flow modality is obtained from the magnitude and direction of these motion vectors.

Further, the temporal pooling modality is extracted as follows:

The Rank Pooling method encodes the valid face image sequence into a single dynamic/temporal image, which is the temporal pooling modality.

Further, the artificial modality feature extraction module adopts ResNet18, Light CNN, Xception or MobileNet.

Further, the asymmetric margin-based classification loss is:

$$\mathcal{L}_{AMC}=-\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\left(\cos\theta_{y_i}-m_{y_i}\right)}}{e^{s\left(\cos\theta_{y_i}-m_{y_i}\right)}+\sum_{j\neq y_i}e^{s\cos\theta_{j}}},\qquad m_{y_i}=\begin{cases}m_l, & y_i\in L\\ m_s, & y_i\in S\end{cases}$$

with $\cos\theta_{j}=W_j^{\top}f_i^t\big/\left(\lVert W_j\rVert_2\,\lVert f_i^t\rVert_2\right)$, where the subscript $i$ denotes the $i$-th sample; $m_l$ and $m_s$ are the angular margins of the live and spoof classes; $f_i^t$ is the patch feature of the $t$-th face patch extracted by the fusion feature extraction module, $t\in P$, where $P$ is the number of face patches produced by the cropping module; $W_j$ is the $j$-th column of the fully connected layer; $y_i$ is the true label of the $i$-th sample and $W_{y_i}$ the corresponding column; $s$ is a scaling hyperparameter; $f_i^{t_1}$ and $f_i^{t_2}$ are any two different patch features from the same sample, $t_1,t_2\in P$; $n$ is the number of samples; $L$ is the set of $k$ live classes and $S$ the set of $N-k$ spoof classes;

The self-supervised similarity loss is:

$$\mathcal{L}_{sim}=\frac{1}{n}\sum_{i=1}^{n}\left\lVert\frac{f_i^{t_1}}{\lVert f_i^{t_1}\rVert_2}-\frac{f_i^{t_2}}{\lVert f_i^{t_2}\rVert_2}\right\rVert_2^2$$

where $\lVert\cdot\rVert_2$ denotes the L2 norm.

A face liveness detection system based on artificial multimodality and fine-grained patches, used to implement the above face liveness detection method, comprising:

Acquisition camera: used to acquire a color image sequence of the target area;

Image detection module: comprises a color face detection unit and an image preprocessing unit, where the color face detection unit detects faces and obtains the initial face box and facial keypoint information, and the image preprocessing unit performs face alignment on the collected color image sequence based on the facial keypoint information;

Image cropping module: crops the aligned color image sequence based on the obtained initial face box to obtain the valid face image sequence;

Image artificial modality generation module: extracts several modalities from the valid face image sequence;

Liveness judgment module: includes a pre-trained face liveness detection model and inputs the valid face image sequence and the several modalities into it to obtain the detection result.

Further, the image artificial modality generation module comprises:

a moiré modality generation unit, for generating the moiré modality from the valid face image sequence;

a depth map modality generation unit, for generating the depth map modality from the valid face image sequence;

a dense optical flow modality generation unit, for generating the dense optical flow modality from the valid face image sequence;

a temporal pooling modality generation unit, for generating the temporal pooling modality from the valid face image sequence.

The beneficial effects of the present invention are:

1. The invention artificially creates dense optical flow, temporal pooling, moiré and depth map modalities from RGB images and fuses these artificial modalities, aiming to counter the accuracy drop of single-RGB-image detection while avoiding the high deployment cost of extra sensors for collecting multimodal information.

2. Prior art often ignores local cues in face images (traces of the capture device and of the spoofing material). The invention therefore recognizes local features from patches cropped out of the facial image, improving detection accuracy.

3. To improve the generalization of the network in extracting spoof features, so that the invention also applies to unseen attacks, an asymmetric margin-based classification loss and a self-supervised similarity loss are used to regularize the feature embedding space, improving the stability and reliability of detection.

Brief Description of the Drawings

FIG. 1 is a flow chart of the face liveness detection method based on artificial multimodality and fine-grained patches of the present invention;

FIG. 2 is a structural diagram of the face liveness detection system based on artificial multimodality and fine-grained patches of the present invention.

Detailed Description

Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application.

As used in this application and the appended claims, the singular forms "a", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited by these terms, which only distinguish information of the same type from each other. For example, without departing from the scope of the application, first information may also be called second information and, similarly, second information may be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining".

The face liveness detection method based on artificial multimodality and fine-grained patches provided by the present invention includes the following steps:

S1: Collect a color image sequence of the target area.

In one embodiment, color images of the target area are collected by an acquisition device for a total of 2 seconds. The acquisition device is an RGB camera that captures color images. The capture frequency may vary with functional requirements; for example, color images may be captured at 60 FPS.
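As a sketch of this acquisition step, assuming a standard webcam exposed through OpenCV (the 2-second window and 60 FPS target come from the text; the device index, codec and buffering behavior are deployment assumptions):

```python
import cv2

def capture_color_sequence(seconds=2.0, fps=60, device=0):
    """Capture a short color (BGR) image sequence from an RGB camera."""
    cap = cv2.VideoCapture(device)
    cap.set(cv2.CAP_PROP_FPS, fps)   # request the target frame rate
    frames = []
    n_target = int(seconds * fps)
    while len(frames) < n_target:
        ok, frame = cap.read()
        if not ok:
            break                    # camera disconnected or end of stream
        frames.append(frame)
    cap.release()
    return frames                    # list of HxWx3 uint8 BGR images
```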

S2: Perform facial keypoint detection on the color image sequence to obtain the initial face box and facial keypoint information.

Specifically, the color images are passed to a pre-trained color image face detection model for facial keypoint detection. In one embodiment, the model is built on the SCRFD face detection algorithm and includes two key parts: (1) Sample Redistribution (SR), which adds training samples at the most needed stages based on the statistics of the benchmark dataset; and (2) Computation Redistribution (CR), which reallocates computation among the backbone, neck and head of the model based on a carefully defined search method. Together these allocate samples and computation where they are most needed, balancing high efficiency with high accuracy. Concretely, the color image face detection model consists of a feature extraction network, a path aggregation feature pyramid network, head convolution layers and an output layer; the steps are:

S20: Feed the color image into the feature extraction network with a ResNet34 backbone, in which every layer outputs 75% of the original number of channels; the final output feature maps have 48 channels.

S21: Feed the extracted feature maps into the Path Aggregation Feature Pyramid Network (PAFPN) to obtain the effective feature fusion layers; the final output feature maps have 24 channels.

S22: Feed the effective feature fusion layers into the head convolution layers, which consist of three 3x3 convolutions; the final output feature maps have 96 channels.

S23: The output layer predicts faces from the head feature maps, yielding the initial face box and facial keypoint information.

The output layer uses two anchor (prior) boxes, each covering a certain region of the color image. Face detection is performed for each anchor: the predicted probability that the anchor contains a face is compared against a confidence threshold of 0.5. If the probability exceeds the threshold, the anchor contains a face and becomes the initial face box.
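For illustration, the open-source insightface package ships SCRFD-based detectors whose output matches what this step needs (a box plus five landmarks per face); using it is a deployment assumption, and only the 0.5 confidence threshold comes from the text:

```python
import cv2
from insightface.app import FaceAnalysis

# Default model bundles include an SCRFD detector with 5-point landmarks.
app = FaceAnalysis(allowed_modules=['detection'])
app.prepare(ctx_id=0, det_thresh=0.5, det_size=(640, 640))

img = cv2.imread('frame.jpg')
faces = app.get(img)   # each face carries .bbox (x1, y1, x2, y2) and .kps (5x2 landmarks)
for f in faces:
    print(f.bbox, f.kps)
```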

S3: Based on the initial face box and facial keypoint information from step S2, align and crop the images collected in step S1 to obtain the valid face image sequence, reducing the influence of lighting, color, complex backgrounds and similar factors on face liveness detection and thereby further improving its accuracy.
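The alignment-and-crop step is commonly realized as a similarity transform from the five detected landmarks onto a canonical template; the sketch below makes that assumption explicit. The template coordinates (scaled from the widely used 112x112 five-point template) and the 160x160 crop size are illustrative, not values fixed by the patent:

```python
import cv2
import numpy as np

# Canonical 5-point template (eyes, nose tip, mouth corners) for a 160x160 crop;
# coordinates are scaled from the common 112x112 alignment template.
TEMPLATE = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32) * (160.0 / 112.0)

def align_face(img, kps, size=160):
    """Warp the detected face onto the canonical template and crop it."""
    M, _ = cv2.estimateAffinePartial2D(np.asarray(kps, np.float32), TEMPLATE)
    return cv2.warpAffine(img, M, (size, size), borderValue=0)
```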

S4: Generate artificial modalities from the valid face image sequence obtained in step S3, including one or more of the moiré, depth map, dense optical flow and temporal pooling modalities. In one embodiment all four are used, with the following steps:

S41: Moiré modality generation. A pre-trained MoireDet network (Yang C, Yang Z, Ke Y, et al. Doing More With Moiré Pattern Detection in Digital Photos [J]. IEEE Transactions on Image Processing, 2023, 32: 694-708) extracts the corresponding moiré modality from the valid face image sequence. MoireDet uses three encoders, a High-Level Encoder, a Low-Level Encoder and a Spatial Encoder, to capture both high-level contextual features (the moiré effect) and low-level structural features (fine-grained stripes, ripples and curves) in the image. The high-level encoder uses ResNet18 as its backbone and is connected to two BiFPN layers that repeatedly encode moiré features; BiFPN allows top-down and bottom-up multi-scale feature fusion and thus effectively encodes the contextual features of high-level moiré patterns. The low-level encoder uses the early layers of ResNet18 (the first two blocks) and uses attention from the BiFPN output to enhance the low-level features produced by the stripes, ripples and curves that give rise to moiré; because the BiFPN attention carries rich high-level context, these early ResNet18 layers can better locate and capture the necessary structural features. The spatial encoder contains a 5x5 convolution kernel with adaptive weights computed by two 1x1 convolutions; this adaptive kernel computes activations only over specific regions. Finally, the feature maps from the three encoders are concatenated to estimate the moiré edge map.

S42: Depth map modality generation. A pre-trained PRNet network (Feng Y, Wu F, Shao X, et al. Joint 3D face reconstruction and dense alignment with position map regression network [C] // Proceedings of the European Conference on Computer Vision (ECCV). 2018: 534-551) extracts the corresponding depth map from the valid face image sequence. PRNet adopts an encoder-decoder structure: the encoder starts with a convolution layer followed by 10 residual blocks that reduce the 256x256x3 input image to 8x8x512 feature maps, and the decoder contains 17 transposed convolution layers that generate the predicted 256x256x3 position map. All convolution and transposed convolution layers use kernel size 4 with ReLU activations.

S43: Dense optical flow modality generation. The optical flow is computed from the first and last frames of the valid face image sequence. Specifically, a pyramid-based optical flow algorithm processes the two frames to compute the motion vector of each pixel between them; the dense optical flow modality is obtained from the magnitude and direction of these motion vectors.
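A minimal sketch of this step with OpenCV's pyramid-based Farneback algorithm; the parameter values and the HSV encoding of magnitude and direction are conventional choices assumed here, not mandated by the patent:

```python
import cv2
import numpy as np

def dense_flow_modality(first_frame, last_frame):
    """Dense optical flow between the first and last frames, encoded as an image:
    motion direction -> hue, motion magnitude -> value (HSV color coding)."""
    g0 = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(last_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(first_frame)
    hsv[..., 0] = ang * 180 / np.pi / 2                              # direction
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```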

S44: Temporal pooling modality generation. The Rank Pooling method encodes the valid face image sequence into a single dynamic/temporal image, capturing how frame-level features evolve over time; this image is the temporal pooling modality.
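Rank Pooling itself fits a linear ranking model to the frame order; for illustration, the closed-form "approximate rank pooling" weighting from the dynamic image literature is sketched below. The simplified per-frame coefficients and the rescaling to an 8-bit image range are assumptions, not details taken from the patent:

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a frame sequence into one dynamic/temporal image.

    Uses the closed-form approximate rank pooling weights (alpha_t = 2t - T - 1,
    a common simplification when pooling raw pixels) instead of solving the
    ranking regression exactly.
    frames: sequence of T HxWxC uint8 images -> HxWxC float32 image."""
    T = len(frames)
    alphas = np.array([2.0 * t - T - 1.0 for t in range(1, T + 1)], dtype=np.float32)
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)  # T x H x W x C
    dyn = np.tensordot(alphas, stack, axes=(0, 0))
    # Rescale to [0, 255] so the result can be consumed like an ordinary image.
    dyn = (dyn - dyn.min()) / (dyn.max() - dyn.min() + 1e-8) * 255.0
    return dyn
```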

S5: Input the valid face image sequence and the several modalities into a pre-trained face liveness detection model to obtain the detection result. The model comprises several artificial modality feature extraction modules, a CBAM attention module, a cropping module, a fusion feature extraction module and a fully connected layer. The artificial modality feature extraction modules correspond one-to-one to the valid face image sequence and the several modalities and extract features from each; the CBAM attention module fuses the features output by these modules; the cropping module randomly horizontally flips and randomly rotates the fused features and crops them into multiple fixed-size face patches; the fusion feature extraction module extracts features from each patch; and the fully connected layer outputs the detection result from those features. Detection proceeds as follows:

S51: Feed the valid face image sequence and the moiré, depth, dense optical flow and temporal pooling modalities into the five corresponding artificial modality feature extraction modules. In this embodiment ResNet18 is used as the artificial modality feature extraction module; structures such as Light CNN, Xception or MobileNet may be used instead.

S52: Fuse the five extracted feature maps with the CBAM attention module. The fused feature map is passed to the cropping module, randomly horizontally flipped, randomly rotated and cropped into several fixed-size (e.g. 160x160) face patches. This processing enhances the robustness of the model and its adaptability to various changes.

S53: Send the face patches separately into the fusion feature extraction module (e.g. ResNet18) for feature extraction and normalization, then into the fully connected layer for liveness judgment, and output the judgment result. A structural sketch of this network follows.
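As a structural sketch only, the network described in S51-S53 could be laid out as follows in PyTorch. The five ResNet18 branches, CBAM placement, patch cropping and cosine-style fully connected head follow the text; the channel reduction, patch count, embedding size and the omission of random rotation are illustrative assumptions (a concrete CBAM block would be plugged in where `nn.Identity` stands):

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as tvm

class PatchLivenessNet(nn.Module):
    """Five modality branches -> attention fusion -> random patches -> shared head."""
    def __init__(self, n_classes, n_modalities=5, n_patches=9, patch_size=160, cbam=None):
        super().__init__()
        def trunk():  # ResNet18 kept up to its last conv stage (drop avgpool/fc)
            r = tvm.resnet18(weights=None)
            return nn.Sequential(*list(r.children())[:-2])
        self.branches = nn.ModuleList(trunk() for _ in range(n_modalities))
        self.reduce = nn.Conv2d(512 * n_modalities, 512, kernel_size=1)
        self.cbam = cbam or nn.Identity()                # CBAM attention goes here
        self.embed = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(512, 256))
        self.fc = nn.Linear(256, n_classes, bias=False)  # columns W_j of the head
        self.n_patches, self.patch_size = n_patches, patch_size

    def _random_patches(self, fmap):
        # Random horizontal flip, then fixed-size crops (random rotation omitted).
        B, C, H, W = fmap.shape
        k = min(self.patch_size, H, W)
        out = []
        for _ in range(self.n_patches):
            m = torch.flip(fmap, dims=[-1]) if random.random() < 0.5 else fmap
            y, x = random.randint(0, H - k), random.randint(0, W - k)
            out.append(m[:, :, y:y + k, x:x + k])
        return out

    def forward(self, inputs):  # inputs: list of n_modalities tensors, each Bx3xHxW
        feats = torch.cat([b(x) for b, x in zip(self.branches, inputs)], dim=1)
        fused = self.cbam(self.reduce(feats))
        logits = [self.fc(F.normalize(self.embed(p), dim=1))
                  for p in self._random_patches(fused)]
        return torch.stack(logits, dim=1)  # B x n_patches x n_classes
```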

The face liveness detection model is trained on a collected dataset by minimizing a loss function, where each sample of the dataset is assigned a fine-grained class according to the presented material and the capture device. Each sample contains the valid face image sequence and the several modalities obtained from a color image sequence of one of N patch types, together with the corresponding true label; the N patch types consist of k live classes and N-k spoof classes. For example, the CASIA-FASD dataset has two different spoofing media and three different capture resolutions, giving nine fine-grained patch types in total: three live types (real faces captured at the three resolutions without any spoofing medium) and six spoof types.

For the loss function, an Asymmetric Margin-based Classification Loss is adopted so that the feature clusters of each class are more compact and generalize better. Moreover, since the distribution differences among spoof samples are larger than among live samples, live and spoof samples are treated asymmetrically: the model is forced to learn a more compact cluster for live samples while spoof samples are allowed to spread out in the feature space. In addition, a Self-supervised Similarity Loss applies the positive part of a contrastive loss to two transformed patch views of a single face image, further regularizing the patch features. This processing enhances the robustness of the model and its adaptability to various changes.

Specifically, assume the training set has N patch-type classes, consisting of k live (real face) classes and N-k spoof classes. The true label of each input sample belongs to one fine-grained class y_i ∈ {L_1, L_2, …, L_k, S_1, S_2, …, S_{N-k}}. The set of live classes is L = {L_1, L_2, …, L_k} and the set of spoof classes is S = {S_1, S_2, …, S_{N-k}}.

The asymmetric margin-based classification loss is:

$$\mathcal{L}_{AMC}=-\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\left(\cos\theta_{y_i}-m_{y_i}\right)}}{e^{s\left(\cos\theta_{y_i}-m_{y_i}\right)}+\sum_{j\neq y_i}e^{s\cos\theta_{j}}},\qquad m_{y_i}=\begin{cases}m_l, & y_i\in L\\ m_s, & y_i\in S\end{cases}$$

with $\cos\theta_{j}=W_j^{\top}f_i^t\big/\left(\lVert W_j\rVert_2\,\lVert f_i^t\rVert_2\right)$, where the subscript $i$ denotes the $i$-th sample; $m_l$ and $m_s$ are the angular margins of the live and spoof classes and are adjustable parameters; $f_i^t$, the patch feature of the $t$-th face patch extracted by the fusion feature extraction module ($t\in P$, with $P$ the number of face patches produced by the cropping module), is the input of the fully connected layer used for classification; $W_j$, the $j$-th column of the fully connected layer, produces the logit of the $j$-th class; $y_i$ is the true label of the $i$-th sample and $W_{y_i}$ the corresponding column; $s$ is a hyperparameter that scales the values; $f_i^{t_1}$ and $f_i^{t_2}$ are any two different patch features from the same sample, $t_1,t_2\in P$; and $n$ is the number of samples.

The self-supervised similarity loss is:

$$\mathcal{L}_{sim}=\frac{1}{n}\sum_{i=1}^{n}\left\lVert\frac{f_i^{t_1}}{\lVert f_i^{t_1}\rVert_2}-\frac{f_i^{t_2}}{\lVert f_i^{t_2}\rVert_2}\right\rVert_2^2$$

where $\lVert\cdot\rVert_2$ denotes the L2 norm.
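Under the reconstruction above (an additive-cosine-margin softmax with a class-dependent margin, plus an L2 distance between L2-normalized patch features), the two losses can be sketched in PyTorch as follows. The margin placement, margin values and scale s are assumptions, since the filing's formula images are not reproduced in this text:

```python
import torch
import torch.nn.functional as F

def amc_loss(feats, weight, labels, live_mask, m_l=0.4, m_s=0.1, s=30.0):
    """Asymmetric margin-based classification loss (reconstruction, AM-softmax form).
    feats: BxD patch features; weight: CxD fully connected weight (rows = classes);
    labels: B fine-grained class ids; live_mask: C bools, True for live classes."""
    cos = F.normalize(feats) @ F.normalize(weight).t()   # B x C cosine similarities
    margins = torch.where(live_mask[labels],             # per-sample margin m_{y_i}
                          torch.tensor(m_l), torch.tensor(m_s))
    one_hot = F.one_hot(labels, cos.size(1)).bool()
    logits = s * torch.where(one_hot, cos - margins.unsqueeze(1), cos)
    return F.cross_entropy(logits, labels)

def similarity_loss(f1, f2):
    """Self-supervised similarity loss between two patch views of the same samples."""
    return (F.normalize(f1) - F.normalize(f2)).pow(2).sum(dim=1).mean()
```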

Face liveness detection is then performed with the trained model. Given a test face image, the valid face image sequence and the several modalities are input to the pre-trained face liveness detection model; after feature extraction and fusion, the cropping module uniformly crops face patches of the same size as in training from the whole fused feature map for network inference. Suppose P face patches are cropped from one face image and the fusion feature extraction module extracts the corresponding P patch features (f_1, f_2, …, f_P); the average liveness probability is then obtained by summing the predicted probabilities of the live classes in the last fully connected layer:

$$p_{live}=\frac{1}{P}\sum_{t=1}^{P}\sum_{j\in L}\mathrm{softmax}\!\left(W^{\top}f_t\right)_j$$

In one embodiment the liveness probability threshold is preset to 0.6. If the probability produced by the face liveness detection model exceeds the threshold, the face in the image is judged live and the liveness result is output; otherwise the operation ends. It should be understood that the preset liveness probability threshold can be set according to the actual situation and is not limited here.
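Continuing the sketch, inference sums the probability mass of the live classes per patch and averages over the P patches; only the 0.6 threshold is taken from the text:

```python
import torch

def liveness_probability(patch_logits, live_mask):
    """patch_logits: P x C logits for one face; live_mask: C bools for live classes.
    Returns the average probability assigned to live classes across patches."""
    probs = torch.softmax(patch_logits, dim=1)       # P x C
    return probs[:, live_mask].sum(dim=1).mean().item()

# is_live = liveness_probability(logits, live_mask) > 0.6
```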

FIG. 2 shows a schematic diagram of a face liveness detection system based on artificial multimodality and fine-grained patches. The system 200 includes the following parts:

Acquisition camera 201: acquires a color image sequence of the target area.

Image detection module 202: includes a color face detection unit 2021 and an image preprocessing unit 2022. The color face detection unit 2021 detects faces and obtains the initial face box and facial keypoint information. The image preprocessing unit 2022 performs face alignment using the initial face box and facial keypoint information to obtain effectively processed images, reducing the influence of lighting, color, complex backgrounds and similar factors on liveness detection and further improving its accuracy.

Image cropping module 203: crops the color images corresponding to the aligned processed images to obtain the valid face image sequences.

Image artificial modality generation module 204: generates the different artificial modalities from the original RGB modality; it contains a moiré modality generation unit 2041, a depth map modality generation unit 2042, a dense optical flow modality generation unit 2043 and a temporal pooling modality generation unit 2044.

Liveness judgment module 205: includes the pre-trained face liveness detection model and inputs the valid face image sequence and the several modalities into it to perform liveness judgment and output the result.

The system improves the accuracy and robustness of face liveness detection through artificial multimodality and fine-grained patches. Obviously, the above embodiments are merely examples given for clarity of explanation and do not limit the implementations. Those of ordinary skill in the art can make changes or modifications in other forms on the basis of the above description. It is neither necessary nor possible to exhaust all implementations here, and obvious changes or modifications derived therefrom remain within the scope of protection of the present invention.

Claims (10)

1. A face liveness detection method based on artificial multimodality and fine-grained patches, characterized by comprising:

collecting a color image sequence of the target area;

performing facial keypoint detection on the color image sequence to obtain an initial face box;

aligning and cropping the collected color image sequence based on the obtained initial face box and facial keypoint information to obtain a valid face image sequence;

extracting several modalities from the valid face image sequence, including one or more of a moiré modality, a depth map modality, a dense optical flow modality and a temporal pooling modality;

inputting the valid face image sequence and the several modalities into a pre-trained face liveness detection model to obtain a detection result; the face liveness detection model comprises several artificial modality feature extraction modules, a CBAM attention module, a cropping module, a fusion feature extraction module and a fully connected layer; the artificial modality feature extraction modules correspond one-to-one to the valid face image sequence and the several modalities and extract features from each of them; the CBAM attention module fuses the features output by the artificial modality feature extraction modules; the cropping module randomly horizontally flips and randomly rotates the fused features output by the CBAM attention module and then crops them into multiple face patches of fixed size; the fusion feature extraction module extracts features from the face patches, and the fully connected layer outputs the detection result based on the features output by the fusion feature extraction module;

the face liveness detection model is trained on a collected dataset by minimizing a loss function; each sample of the dataset contains the valid face image sequence and the several modalities obtained from a color image sequence of one of N patch types, together with the corresponding true label; the N patch types consist of k live (real face) classes and N-k spoof classes; the loss function includes an asymmetric margin-based classification loss and a self-supervised similarity loss.

2. The method according to claim 1, characterized in that a pre-trained color image face detection model is used to detect face candidate boxes and facial keypoints in the color image sequence to obtain the initial face box and facial keypoint information; the color image face detection model is built on the SCRFD face detection algorithm.

3. The method according to claim 1, characterized in that the moiré modality is extracted from the valid face image sequence using a pre-trained MoireDet network.

4. The method according to claim 1, characterized in that the depth map modality is extracted from the valid face image sequence using a pre-trained PRNet network.

5. The method according to claim 1, characterized in that the dense optical flow modality is extracted as follows: a pyramid-based optical flow algorithm computes, from the first and last frames of the valid face image sequence, the motion vector of each pixel between the two frames, and the dense optical flow modality is obtained from the magnitude and direction of these motion vectors.

6. The method according to claim 1, characterized in that the temporal pooling modality is extracted as follows: the Rank Pooling method encodes the valid face image sequence into a single dynamic/temporal image, which is the temporal pooling modality.

7. The method according to claim 1, characterized in that the artificial modality feature extraction module adopts ResNet18, Light CNN, Xception or MobileNet.

8. The method according to claim 1, characterized in that the asymmetric margin-based classification loss is:

$$\mathcal{L}_{AMC}=-\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\left(\cos\theta_{y_i}-m_{y_i}\right)}}{e^{s\left(\cos\theta_{y_i}-m_{y_i}\right)}+\sum_{j\neq y_i}e^{s\cos\theta_{j}}},\qquad m_{y_i}=\begin{cases}m_l, & y_i\in L\\ m_s, & y_i\in S\end{cases}$$

with $\cos\theta_{j}=W_j^{\top}f_i^t\big/\left(\lVert W_j\rVert_2\,\lVert f_i^t\rVert_2\right)$, where the subscript $i$ denotes the $i$-th sample; $m_l$ and $m_s$ are the angular margins of the live and spoof classes; $f_i^t$ is the patch feature of the $t$-th face patch extracted by the fusion feature extraction module, $t\in P$, with $P$ the number of face patches produced by the cropping module; $W_j$ is the $j$-th column of the fully connected layer; $y_i$ is the true label of the $i$-th sample and $W_{y_i}$ the corresponding column; $s$ is a scaling hyperparameter; $f_i^{t_1}$ and $f_i^{t_2}$ are any two different patch features from the same sample, $t_1,t_2\in P$; $n$ is the number of samples; $L$ is the set of $k$ live classes and $S$ the set of $N-k$ spoof classes;

and the self-supervised similarity loss is:

$$\mathcal{L}_{sim}=\frac{1}{n}\sum_{i=1}^{n}\left\lVert\frac{f_i^{t_1}}{\lVert f_i^{t_1}\rVert_2}-\frac{f_i^{t_2}}{\lVert f_i^{t_2}\rVert_2}\right\rVert_2^2$$

where $\lVert\cdot\rVert_2$ denotes the L2 norm.

9. A face liveness detection system based on artificial multimodality and fine-grained patches, used to implement the method of any one of claims 1-8, characterized by comprising:

an acquisition camera (201), for acquiring a color image sequence of the target area;

an image detection module (202), comprising a color face detection unit (2021) and an image preprocessing unit (2022), wherein the color face detection unit (2021) detects faces and obtains the initial face box and facial keypoint information, and the image preprocessing unit (2022) performs face alignment on the collected color image sequence based on the facial keypoint information;

an image cropping module (203), which crops the aligned color image sequence based on the obtained initial face box to obtain the valid face image sequence;

an image artificial modality generation module (204), which extracts several modalities from the valid face image sequence;

a liveness judgment module (205), which includes a pre-trained face liveness detection model and inputs the valid face image sequence and the several modalities into it to obtain the detection result.

10. The system according to claim 9, characterized in that the image artificial modality generation module (204) comprises:

a moiré modality generation unit (2041), for generating the moiré modality from the valid face image sequence;

a depth map modality generation unit (2042), for generating the depth map modality from the valid face image sequence;

a dense optical flow modality generation unit (2043), for generating the dense optical flow modality from the valid face image sequence;

a temporal pooling modality generation unit (2044), for generating the temporal pooling modality from the valid face image sequence.
CN202311758536.5A 2023-12-20 2023-12-20 Face liveness detection method and system based on artificial multimodality and fine-grained patches Pending CN117854160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311758536.5A 2023-12-20 2023-12-20 CN117854160A (en) Face liveness detection method and system based on artificial multimodality and fine-grained patches


Publications (1)

Publication Number Publication Date
CN117854160A (en) 2024-04-09

Family

ID=90544162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311758536.5A Pending CN117854160A (en) Face liveness detection method and system based on artificial multimodality and fine-grained patches

Country Status (1)

Country Link
CN (1) CN117854160A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118570885A * 2024-07-30 2024-08-30 厦门瑞为信息技术有限公司 Liveness detection method and device based on patch amplitude and entropy guidance strategy
CN118570885B 2024-07-30 2025-03-04 厦门瑞为信息技术有限公司 Liveness detection method and device based on patch amplitude and entropy guidance strategy

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
US20210182539A1 (en) Methods and systems for facial recognition using motion vector trained model
CN112906545B (en) Real-time action recognition method and system for multi-person scene
Parashar et al. Deep learning pipelines for recognition of gait biometrics with covariates: a comprehensive review
CN109508663A (en) A kind of pedestrian's recognition methods again based on multi-level supervision network
Ahmad et al. Person re-identification without identification via event anonymization
Hashemifard et al. A compact deep learning model for face spoofing detection
CN117975533A (en) Video identity recognition method and system based on face contour
Qi et al. A real-time face detection method based on blink detection
CN117854160A (en) Face liveness detection method and system based on artificial multimodality and fine-grained patches
Khan et al. Robust head detection in complex videos using two-stage deep convolution framework
Satapathy et al. A lite convolutional neural network built on permuted Xceptio-inception and Xceptio-reduction modules for texture based facial liveness recognition
Wang et al. Dense hybrid attention network for palmprint image super-resolution
Sun et al. Danet: Dynamic attention to spoof patterns for face anti-spoofing
Sun et al. Towards more accurate and complete iris segmentation using hybrid transformer u-net
Li et al. Diffusing the liveness cues for face anti-spoofing
WO2021258284A1 (en) Edge processing data de-identification
Xue et al. A hierarchical multi-modal cross-attention model for face anti-spoofing
Wadhwa et al. FA-Net: A Deep Face Anti-Spoofing Framework using Optical Maps
CN114882553A (en) Micro-expression recognition method and system based on deep learning
Badave et al. Performance analysis of light illuminations and image quality variations and its effects on face recognition
CN114663938A (en) Vehicle-mounted system security authentication management method considering face fraud
Marais et al. Facial liveness and anti-spoofing detection using vision transformers
CN113869151A (en) Cross-view gait recognition method and system based on feature fusion
Li et al. Intelligent terminal face spoofing detection algorithm based on deep belief network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination