CN112016437A

CN112016437A - A method of living body detection based on key frames of face video

Info

Publication number: CN112016437A
Application number: CN202010870462.4A
Authority: CN
Inventors: 潘瑞晗; 石宇; 周祥东; 罗代建; 邵枭虎; 程俊
Original assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Current assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-12-01
Anticipated expiration: 2040-08-26
Also published as: CN112016437B

Abstract

The invention provides a living body detection method based on a face video key frame, which comprises the following steps: acquiring a region image corresponding to a face in each frame of face image; acquiring face feature key points corresponding to the region images, and acquiring the position variation of the face key points according to the face feature key points in the multi-frame face images; acquiring a face video key frame according to the position variation, performing living body identification on the face video key frame, and outputting an identification result; the invention can effectively improve the accuracy and efficiency of the living body detection.

Description

A method of living body detection based on key frames of face video

技术领域technical field

本发明涉及智能识别领域，尤其涉及一种基于人脸视频关键帧的活体检测方法。The invention relates to the field of intelligent recognition, and in particular to a method for detecting a living body based on a face video key frame.

背景技术Background technique

随着以人脸识别为手段的身份认证系统在日常社会生活中越来越广泛的应用，其安全性时刻面临着挑战，也越来越受到人们的重视。传统的人脸识别方法并不包含验证人脸真实性的模块，仅仅需要输入采集到的人脸图像就可以输出对应的识别结果；且随着数字影像设备的快速兴起以及互联网的蓬勃发展，人脸图像的获取难度也越来越低。这些因素使得攻击者可以轻易地通过某些工具或手段(如打印照片、电子显示屏、三维面具等)得到特定人脸的复制，以此假冒合法用户身份，从而达到入侵系统的目的。在无人值守的封闭应用环境下，这一安全问题尤为突出。As the identity authentication system using face recognition as a means is more and more widely used in daily social life, its security is always facing challenges and more and more attention is paid to it. The traditional face recognition method does not include a module for verifying the authenticity of the face. It only needs to input the collected face image to output the corresponding recognition result; and with the rapid rise of digital imaging equipment and the vigorous development of the Internet, people The difficulty of obtaining face images is also getting lower and lower. These factors make it easy for attackers to obtain a copy of a specific face through some tools or means (such as printed photos, electronic display screens, three-dimensional masks, etc.), so as to pretend to be a legitimate user, so as to achieve the purpose of invading the system. This security problem is particularly prominent in unattended closed application environments.

因此，人脸活体检测技术应运而生，该技术旨在于辨别当前所采集视频或图像中的人脸是活体人脸(有生命的真实人脸)还是假体人脸(冒充真人身份的仿造人脸)，以达到防止不法分子冒用合法用户人脸信息的目的。现今，对于一个典型的人脸识别系统，活体检测是其前端必不可少的一个预处理模块。当系统通过输入设备采集到人脸视频或图像之后，首先进行人脸活体检测，只有在确认当前人脸为真实人脸的条件下才能进行后续的身份验证，因而其安全性得到了极大程度的保障。Therefore, face live detection technology came into being, which aims to distinguish whether the face in the currently collected video or image is a live face (real life face) or a prosthetic face (a fake person pretending to be a real person). face), in order to prevent criminals from fraudulently using the face information of legitimate users. Today, for a typical face recognition system, liveness detection is an essential preprocessing module in its front-end. When the system collects the face video or image through the input device, the face liveness detection is carried out first, and the subsequent identity verification can only be carried out under the condition that the current face is confirmed to be a real face, so its security has been greatly improved. guarantee.

常见的人脸活体检测算法大多着力于图像纹理分析、运动信息分析、交互式辨别等方向，通常可以分为两种实现方式，以单个图像为输入的单帧模型与以视频帧流为输入的多帧模型：单帧模型复杂度相对较低，运算效率较高，但存在一定的误检可能；而多帧模型复杂度相对较高，整体性能较好，但运算效率偏低。Most of the common face live detection algorithms focus on image texture analysis, motion information analysis, interactive identification and other directions. They can usually be divided into two implementation methods, a single-frame model that takes a single image as input and a video frame stream as input. Multi-frame model: The complexity of the single-frame model is relatively low, and the computational efficiency is high, but there is a certain possibility of false detection; while the multi-frame model is relatively high in complexity, with good overall performance, but low computational efficiency.

随着设备成本的降低和影像技术的进步，越来越多的活体检测算法转而以人脸视频作为模型输入，同时将获取到的每一段人脸视频作为单个数据样本。然而，一般所采集到的人脸视频通常都具备极高的帧率，即便是一段长度较短的视频，总共也包含了成百甚至上千帧。由于模型大小的限制，通常需要使用适当的采样策略从人脸视频中抽取出少量的图像帧，以作为输入。其中比较常见的方法有均匀采样、连续采样、随机采样等等，但这些采样策略均不基于任何与人脸相关的特征，也无法确保所提取出的图像帧中包含显著的可用于判别人脸真伪的关键信息，因而使得后续活体检测算法模块的性能受到了潜在的影响。With the reduction of equipment cost and the advancement of imaging technology, more and more live detection algorithms have turned to face video as model input, and each acquired face video is taken as a single data sample. However, generally collected face videos usually have a very high frame rate, and even a short video contains hundreds or even thousands of frames in total. Due to the limitation of model size, it is usually necessary to extract a small number of image frames from face videos as input using an appropriate sampling strategy. Among them, the more common methods include uniform sampling, continuous sampling, random sampling, etc., but these sampling strategies are not based on any features related to faces, nor can they ensure that the extracted image frames contain significant features that can be used to discriminate faces. The key information of authenticity, thus potentially affecting the performance of subsequent live detection algorithm modules.

发明内容SUMMARY OF THE INVENTION

鉴于以上现有技术存在的问题，本发明提出一种基于人脸视频关键帧的活体检测方法，主要解决因采样带来的信息损失，导致识别率不高的问题。In view of the above problems in the prior art, the present invention proposes a method for detecting a living body based on key frames of face video, which mainly solves the problem of low recognition rate caused by information loss caused by sampling.

为了实现上述目的及其他目的，本发明采用的技术方案如下。In order to achieve the above objects and other objects, the technical solutions adopted in the present invention are as follows.

一种基于人脸视频关键帧的活体检测方法，包括：A liveness detection method based on face video key frames, comprising:

获取每帧人脸图像中对应人脸的区域图像；Obtain the area image of the corresponding face in each frame of face image;

获取所述区域图像对应的人脸特征关键点，并根据多帧人脸图像中所述人脸特征关键点获取人脸关键点位置变化量；Obtaining the face feature key points corresponding to the regional images, and obtaining the change amount of the face key point position according to the face feature key points in the multi-frame face images;

根据所述位置变化量获取人脸视频关键帧，并对所述人脸视频关键帧进行活体识别，输出识别结果。Acquire a face video key frame according to the position change amount, perform in vivo recognition on the face video key frame, and output a recognition result.

可选地，获取每帧人脸图像中对应人脸的区域图像之前，获取每帧人脸图像中最大人脸的边界框位置，并根据多帧人脸图像中对应的所述边界框位置获取所述区域图像。Optionally, before acquiring the area image corresponding to the face in each frame of face image, the position of the bounding box of the largest face in each frame of face image is obtained, and the position of the bounding box corresponding to the multiple frames of face images is obtained. the area image.

可选地，设置包含多帧图像中所有所述边框位置的包络边框，并根据所述包络边框的几何中心位置设置图像裁剪边框；Optionally, setting an envelope frame including all the frame positions in the multi-frame images, and setting an image cropping frame according to the geometric center position of the envelope frame;

根据所述图像裁剪边框获取每帧人脸图像中对应人脸的区域图像。A region image corresponding to the face in each frame of the face image is acquired according to the image cropping frame.

可选地，所述图像裁剪边框包括正方形边框。Optionally, the image cropping frame includes a square frame.

可选地，选取一帧人脸图像作为基准帧，获取多帧人脸图像中对应的所述人脸特征关键点与所述基准帧中所述人脸特征关键点的位置差获取所述位置变化量。Optionally, a frame of face image is selected as the reference frame, and the position difference between the corresponding face feature key points in the multi-frame face images and the face feature key points in the reference frame is obtained to obtain the position. amount of change.

可选地，获取所述位置变化量的极值点，根据所述极值点获取对应的所述人脸视频关键帧。Optionally, the extreme point of the position change is acquired, and the corresponding face video key frame is acquired according to the extreme point.

可选地，设置待提取的人脸视频关键帧的数量，当所述极值点的个数大于所述待提取的人脸视频关键帧的数量时，对所述极值点进行归并处理，获取显著极值点，并根据所述显著极值点获取人脸视频关键帧；当所述极值点的个数等于所述待提取的人脸视频关键帧的数量时，直接将极值点对应时刻的人脸图像作为所述人脸视频关键帧；当所述极值点的个数小于所述待提取的人脸视频关键帧的数量时，从非极值点对应的人脸图像中抽取相应数量的人脸图像补足人脸视频关键帧。Optionally, the number of face video key frames to be extracted is set, and when the number of the extreme value points is greater than the number of the face video key frames to be extracted, the extreme value points are merged, Obtain significant extreme points, and obtain face video key frames according to the significant extreme points; when the number of the extreme points is equal to the number of face video key frames to be extracted, the extreme points are directly The face image at the corresponding moment is used as the face video key frame; when the number of the extreme points is less than the number of the face video key frames to be extracted, from the face image corresponding to the non-extreme point Extract the corresponding number of face images to supplement the face video key frames.

可选地，获取所述显著极值点时，根据邻近多个极值点的时间差和在对应人脸图像中的位置差对满足预设条件的极值点进行归并处理。Optionally, when acquiring the significant extremum points, the extremum points that satisfy the preset condition are merged according to the time difference of the adjacent multiple extreme value points and the position difference in the corresponding face image.

可选地，当所述显著极值点的个数大于所述待提取的人脸视频关键帧的数量时，从所述显著极值点中获取与待提取的人脸视频关键帧的数量相对应的显著极值点，并获取对应的人脸视频关键帧；当所述显著极值点的个数等于所述待提取的人脸视频关键帧的数量时，直接获取所述显著极值点对应时刻的人脸图像作为人脸视频关键帧；当所述显著极值点的个数小于所述待提取的人脸视频关键帧的数量时，从非显著极值点对应的人脸图像中抽取相应数量的人脸图像补足人脸视频关键帧。Optionally, when the number of the significant extreme points is greater than the number of the face video key frames to be extracted, obtain from the significant extreme points the number of face video key frames to be extracted. The corresponding significant extreme points are obtained, and the corresponding face video key frames are obtained; when the number of the significant extreme points is equal to the number of the face video key frames to be extracted, the significant extreme points are directly obtained. The face image at the corresponding moment is used as a face video key frame; when the number of the significant extreme points is less than the number of the face video key frames to be extracted, from the face image corresponding to the non-significant extreme point Extract the corresponding number of face images to supplement the face video key frames.

可选地，训练活体识别模型，根据所述活体识别模型的输出概率获取所述识别结果。Optionally, a living body recognition model is trained, and the recognition result is obtained according to the output probability of the living body recognition model.

如上所述，本发明提出了一种基于人脸视频关键帧的活体检测方法，具有以下有益效果。As described above, the present invention proposes a method for detecting a living body based on a face video key frame, which has the following beneficial effects.

通过可见光图像、近红外图像和深度图像相结合进行人脸识别和信息交互，可有效提高人脸识别的安全性和准确性。Face recognition and information interaction through the combination of visible light images, near-infrared images and depth images can effectively improve the security and accuracy of face recognition.

综合考虑多帧人脸图像中人脸区域的位置变化进行关键帧提取，可有效解决传统采样方法信息损失的问题；采用单帧人脸图像进行并行运算，可有效提高检测效率。Comprehensively considering the position changes of face regions in multi-frame face images to extract key frames, it can effectively solve the problem of information loss in traditional sampling methods; using single-frame face images for parallel operations can effectively improve detection efficiency.

附图说明Description of drawings

图1为本发明一实施例中基于人脸视频关键帧的活体检测方法的流程图。FIG. 1 is a flowchart of a method for detecting a living body based on a face video key frame according to an embodiment of the present invention.

图2为本发明一实施例中区域图像示意图。FIG. 2 is a schematic diagram of a region image in an embodiment of the present invention.

图3为本发明一实施例中人脸关键点位置示意图。FIG. 3 is a schematic diagram of the positions of key points of a face in an embodiment of the present invention.

图4为本发明一实施例中极值点示意图。FIG. 4 is a schematic diagram of an extreme point in an embodiment of the present invention.

图5为本发明一实施例中显著极值点示意图。FIG. 5 is a schematic diagram of a significant extreme point in an embodiment of the present invention.

图6为本发明一实施例中获取人脸视频关键帧的示意图。FIG. 6 is a schematic diagram of acquiring a face video key frame according to an embodiment of the present invention.

具体实施方式Detailed ways

以下通过特定的具体实例说明本发明的实施方式，本领域技术人员可由本说明书所揭露的内容轻易地了解本发明的其他优点与功效。本发明还可以通过另外不同的具体实施方式加以实施或应用，本说明书中的各项细节也可以基于不同观点与应用，在没有背离本发明的精神下进行各种修饰或改变。需说明的是，在不冲突的情况下，以下实施例及实施例中的特征可以相互组合。The embodiments of the present invention are described below through specific specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the following embodiments and features in the embodiments may be combined with each other under the condition of no conflict.

需要说明的是，以下实施例中所提供的图示仅以示意方式说明本发明的基本构想，遂图式中仅显示与本发明中有关的组件而非按照实际实施时的组件数目、形状及尺寸绘制，其实际实施时各组件的型态、数量及比例可为一种随意的改变，且其组件布局型态也可能更为复杂。It should be noted that the drawings provided in the following embodiments are only used to illustrate the basic concept of the present invention in a schematic way, so the drawings only show the components related to the present invention rather than the number, shape and number of components in actual implementation. For dimension drawing, the type, quantity and proportion of each component can be changed at will in actual implementation, and the component layout may also be more complicated.

请参阅图1，本实施例提供了一种基于人脸视频关键帧的活体检测方法，包括步骤S01-S03。Referring to FIG. 1 , this embodiment provides a method for detecting a living body based on a face video key frame, including steps S01-S03.

在步骤S01中，获取每帧人脸图像中对应人脸的区域图像：In step S01, the region image corresponding to the face in each frame of face image is obtained:

在一实施例中，获取每帧人脸图像中对应人脸的区域图像之前，获取每帧人脸图像中最大人脸的边框位置，并根据多帧人脸图像中对应的边界框位置获取区域图像。具体地，可通过摄像头等图像采集设备采集一段人脸视频，对人脸视频中的每一帧原始图像进行人脸检测，若在任意一帧原始图像中未检测到人脸，则重新采集人脸视频；否则，保存每一帧原始图像中最大人脸的边框位置。In one embodiment, before obtaining the area image corresponding to the face in each frame of the face image, the frame position of the largest face in each frame of the face image is obtained, and the area is obtained according to the corresponding bounding frame position in the multiple frames of face images. image. Specifically, a piece of face video can be collected by an image collection device such as a camera, and face detection is performed on each frame of the original image in the face video. If no face is detected in any frame of the original image, the face is collected again. face video; otherwise, save the border position of the largest face in each frame of the original image.

最大人脸是指图像中像素面积最大的人脸；具体实施中，可以分别检测每一帧人脸图像中的最大人脸作为目标人脸，或者采用一种更好的方案，即先检测第一帧图像中的最大脸作为目标人脸，而后对其位置进行跟踪，可以避免在同时检测到多个人脸时出现目标人脸丢失或错位的情况。The largest face refers to the face with the largest pixel area in the image; in the specific implementation, the largest face in each frame of face image can be detected as the target face, or a better solution can be adopted, that is, to detect the first face first. The largest face in a frame of images is used as the target face, and then its position is tracked, which can avoid the loss or dislocation of the target face when multiple faces are detected at the same time.

在一实施例中，人脸边界框位置可以表示为人脸矩形两个对角顶点上的两对坐标：In one embodiment, the position of the face bounding box can be represented as two pairs of coordinates on the two diagonal vertices of the face rectangle:

[(X₁,Y₁),(X₂,Y₂)][(X ₁ ,Y ₁ ),(X ₂ ,Y ₂ )]

其中，的X与Y分别表示原始图像像素中列与行的索引。Among them, X and Y represent the index of the column and row in the original image pixel, respectively.

在一实施例中，请参阅图2，设置包含多帧图像中所有边框位置3的包络边框2，并根据所述包络边框的几何中心设置设置图像裁剪边框1；根据图像裁剪边框获取每帧人脸图像中对应人脸的区域图像。其中，图像裁剪边框可设置为正方形边框。具体地，根据每一帧人脸图像对应的人脸边界框位置，在同一像素坐标尺度下，得到能够包含多帧人脸图像对应的所有人脸边框的最小矩形包络(即包络边框2)，而后获取能够包含该最小矩形包络且以其为中心的最小正方形区域(即图像裁剪边框1)，按照该最小正方形区域将每一帧原始人脸图像都裁剪为相同像素尺寸大小的正方形，以便于后续输入到深度神经网络中进行活体识别。In one embodiment, referring to FIG. 2 , an envelope frame 2 including all frame positions 3 in the multi-frame image is set, and an image cropping frame 1 is set according to the geometric center of the envelope frame; each image cropping frame is obtained according to the image cropping frame. The area image of the corresponding face in the frame face image. Among them, the image cropping border can be set as a square border. Specifically, according to the position of the face bounding box corresponding to each frame of the face image, under the same pixel coordinate scale, the minimum rectangular envelope (that is, the envelope frame 2) that can contain all the face frames corresponding to the multiple frames of face images is obtained. ), and then obtain the smallest square area that can contain the smallest rectangular envelope and is centered on it (ie, image cropping frame 1), and cut each frame of the original face image into a square of the same pixel size according to the smallest square area , so as to facilitate subsequent input into the deep neural network for living body recognition.

在一实施例中，可根据用于裁剪图像的最小正方形区域在原始图像中的位置，将检测到的人脸边框位置通过坐标平移从原始人脸图像坐标空间转换到裁剪图像坐标空间中。In one embodiment, according to the position of the smallest square area used for cropping the image in the original image, the detected face border position can be transformed from the original face image coordinate space to the cropped image coordinate space through coordinate translation.

具体地，记人脸边框的位置坐标为：Specifically, remember the position coordinates of the face frame as:

则最小矩形包络在原始图像空间中的位置坐标可以表示为：Then the position coordinates of the minimum rectangular envelope in the original image space can be expressed as:

一般情况下，最小正方形区域的中心与最小矩形包络的中心重合，且最小正方形区域的边长与最小矩形包络的长边相等，如果遇到超出原始图像边界等异常情况可以适当调整，将其平移至原始图像边界以内。In general, the center of the smallest square area coincides with the center of the smallest rectangular envelope, and the side length of the smallest square area is equal to the long side of the smallest rectangular envelope. It is translated to within the bounds of the original image.

最终达到的效果是：每一帧原始图像都按照画面中一个固定的正方形区域进行裁剪，且目标人脸始终出现在裁剪图像区域以内，便于后续图像缩放以作为深度神经网络的输入。The final effect is that each frame of the original image is cropped according to a fixed square area in the picture, and the target face always appears within the cropped image area, which is convenient for subsequent image scaling as the input of the deep neural network.

在进行坐标平移转换时，记原始图像的像素坐标原点为(0,0)，转换前的人脸边框位置坐标为：When performing coordinate translation transformation, remember that the pixel coordinate origin of the original image is (0,0), and the position coordinates of the face frame before the transformation are:

若裁剪图像的像素坐标原点在原始图像空间中的坐标为(X₀,Y₀)，则转换后的人脸边界框位置坐标为：If the coordinates of the origin of the pixel coordinates of the cropped image in the original image space are (X ₀ , Y ₀ ), the position coordinates of the transformed face bounding box are:

在步骤S02中，获取区域图像对应的人脸特征关键点，并根据多帧人脸图像中人脸特征关键点获取人脸关键点位置变化量：In step S02, the facial feature key points corresponding to the regional images are obtained, and the position variation of the facial key points is obtained according to the facial feature key points in the multi-frame face images:

在一实施例中，在完成人脸图像裁剪后，选取经过裁剪的一帧人脸图像作为基准帧，获取多帧人脸图像中对应的人脸特征关键点与基准帧中人脸特征关键点的位置差获取位置变化量。具体地，根据坐标平移转换后的人脸边框位置，对每一帧裁剪图像中的区域图像进行人脸关键点检测(人脸关键点的具体位置可参见图3)，将每一帧区域图像对应的所有人脸关键点位置合并保存为矩阵。In one embodiment, after completing the cropping of the face image, one frame of the cropped face image is selected as the reference frame, and the corresponding face feature key points in the multi-frame face images and the face feature key points in the reference frame are obtained. The position difference of gets the position change amount. Specifically, according to the position of the face frame transformed by the coordinate translation, face key point detection is performed on the regional image in each frame of the cropped image (see Figure 3 for the specific position of the face key point), and each frame of regional image The corresponding keypoint positions of all faces are merged and saved as a matrix.

根据得到的人脸关键点位置矩阵，以第一帧人脸图像对应的区域图像为基准帧，计算多帧区域图像中每一帧人脸关键点位置与基准帧人脸关键点位置的差，得到对应每一帧的人脸关键点位置变化，合并保存为人脸关键点位置变化矩阵。According to the obtained face key point position matrix, take the regional image corresponding to the first frame of face image as the reference frame, calculate the difference between the position of the face key point of each frame in the multi-frame regional image and the position of the face key point of the reference frame, The position changes of the face key points corresponding to each frame are obtained, which are merged and saved as the face key point position change matrix.

根据得到的人脸关键点位置变化矩阵，计算每一帧人脸关键点位置变化的模，得到对应每一帧的人脸关键点位置变化量，合并保存为人脸关键点位置变化向量。According to the obtained face key point position change matrix, calculate the modulo of the face key point position change in each frame, obtain the face key point position change amount corresponding to each frame, and save it as a face key point position change vector.

具体地，具体实施中，人脸关键点有不同方案可以选择，分别可取68个点、19个点、5个点等(人脸关键点的具体位置可参见附图3)；设人脸关键点个数为L，则L可取68或19或5等。人脸检测和人脸关键点检测的算法实现可以采用已有的工具包(如OpenCV、dlib等)或其它可用于人脸检测或人脸关键点检测的神经网络模型或算法代码等。Specifically, in the specific implementation, there are different schemes for the face key points to choose from, respectively 68 points, 19 points, 5 points, etc. (for the specific positions of the face key points, please refer to Figure 3); The number of points is L, then L can be 68 or 19 or 5, etc. The algorithm implementation of face detection and face key point detection can use existing toolkits (such as OpenCV, dlib, etc.) or other neural network models or algorithm codes that can be used for face detection or face key point detection.

人脸关键点位置矩阵可以表示为：The face key point position matrix can be expressed as:

其中，

与

表示在第i帧区域图像中第j个人脸关键点所对应的像素空间坐标；in,

and

Indicates the pixel space coordinates corresponding to the jth face key point in the ith frame area image;

人脸关键点位置变化矩阵可以表示为：The face key point position change matrix can be expressed as:

人脸关键点位置变化向量可以表示为：The face key point position change vector can be expressed as:

在步骤S03中，根据位置变化量获取人脸视频关键帧，并对人脸视频关键帧进行活体识别，输出识别结果。In step S03, the key frames of the face video are acquired according to the position change, the living body recognition is performed on the key frames of the face video, and the recognition result is output.

在一实施例中，获取位置变化量的极值点，根据极值点获取对应的人脸视频关键帧。In one embodiment, the extreme point of the position change is acquired, and the corresponding face video key frame is acquired according to the extreme point.

在一实施例中，可设置待提取的人脸视频关键帧的数量，当极值点的个数大于待提取的人脸视频关键帧的数量时，对极值点进行归并处理，获取显著极值点，并根据显著极值点获取人脸视频关键帧；当极值点的个数等于待提取的人脸图像关键帧的数量时，直接将极值点对应时刻的人脸图像作为人脸视频关键帧；当极值点的个数小于待提取的人脸视频关键帧的数量时，从非极值点对应的人脸图像中抽取相应数量的人脸图像补足人脸视频关键帧数量。In one embodiment, the number of face video key frames to be extracted can be set, and when the number of extreme points is greater than the number of face video key frames to be extracted, the extreme points are merged to obtain significant extreme points. value points, and obtain the face video key frames according to the significant extreme points; when the number of extreme points is equal to the number of face image key frames to be extracted, the face image at the time corresponding to the extreme point is directly used as the face Video key frames; when the number of extreme points is less than the number of face video key frames to be extracted, a corresponding number of face images are extracted from the face images corresponding to the non-extreme points to supplement the number of face video key frames.

具体地，可以按照三邻域或五邻域原则抽取极值点，即每一个符合条件的极值点需要满足：在以自身为中心的三帧领域或五帧邻域中取得极大值或极小值；Specifically, extremum points can be extracted according to the three-neighborhood or five-neighborhood principle, that is, each extremum point that meets the conditions needs to satisfy: obtain the maximum value or minimum value;

极值点可以记作(i,‖ΔLM‖_i)的形式，表示人脸关键点位置变化向量在第i帧取得了极值，且对应变化量为‖ΔLM‖_i，即‖ΔLM‖中的第i个元素。The extreme point can be written in the form of (i, ‖ΔLM‖ _i ), which means that the change vector of the key point position of the face has obtained the extremum in the ith frame, and the corresponding change is ‖ΔLM‖ _i , that is, ‖ΔLM‖ in ith element.

在一实施例中，获取显著极值点时，根据邻近多个极值点的时间差和在对应人脸图像中的位置差对满足预设条件的极值点进行归并处理。In one embodiment, when obtaining significant extreme points, the extreme points that meet the preset conditions are merged according to the time difference of the adjacent multiple extreme points and the position difference in the corresponding face image.

极值点之间的时间差是指两个极值点对应时刻之间相隔的帧数；The time difference between extreme points refers to the number of frames between the corresponding moments of two extreme points;

极值点之间的位置差是指两个极值点对应人脸关键点位置变化量之间的差值；The position difference between the extreme points refers to the difference between the position changes of the key points of the face corresponding to the two extreme points;

具体实施中，对于极值点(i,‖ΔLM‖_i)与极值点(j,‖ΔLM‖_j)，若In the specific implementation, for the extreme point (i, ‖ΔLM‖ _i ) and the extreme point (j, ‖ΔLM‖ _j ), if

|i-j|≤α×N|i-j|≤α×N

且and

则将这两个极值点归并到同一簇(其中α为比例系数，可取0.05)。待所有极值点归并完毕，得到若干相对独立的簇，从每一簇中仅随机抽取一个极值点，作为显著极值点(示例可参见图5)。Then the two extreme points are merged into the same cluster (where α is the scale coefficient, which can be 0.05). After all extreme points are merged, several relatively independent clusters are obtained, and only one extreme point is randomly selected from each cluster as a significant extreme point (see Figure 5 for an example).

在一实施例中，请参阅图6，当显著极值点的个数大于待提取的人脸视频关键帧的数量时，从显著极值点中获取与待提取的人脸视频关键帧的数量相对应的显著极值点，并获取对应的人脸视频关键帧；当显著极值点的个数等于待提取的人脸视频关键帧的数量时，直接获取显著极值点对应时刻的人脸图像作为人脸视频关键帧；当显著极值点的个数小于待提取的人脸视频关键帧的数量时，从非显著极值点对应的人脸图像中抽取相应数量的人脸图像补足人脸视频关键帧。由于采集帧率、图像噪声等因素，选出的极值点往往存在一定的冗余，而通过显著极值点的筛选能够在很大程度上去除这部分冗余，最终得到若干彼此之间相对独立的显著极值点。In one embodiment, referring to FIG. 6 , when the number of significant extreme points is greater than the number of face video key frames to be extracted, the number of face video key frames to be extracted is obtained from the significant extreme points. The corresponding significant extreme points are obtained, and the corresponding face video key frames are obtained; when the number of significant extreme points is equal to the number of face video key frames to be extracted, the face at the corresponding moment of the significant extreme points is directly obtained. The image is used as a face video key frame; when the number of salient extreme points is less than the number of face video key frames to be extracted, a corresponding number of face images are extracted from the face images corresponding to the non-salient extreme points to supplement the human face. Face video keyframes. Due to the acquisition frame rate, image noise and other factors, the selected extreme points often have certain redundancy, and this part of redundancy can be removed to a large extent through the screening of significant extreme points, and finally a number of relative points are obtained. Independent significant extreme points.

在一实施例中，训练活体识别模型，根据活体识别模型的输出概率获取所述识别结果。In one embodiment, a living body recognition model is trained, and the recognition result is obtained according to the output probability of the living body recognition model.

具体地，根据所提取出的人脸视频关键帧，将每一关键帧对应时刻的区域图像都缩放到相同的像素尺寸大小，以作为活体识别模型的输入。Specifically, according to the extracted key frames of the face video, the regional images at the corresponding moment of each key frame are scaled to the same pixel size as the input of the living body recognition model.

每一关键帧图像分别输入到预先训练好的用于人脸活体检测的活体识别模型中，每次运算所得到的输出概率作为对应各个关键帧的输出概率，该输出概率表示当前人脸视频关键帧中的最大人脸被判定为真脸的概率，取值范围在0到1之间，趋向于0表示假脸，趋向于1则表示真脸；记录下各输出概率的最大值、最小值和平均值以备用。Each key frame image is input into the pre-trained living body recognition model for face living body detection, and the output probability obtained by each operation is used as the output probability corresponding to each key frame, and the output probability represents the current face video key. The probability that the largest face in the frame is judged to be a real face, the value range is between 0 and 1, tending to 0 means a fake face, and tending to 1 means a real face; record the maximum and minimum values of each output probability and the average value for backup.

根据得到的输出概率最大值、最小值和平均值判断，若输出概率平均值小于假脸阈值，且最大值小于真脸阈值，则提示为假脸；若输出概率平均值大于真脸阈值，且最小值大于假脸阈值，则提示为真脸；若不满足前述两种情况，则重新进行图像采集。Judging according to the obtained maximum value, minimum value and average value of output probability, if the average value of output probability is less than the false face threshold, and the maximum value is less than the true face threshold, it will prompt a false face; if the average output probability is greater than the true face threshold, and If the minimum value is greater than the false face threshold, it indicates that it is a real face; if the above two conditions are not met, the image acquisition is performed again.

具体地，具体实施中，每一关键帧所对应的区域图像所需缩放到的像素尺寸大小可取224×224；Specifically, in the specific implementation, the pixel size to which the regional image corresponding to each key frame needs to be scaled may be 224×224;

可以采用残差网络(ResNet)作为用于人脸活体识别的深度神经网络模型框架。Residual network (ResNet) can be used as a deep neural network model framework for face recognition.

网络输入通道的尺寸大小可以设定为224×224，对应关键帧区域图像的缩放大小；最后一层全连接层输出维度设定为2，对应类别中的真脸与假脸；最后网络通过Softmax操作计算出当前输入的人脸图像被判定为真脸的概率；其它各层的卷积核大小、步长、通道数量、池化核大小、步长、全连接层维度、激活函数等参数可以根据实验效果进行灵活调整。The size of the network input channel can be set to 224×224, which corresponds to the scaling of the image in the key frame area; the output dimension of the last fully connected layer is set to 2, corresponding to the real face and fake face in the category; finally, the network passes Softmax The operation calculates the probability that the currently input face image is judged to be a real face; the convolution kernel size, step size, number of channels, pooling kernel size, step size, fully connected layer dimension, activation function and other parameters of other layers can be It can be adjusted flexibly according to the experimental effect.

所用于人脸活体识别的卷积神经网络框架也可以被同等替代为LeNet、AlexNet、GoogLeNet、VGG等预训练网络结构；The convolutional neural network framework used for face recognition can also be replaced by pre-trained network structures such as LeNet, AlexNet, GoogLeNet, and VGG;

在实际测试前，网络需要使用一定量预先标注的人脸视频关键帧进行模型训练。Before the actual test, the network needs to use a certain amount of pre-annotated face video keyframes for model training.

分别针对每一人脸视频关键帧的模型运算可并行执行，这一操作可以通过预先初始化K个完全相同或彼此共享网络权重的模型来实现。The model operation for each face video key frame can be performed in parallel, which can be achieved by pre-initializing K models that are identical or share network weights with each other.

令假脸阈值为Th_fake，真脸阈值为Th_real，对应各个关键帧的输出概率最大值为P_max，最小值为P_min，平均值为P_avg：Let the fake face threshold be Th _fake , the real face threshold be Th _real , the maximum output probability corresponding to each key frame is P _max , the minimum value is P _min , and the average value is P _avg :

若P_avg<Th_fake，且P_max<Th_real，则判定为假脸；If P _avg <Th _fake , and P _max <Th _real , it is determined to be a fake face;

若P_avg>Th_real，且P_min>Th_fake，则判定为真脸；If P _avg >Th _real , and P _min >Th _fake , it is determined as a real face;

若均不满足以上两种情况，则视为待定，需重新采集人脸视频再行判定。If the above two conditions are not met, it will be regarded as pending, and the face video needs to be collected again for judgment.

具体实施中，用于判定的假脸阈值Th_fake可以设定为0.4，真脸阈值Th_real可以设定为0.6。In a specific implementation, the fake face threshold Th _fake used for determination can be set to 0.4, and the real face threshold Th _real can be set to 0.6.

综上所述，本发明提出了一种基于人脸视频关键帧的活体检测方法，克服了传统视频采样策略的缺陷，能够更好地提取出视频片段中隐含显著判别信息的关键帧，减少了因采样而带来的信息损失；相比于一般的单帧模型而言，综合了输入人脸关键帧的多次判别结果，能够在一定程度上提升单帧活体检测模型的识别正确率，性能也更加鲁棒；相比于一般的多帧模型而言，将各个关键帧拆分开来，采用单帧模型做并行运算，能够在一定程度上减少模型的运算时间，提高检测效率；具有一定的泛用性，其中所用到的活体识别模型可以被绝大多数基于单帧图像的人脸活体识别算法或网络所替代。所以，本发明有效克服了现有技术中的种种缺点而具有较高的产业利用价值。To sum up, the present invention proposes a method for detecting a living body based on key frames of face video, which overcomes the defects of traditional video sampling strategies, and can better extract key frames containing significant discriminant information in video clips, reducing the need for Compared with the general single-frame model, the multiple discrimination results of the input face key frames can be combined, which can improve the recognition accuracy of the single-frame living detection model to a certain extent. The performance is also more robust; compared with the general multi-frame model, splitting each key frame and using the single-frame model for parallel computing can reduce the model's computing time to a certain extent and improve the detection efficiency; To a certain extent, the living body recognition model used can be replaced by the vast majority of face living body recognition algorithms or networks based on single-frame images. Therefore, the present invention effectively overcomes various shortcomings in the prior art and has high industrial utilization value.

上述实施例仅例示性说明本发明的原理及其功效，而非用于限制本发明。任何熟悉此技术的人士皆可在不违背本发明的精神及范畴下，对上述实施例进行修饰或改变。因此，举凡所属技术领域中具有通常知识者在未脱离本发明所揭示的精神与技术思想下所完成的一切等效修饰或改变，仍应由本发明的权利要求所涵盖。The above-mentioned embodiments merely illustrate the principles and effects of the present invention, but are not intended to limit the present invention. Anyone skilled in the art can make modifications or changes to the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those with ordinary knowledge in the technical field without departing from the spirit and technical idea disclosed in the present invention should still be covered by the claims of the present invention.

Claims

1. A living body detection method based on face video key frames is characterized by comprising the following steps:

acquiring a region image corresponding to a face in each frame of face image;

acquiring face feature key points corresponding to the region images, and acquiring the position variation of the face key points according to the face feature key points in the multi-frame face images;

and acquiring a face video key frame according to the position variation, performing living body identification on the face video key frame, and outputting an identification result.

2. The live body detection method based on the key frame of the human face video as claimed in claim 1, wherein before the region image of the corresponding human face in each frame of the human face image is obtained, the position of the bounding box of the largest human face in each frame of the human face image is obtained, and the region image is obtained according to the position of the corresponding bounding box in the plurality of frames of the human face image.

3. The live body detection method based on the key frame of the human face video, according to claim 2, characterized by setting an envelope border containing all the border positions in a multi-frame image, and setting an image clipping border according to the geometric center of the envelope border;

and obtaining a region image corresponding to the face in each frame of face image according to the image cutting frame.

4. The live body detection method based on the key frame of the human face video, according to claim 3, characterized in that the image cropping frame comprises a square frame.

5. The living body detection method based on the face video key frame as claimed in claim 1, wherein a frame of face image is selected as a reference frame, and the position variation is obtained by obtaining the position difference between the corresponding face feature key point in the face images of multiple frames and the face feature key point in the reference frame.

6. The living body detection method based on the face video key frame as claimed in claim 1, wherein an extreme point of the position variation is obtained, and the corresponding face video key frame is obtained according to the extreme point.

7. The living body detection method based on the face video key frames according to claim 6, characterized in that the number of the face video key frames to be extracted is set, when the number of the extreme points is larger than the number of the face video key frames to be extracted, merging processing is performed on the extreme points to obtain significant extreme points, and the face video key frames are obtained according to the significant extreme points; when the number of the extreme points is equal to the number of the face video key frames to be extracted, directly taking the face images at the moment corresponding to the extreme points as the face video key frames; and when the number of the extreme points is less than the number of the face video key frames to be extracted, extracting a corresponding number of face images from the face images corresponding to the non-extreme points to complement the face video key frames.

8. The living body detection method based on the face video key frame as claimed in claim 7, wherein when the significant extreme point is obtained, the extreme points meeting the preset condition are merged according to the time difference between the adjacent extreme points and the position difference in the corresponding face image.

9. The living body detection method based on the face video key frames according to claim 8, wherein when the number of the significant extreme points is larger than the number of the face video key frames to be extracted, the significant extreme points corresponding to the number of the face video key frames to be extracted are obtained from the significant extreme points, and the corresponding face video key frames are obtained; when the number of the significant extreme points is equal to the number of the face video key frames to be extracted, directly acquiring the face images corresponding to the significant extreme points as the face video key frames; and when the number of the significant extreme points is less than the number of the face video key frames to be extracted, extracting a corresponding number of face images from the face images corresponding to the non-significant extreme points to complement the face video key frames.

10. The living body detection method based on the human face video key frame as claimed in claim 1, characterized in that a living body recognition model is trained, and the recognition result is obtained according to the output probability of the living body recognition model.