CN106980811A - Facial expression recognition method and facial expression recognition device - Google Patents
- Publication number
- CN106980811A (application CN201610921132.7A)
- Authority
- CN
- China
- Prior art keywords
- face
- frame
- image
- human face
- facial image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
Description
Technical Field
The present invention relates to the technical field of image recognition, and in particular to a facial expression recognition method and a facial expression recognition device.
Background Art
Facial expression recognition refers to assigning an expression category (for example anger, disgust, happiness, sadness, fear, or surprise) to a given face image. Facial expression recognition technology is showing broad application prospects in fields such as human-computer interaction, clinical diagnosis, distance education, and criminal investigation and interrogation, and is a popular research direction in computer vision and artificial intelligence.
One existing facial expression recognition method is based on a deep convolutional neural network: after face detection and calibration, the calibrated face image is fed into a trained deep convolutional neural network for expression recognition. In this method, the deep convolutional neural network is trained on single-frame images. Because a facial expression is closely tied to the scene context and strongly related to the subject's neutral expression, this method has difficulty recognizing neutral expressions accurately, and its overall recognition performance is poor.
Summary of the Invention
The present invention provides a facial expression recognition method and a facial expression recognition device for improving the recognition performance of facial expressions.
A first aspect of the present invention provides a facial expression recognition method, comprising:
acquiring a face image sequence to be recognized, the face image sequence comprising a single frame or two or more frames of face images;
preprocessing each frame of face image in the face image sequence;
inputting each preprocessed frame of face image into a trained model for expression recognition to obtain an expression recognition result for the face image sequence;
wherein, from input to output, the trained model is built in sequence from a convolutional neural network model, a long short-term memory recurrent neural network model, a first pooling layer, and a logistic regression model, and the model is trained on sets of consecutive frame images labeled with expression categories.
A second aspect of the present invention provides a facial expression recognition device, comprising:
an image acquisition unit, configured to acquire a face image sequence to be recognized, the face image sequence comprising a single frame or two or more frames of face images;
an image preprocessing unit, configured to preprocess each frame of face image in the face image sequence;
a recognition processing unit, configured to input each preprocessed frame of face image into a trained model for expression recognition to obtain an expression recognition result for the face image sequence;
wherein, from input to output, the trained model is built in sequence from a convolutional neural network model, a long short-term memory recurrent neural network model, a first pooling layer, and a logistic regression model, and the model is trained on sets of consecutive frame images labeled with expression categories.
As can be seen from the above, the present invention builds a training model based on a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) model and uses sets of consecutive frame images (for example, video) as its training input. This allows the model to fully exploit the dynamic information of facial expression changes and to automatically learn a subject's neutral expression as well as the mapping relationships between expression features under different poses, thereby improving the model's prediction accuracy and robustness and, in turn, the recognition performance for facial expressions.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1-a is a schematic flowchart of an embodiment of a facial expression recognition method provided by the present invention;
Fig. 1-b is a schematic structural diagram of an embodiment of a training model applied to the facial expression recognition method shown in Fig. 1-a;
Fig. 1-c is a schematic diagram of the temporal processing flow of the training model shown in Fig. 1-b in one application scenario;
Fig. 1-d is a schematic diagram of the temporal processing flow of the training model shown in Fig. 1-b in another application scenario;
Fig. 1-e is a schematic structural diagram of an LSTM-RNN model applied to the training model shown in Fig. 1-b;
Fig. 1-f is a schematic structural diagram of a CNN model applied to the training model shown in Fig. 1-b;
Fig. 2 is a schematic structural diagram of an embodiment of a facial expression recognition device provided by the present invention.
Detailed Description
To make the objectives, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
This embodiment of the present invention provides a facial expression recognition method. As shown in Fig. 1-a, the facial expression recognition method includes:
Step 101: acquire a face image sequence to be recognized.
The face image sequence comprises a single frame or two or more frames of face images. That is, the facial expression recognition method in this embodiment can recognize consecutive multi-frame face images (for example, video) and is also compatible with recognizing a single-frame face image.
In step 101, the face image sequence to be recognized may be captured in real time by a camera, received from an external device, or obtained based on the user's selection from an existing image or video database; this is not limited here.
Step 102: preprocess each frame of face image in the face image sequence.
After the face image sequence to be recognized is acquired in step 101, each frame of face image in the sequence is preprocessed so that the preprocessed images are better suited to subsequent expression recognition. In different application scenarios, correspondingly different preprocessing methods may be used.
For example, in one embodiment, preprocessing each frame of face image in the face image sequence may include the following two steps:
Step 1: perform face detection on each frame of face image to determine the face region. Face detection can be implemented with various algorithms, for example the Adaboost face detection algorithm based on Haar-like features. Based on such an algorithm, the input image (that is, each frame of face image) is scanned with a window of appropriate size and an appropriate step size until the face region (the region where the face is located) is determined.
Step 2: detect key feature points in the face region, and align and calibrate the corresponding face image based on the detected key feature points. On the basis of face detection, the positions of key feature points (for example the eyes, eyebrows, nose, mouth, and facial contour) in the face region are further determined. Based on the detected key feature points, the corresponding face image can be aligned and calibrated through a rigid-body transformation so that the positions of the key feature points are essentially consistent across images. In this embodiment, a landmark method may be used for the alignment and calibration of face images. In addition, during alignment and calibration, the positions of the key feature points may also be adjusted according to a preset face model.
Further, to prevent inconsistent image sizes from affecting the recognition result, the preprocessing may also include the following step: edit the face images aligned and calibrated in step 2 according to a preset template to obtain face images of uniform size, where the editing includes one or both of cropping and scaling. For example, based on the detected key feature points in the face region, the corresponding face image is cropped according to a uniform template and scaled to a uniform size.
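The cropping and scaling step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the face box is assumed to come from a detector such as the Haar/Adaboost one mentioned in step 1, the image is a plain 2-D list of grayscale pixel values, and nearest-neighbour scaling stands in for whichever interpolation a real implementation would use.

```python
def crop_and_resize(image, box, out_h, out_w):
    """Cut the detected face region out of a frame and scale it to a
    uniform template size (nearest-neighbour interpolation)."""
    top, left, h, w = box  # face region, assumed supplied by a face detector
    face = [row[left:left + w] for row in image[top:top + h]]
    resized = []
    for r in range(out_h):
        src_r = min(h - 1, r * h // out_h)  # source row for output row r
        resized.append([face[src_r][min(w - 1, c * w // out_w)]
                        for c in range(out_w)])
    return resized
```

A real pipeline would run this after detection and landmark-based alignment, so that every frame fed to the CNN has the same size.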
It should be noted that if the face image sequence contains a single frame of face image, preprocessing each frame of face image in the sequence amounts to preprocessing that single frame; if the sequence contains two or more frames of face images, it amounts to preprocessing each of those frames separately.
Step 103: input each preprocessed frame of face image into the trained model for expression recognition to obtain an expression recognition result for the face image sequence.
In step 103, each frame of face image preprocessed in step 102 is input into the trained model for expression recognition, yielding the expression recognition result for the face image sequence. The result indicates the expression category to which the sequence belongs; the available expression categories may include, but are not limited to: angry, calm, confused, disgusted, happy, sad, scared, surprised, squinting, and screaming.
In this embodiment, as shown in Fig. 1-b, from input to output the training model is built in sequence from a convolutional neural network (CNN) model, a long short-term memory recurrent neural network model (the LSTM-RNN model), a first pooling layer, and a logistic regression model, and the model is trained on sets of consecutive frame images labeled with expression categories. Because the model is trained on labeled consecutive frames, it can, first, automatically learn time-scale dependencies, fully exploiting the dynamic information of facial expression changes and relating each frame to the frames before and after it, which makes expression recognition more robust; second, precisely characterize neutral expressions, eliminating the effects of differences in expression tension and intensity between subjects and improving recognition accuracy; and third, still recognize expressions even when the input image sequence is distorted, since each frame in a consecutive-frame set is strongly correlated with the labeled expression category.
Optionally, the first pooling layer may be an average pooling layer, a max pooling layer, or another type of pooling layer; this is not limited here.
Optionally, if the face image sequence contains two or more frames of face images, inputting the preprocessed frames into the trained model includes: using the first pooling layer to jointly reduce the dimensionality of the face feature vectors of the frames output by the long short-term memory recurrent neural network model, obtaining a dimension-reduced face feature vector, and outputting the dimension-reduced face feature vector to the logistic regression model. The temporal processing flow of the training model is described below using consecutive frame images (that is, an input face image sequence containing two or more frames of face images) as an example; see the schematic in Fig. 1-c. Here X0, X1, ..., Xn are the frames of a video n frames long. The face feature vector extracted from each frame by the CNN module is fed into the LSTM module in chronological order; the face feature vectors h0, h1, ..., hn output by the LSTM module at different times are jointly reduced in dimensionality by the first pooling layer to obtain the face feature vector h used for expression classification; finally, h is input into the logistic regression model, and logistic regression yields the expression recognition result for the consecutive frames. When the input face image sequence is a single frame of face image (that is, n = 1 above), the temporal processing flow in Fig. 1-c simplifies to that shown in Fig. 1-d.
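The tail of this flow, averaging the per-timestep LSTM outputs h0, ..., hn into a single vector h and classifying h with a logistic-regression (softmax) head, can be sketched as below. The parameter layout is hypothetical; the patent specifies the structure, not concrete weights.

```python
import math

def average_pool(h_seq):
    """First pooling layer: average the per-timestep LSTM output
    vectors h0..hn into one feature vector h."""
    dim = len(h_seq[0])
    return [sum(h[k] for h in h_seq) / len(h_seq) for k in range(dim)]

def softmax_classify(h, class_weights, class_biases):
    """Logistic-regression head: one score per expression category,
    normalized into a probability distribution over the categories."""
    scores = [sum(w * x for w, x in zip(w_c, h)) + b_c
              for w_c, b_c in zip(class_weights, class_biases)]
    mx = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The predicted category would then be the index of the largest probability.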
In this embodiment, the structure of the LSTM-RNN model contained in the training model may be as shown in Fig. 1-e, comprising an input gate, a forget gate, an output gate, a state unit (cell), and the LSTM-RNN model output.
When the input face image sequence contains two or more frames of face images, the processing of the input gate, the forget gate, the output gate, the state unit, and the LSTM-RNN model output can be implemented by the following formulas, respectively:
i_t = σ(W_ix·x_t + W_im·m_(t-1) + W_ic·c_(t-1) + b_i);
f_t = σ(W_fx·x_t + W_fm·m_(t-1) + W_fc·c_(t-1) + b_f);
c_t = f_t ⊙ c_(t-1) + i_t ⊙ σ(W_cx·x_t + W_cm·m_(t-1) + b_c);
o_t = σ(W_ox·x_t + W_om·m_(t-1) + W_oc·c_(t-1) + b_o);
m_t = o_t ⊙ h(c_t).
In the above formulas, x_t is the face feature vector input at time t; the W terms (W_ix, W_im, W_ic, W_fx, W_fm, W_fc, W_cx, W_cm, W_ox, W_om, and W_oc) are preset weight matrices, meaning that each element of each gate is obtained from data of the corresponding dimension, i.e. nodes of different dimensions do not interfere with each other; the b terms (b_i, b_f, b_c, b_o) are preset bias vectors; i_t, f_t, o_t, c_t, and m_t are the states at time t of the input gate, the forget gate, the output gate, the state unit, and the LSTM-RNN model output, respectively; ⊙ is the elementwise product, σ() is the sigmoid function, and h() is the output activation function of the state unit, which may specifically be the tanh function.
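As an illustration only, one step of the gate equations above can be written out in plain Python, treating every weight matrix as diagonal (elementwise), in line with the remark that nodes of different dimensions do not interfere, and taking h() to be tanh as the text suggests. The weight and bias containers are hypothetical scaffolding, not values from the patent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W, b):
    """One step of the LSTM cell per the formulas above.
    W maps subscript names ('ix', 'im', ...) to per-dimension
    coefficient lists; b maps 'i', 'f', 'c', 'o' to bias lists."""
    m_t, c_t = [], []
    for k in range(len(x_t)):
        i = sigmoid(W['ix'][k] * x_t[k] + W['im'][k] * m_prev[k]
                    + W['ic'][k] * c_prev[k] + b['i'][k])       # input gate
        f = sigmoid(W['fx'][k] * x_t[k] + W['fm'][k] * m_prev[k]
                    + W['fc'][k] * c_prev[k] + b['f'][k])       # forget gate
        g = sigmoid(W['cx'][k] * x_t[k] + W['cm'][k] * m_prev[k]
                    + b['c'][k])                                # cell input (σ, per the formulas)
        c = f * c_prev[k] + i * g                               # state unit update
        o = sigmoid(W['ox'][k] * x_t[k] + W['om'][k] * m_prev[k]
                    + W['oc'][k] * c_prev[k] + b['o'][k])       # output gate (uses c_(t-1), as above)
        m_t.append(o * math.tanh(c))                            # m_t = o_t ⊙ h(c_t)
        c_t.append(c)
    return m_t, c_t
```

Note that the cell-input nonlinearity here is σ, matching the formulas; many LSTM variants use tanh at that point instead.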
Optionally, when the input face image sequence contains a single frame of face image, the processing of the input gate, the forget gate, the output gate, the state unit, and the LSTM-RNN model output can be simplified to the following formulas:
i_t = σ(W_ix·x_t + constant_1);
f_t = σ(W_fx·x_t + constant_2);
c_t = f_t ⊙ c_(t-1) + i_t ⊙ σ(W_cx·x_t + constant_3);
o_t = σ(W_ox·x_t + W_om·m_(t-1) + constant_4);
m_t = o_t ⊙ h(c_t).
In the above formulas, x_t is the face feature vector input at time t; the W terms (W_ix, W_fx, W_cx, W_ox, and W_om) are preset weight matrices, meaning that each element of each gate is obtained from data of the corresponding dimension, i.e. nodes of different dimensions do not interfere with each other; constant_1 through constant_4 are preset constants; i_t, f_t, o_t, c_t, and m_t are the states at time t of the input gate, the forget gate, the output gate, the state unit, and the LSTM-RNN model output, respectively; ⊙ is the elementwise product, σ() is the sigmoid function, and h() is the output activation function of the state unit, which may specifically be the tanh function.
Optionally, as shown in Fig. 1-f, from input to output the CNN model is built in sequence from a first convolutional layer, a second pooling layer, a second convolutional layer, and a third pooling layer. Inputting the preprocessed frames of face images into the trained model then includes outputting the face feature vector obtained after processing by the third pooling layer to the LSTM-RNN model. The second pooling layer and the third pooling layer may each be an average pooling layer, a max pooling layer, or another type of pooling layer; this is not limited here. Of course, in other embodiments the CNN model may also be constructed with reference to an existing CNN model, which is likewise not limited here.
The process of training the model with sets of consecutive frame images labeled by expression category is described below:

1. Collect one or more sets of consecutive frame images (each set may consist of consecutive frames, e.g. a video) together with the expression category each set belongs to (all images in the same set share the same category), and label each set's category as the output expected from the training model. In this embodiment of the present invention, multiple expression categories may be predefined (e.g. angry, calm, confused, disgusted, happy, sad, afraid, surprised, squinting, and screaming), each corresponding to a mapping value.
2. Preprocess the images in each set (for the preprocessing procedure, refer to the description of step 102; it is not repeated here).
3. Feed the preprocessed images into the training model and train it with the backpropagation algorithm, so that the deviation between the value the model outputs for an input image and the mapping value of that image's expression category falls within a preset tolerance.

Of course, the training process may also be implemented with reference to other existing technical solutions, which is not limited here.
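As a hedged illustration of the stopping criterion in step 3 (the category names, mapping values, and tolerance below are assumptions; the patent only requires that each category have some mapping value and that a tolerance be preset), each expression category maps to a scalar target, and training continues until the model output is within the tolerance of that target:

```python
# Hypothetical mapping from expression categories to target values.
EXPRESSION_MAP = {
    "angry": 0, "calm": 1, "confused": 2, "disgusted": 3, "happy": 4,
    "sad": 5, "afraid": 6, "surprised": 7, "squinting": 8, "screaming": 9,
}

def within_tolerance(model_output: float, category: str, tol: float = 0.5) -> bool:
    """Training stop criterion: |output - mapping value| <= preset tolerance."""
    target = EXPRESSION_MAP[category]
    return abs(model_output - target) <= tol

# An output of 4.3 for a clip labeled "happy" (target 4.0) deviates by 0.3,
# inside an assumed tolerance of 0.5; an output of 6.0 is not.
print(within_tolerance(4.3, "happy"))   # True
print(within_tolerance(6.0, "happy"))   # False
```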
It should be noted that the facial expression recognition method in this embodiment of the present invention may be executed by a facial expression recognition device, and that device may be integrated in a robot, a monitoring terminal, or another terminal; no limitation is imposed here.
As can be seen from the above, the facial expression recognition method in this embodiment of the present invention builds its training model on an LSTM-RNN and uses sets of consecutive frame images (e.g. video) as training input. This lets the model fully exploit the dynamic information in changing facial expressions to automatically learn the mapping between a subject's neutral expression and the expression features of different poses, improving the model's prediction accuracy and robustness and, in turn, the recognition performance for facial expressions.
Embodiment 2
An example of the present invention provides a facial expression recognition device. As shown in FIG. 2, the facial expression recognition device 200 in this embodiment of the present invention includes:
an image acquisition unit 201, configured to acquire a face image sequence to be recognized, the face image sequence containing a single frame or two or more frames of face images;
an image preprocessing unit 202, configured to preprocess each frame of face image in the face image sequence;
a recognition processing unit 203, configured to feed each preprocessed frame of face image into the trained model for expression recognition and obtain the expression recognition result of the face image sequence;
wherein the training model is built, from input end to output end, of a convolutional neural network (CNN) model, a long short-term memory recurrent neural network (LSTM-RNN) model, a first pooling layer, and a logistic regression model in sequence, and is trained on sets of consecutive frame images labeled with expression categories.
Optionally, the recognition processing unit 203 is specifically configured to: when the face image sequence contains two or more frames of face images, apply the first pooling layer to the per-frame face feature vectors output by the LSTM-RNN model, performing a uniform dimension reduction to obtain a reduced face feature vector; and output the reduced face feature vector to the logistic regression model.
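A minimal sketch of the first pooling layer's role, assuming average pooling over time (the patent also permits max pooling or other variants): the per-frame feature vectors emitted by the LSTM-RNN are collapsed into one fixed-length vector before the logistic regression stage, regardless of how many frames the sequence contains:

```python
from typing import List

def temporal_average_pool(frame_features: List[List[float]]) -> List[float]:
    """Collapse T per-frame feature vectors (T x D) into one D-dim vector."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n_frames for d in range(dim)]

# Three hypothetical 4-dim LSTM-RNN outputs for a 3-frame sequence.
features = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = temporal_average_pool(features)
print(pooled)  # [2.0, 1.0, 2.0, 2.0] -- one vector for the whole sequence
```

Because the pooled vector has the same length whatever the frame count, the downstream logistic regression model needs no knowledge of the sequence length.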
Optionally, the CNN model is built, from input end to output end, of a first convolutional layer, a second pooling layer, a second convolutional layer, and a third pooling layer in sequence; the recognition processing unit 203 is specifically configured to output the face feature vector produced by the third pooling layer to the LSTM-RNN model.
Optionally, the image preprocessing unit 202 is specifically configured to: perform face detection on each frame of face image to determine the face region; detect key feature points in the face region; and align and calibrate the corresponding face image based on the detected key feature points.
Optionally, the image preprocessing unit 202 is further configured to edit the aligned and calibrated face images according to a preset template to obtain face images of uniform size, the editing including one or both of cropping and scaling.
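As a hedged illustration of the alignment and uniform-size steps (the specific key feature points and template size are not fixed by the patent; eye centers and a 96-pixel template are assumptions), a common approach is to rotate the face so the line between the two eye centers becomes horizontal, then scale the crop to the template size:

```python
import math

def eye_alignment_angle(left_eye, right_eye):
    """In-plane rotation angle (degrees) that makes the eye line horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def scale_to_template(face_w, face_h, template=96):
    """Uniform scale factor so the longer face side matches the template size."""
    return template / max(face_w, face_h)

# Hypothetical detected eye centers: the right eye sits 10 px lower.
angle = eye_alignment_angle((30, 40), (70, 50))
print(round(angle, 1))  # ~14.0 degrees of rotation to correct

print(scale_to_template(120, 160))  # 0.6: shrink a 120x160 crop to fit 96x96
```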
It should be noted that the facial expression recognition device in this embodiment of the present invention may be integrated in a robot, a monitoring terminal, or another terminal. For the functions of its modules and their specific implementation, refer to the relevant descriptions in the method embodiments above; they are not repeated here.
As can be seen from the above, the facial expression recognition device in this embodiment of the present invention builds its training model on an LSTM-RNN and uses sets of consecutive frame images (e.g. video) as training input. This lets the model fully exploit the dynamic information in changing facial expressions to automatically learn the mapping between a subject's neutral expression and the expression features of different poses, improving the model's prediction accuracy and robustness and, in turn, the recognition performance for facial expressions.
In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways.
It should be noted that, for brevity of description, the foregoing method embodiments are expressed as series of action combinations, but those skilled in the art will appreciate that the present invention is not limited by the described order of actions, since according to the present invention some steps may be performed in another order or simultaneously. Likewise, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not all necessarily required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the relevant descriptions of the other embodiments.
The above describes a facial expression recognition method and a facial expression recognition device provided by the present invention. Those of ordinary skill in the art may, following the ideas of these embodiments, make changes to the specific implementations and scope of application; in summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610921132.7A CN106980811A (en) | 2016-10-21 | 2016-10-21 | Facial expression recognition method and facial expression recognition device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106980811A true CN106980811A (en) | 2017-07-25 |
Family
ID=59340323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610921132.7A Pending CN106980811A (en) | 2016-10-21 | 2016-10-21 | Facial expression recognition method and facial expression recognition device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980811A (en) |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107801096A (en) * | 2017-10-30 | 2018-03-13 | 广东欧珀移动通信有限公司 | Video playback control method, device, terminal equipment and storage medium |
CN107832746A (en) * | 2017-12-01 | 2018-03-23 | 北京小米移动软件有限公司 | Expression recognition method and device |
CN108062538A (en) * | 2017-12-29 | 2018-05-22 | 成都智宝大数据科技有限公司 | Face identification method and device |
CN108304823A (en) * | 2018-02-24 | 2018-07-20 | 重庆邮电大学 | A kind of expression recognition method based on two-fold product CNN and long memory network in short-term |
CN108461152A (en) * | 2018-01-12 | 2018-08-28 | 平安科技(深圳)有限公司 | Medical model training method, medical recognition methods, device, equipment and medium |
CN108509041A (en) * | 2018-03-29 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | Method and apparatus for executing operation |
CN108510194A (en) * | 2018-03-30 | 2018-09-07 | 平安科技(深圳)有限公司 | Air control model training method, Risk Identification Method, device, equipment and medium |
CN108921024A (en) * | 2018-05-31 | 2018-11-30 | 东南大学 | Expression recognition method based on human face characteristic point information Yu dual network joint training |
CN109034143A (en) * | 2018-11-01 | 2018-12-18 | 云南大学 | The micro- expression recognition method of face based on video amplifier and deep learning |
CN109087380A (en) * | 2018-08-02 | 2018-12-25 | 咪咕文化科技有限公司 | Cartoon motion picture generation method and device and storage medium |
CN109190487A (en) * | 2018-08-07 | 2019-01-11 | 平安科技(深圳)有限公司 | Face Emotion identification method, apparatus, computer equipment and storage medium |
CN109325452A (en) * | 2018-09-26 | 2019-02-12 | 广州大学 | A method for detecting smiling expressions of human faces in videos |
CN109344726A (en) * | 2018-09-05 | 2019-02-15 | 顺丰科技有限公司 | A kind of advertisement placement method and device |
CN109635727A (en) * | 2018-12-11 | 2019-04-16 | 昆山优尼电能运动科技有限公司 | A kind of facial expression recognizing method and device |
CN109697399A (en) * | 2017-10-24 | 2019-04-30 | 普天信息技术有限公司 | A kind of facial expression recognizing method and device |
CN109711310A (en) * | 2018-12-20 | 2019-05-03 | 北京大学 | An automatic prediction system for infant attachment type and its prediction method |
CN109726726A (en) * | 2017-10-27 | 2019-05-07 | 北京邮电大学 | Event detection method and device in video |
CN109766765A (en) * | 2018-12-18 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Audio data push method, device, computer equipment and storage medium |
CN109784144A (en) * | 2018-11-29 | 2019-05-21 | 北京邮电大学 | A kind of kinship recognition methods and system |
CN109815846A (en) * | 2018-12-29 | 2019-05-28 | 腾讯科技(深圳)有限公司 | Image processing method, device, storage medium and electronic device |
CN109829364A (en) * | 2018-12-18 | 2019-05-31 | 深圳云天励飞技术有限公司 | A kind of expression recognition method, device and recommended method, device |
CN110046537A (en) * | 2017-12-08 | 2019-07-23 | 辉达公司 | The system and method for carrying out dynamic face analysis using recurrent neural network |
CN110348271A (en) * | 2018-04-04 | 2019-10-18 | 山东大学 | A kind of micro- expression recognition method based on long memory network in short-term |
CN110674770A (en) * | 2019-09-29 | 2020-01-10 | 上海依图网络科技有限公司 | System and method for facial expression detection |
CN110688874A (en) * | 2018-07-04 | 2020-01-14 | 杭州海康威视数字技术股份有限公司 | Facial expression recognition method and device, readable storage medium and electronic equipment |
CN110887336A (en) * | 2018-09-11 | 2020-03-17 | 东芝生活电器株式会社 | Article taking and placing management system of refrigerator and refrigerator |
CN111183455A (en) * | 2017-08-29 | 2020-05-19 | 互曼人工智能科技(上海)有限公司 | Image data processing system and method |
CN111259697A (en) * | 2018-11-30 | 2020-06-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN111353354A (en) * | 2018-12-24 | 2020-06-30 | 杭州海康威视数字技术股份有限公司 | Human body stress information identification method and device and electronic equipment |
CN111539767A (en) * | 2020-04-24 | 2020-08-14 | 上海极链网络科技有限公司 | Advertisement conversion rate determination method and system based on face recognition |
CN111738160A (en) * | 2020-06-23 | 2020-10-02 | 平安科技(深圳)有限公司 | Video micro-expression recognition method and device, computer equipment and storage medium |
CN111967382A (en) * | 2020-08-14 | 2020-11-20 | 北京金山云网络技术有限公司 | Age estimation method, and training method and device of age estimation model |
CN112364736A (en) * | 2020-10-30 | 2021-02-12 | 深圳点猫科技有限公司 | Dynamic facial expression recognition method, device and equipment |
CN113030060A (en) * | 2019-12-25 | 2021-06-25 | 同方威视技术股份有限公司 | Drug Raman spectrum identification method based on convolutional neural network |
US11954591B2 (en) | 2018-07-05 | 2024-04-09 | Tencent Technology (Shenzhen) Company Limited | Picture set description generation method and apparatus, and computer device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572892A (en) * | 2014-12-24 | 2015-04-29 | 中国科学院自动化研究所 | Text classification method based on cyclic convolution network |
CN105139004A (en) * | 2015-09-23 | 2015-12-09 | 河北工业大学 | Face expression identification method based on video sequences |
CN105447473A (en) * | 2015-12-14 | 2016-03-30 | 江苏大学 | PCANet-CNN-based arbitrary attitude facial expression recognition method |
CN105678250A (en) * | 2015-12-31 | 2016-06-15 | 北京小孔科技有限公司 | Face identification method in video and face identification device in video |
CN105845128A (en) * | 2016-04-06 | 2016-08-10 | 中国科学技术大学 | Voice identification efficiency optimization method based on dynamic pruning beam prediction |
- 2016-10-21: application CN201610921132.7A filed (CN); published as CN106980811A, status Pending
Non-Patent Citations (5)
Title |
---|
DAE HOE KIM et al.: "Micro-Expression Recognition with Expression-State", ACM *
GRAVES, A., et al.: "Facial expression recognition with recurrent neural networks", Proceedings of the International Workshop on Cognition for Technical Systems *
ZHANG DUO: "Fundamentals of Biometric Identification Technology", Wuhan University Press, 31 December 2009 *
WANG HAINING: "Research on Emotion Recognition Technology Based on Multi-Channel Physiological Signals", Hunan University Press, 31 August 2016 *
HUANG FUZHEN et al.: "Face Detection", Shanghai Jiao Tong University Press, 31 December 2006 *
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111183455A (en) * | 2017-08-29 | 2020-05-19 | 互曼人工智能科技(上海)有限公司 | Image data processing system and method |
CN109697399A (en) * | 2017-10-24 | 2019-04-30 | 普天信息技术有限公司 | A kind of facial expression recognizing method and device |
CN109726726B (en) * | 2017-10-27 | 2023-06-20 | 北京邮电大学 | Event detection method and device in video |
CN109726726A (en) * | 2017-10-27 | 2019-05-07 | 北京邮电大学 | Event detection method and device in video |
CN107801096A (en) * | 2017-10-30 | 2018-03-13 | 广东欧珀移动通信有限公司 | Video playback control method, device, terminal equipment and storage medium |
CN107801096B (en) * | 2017-10-30 | 2020-01-14 | Oppo广东移动通信有限公司 | Video playing control method and device, terminal equipment and storage medium |
CN107832746A (en) * | 2017-12-01 | 2018-03-23 | 北京小米移动软件有限公司 | Expression recognition method and device |
US10373332B2 (en) | 2017-12-08 | 2019-08-06 | Nvidia Corporation | Systems and methods for dynamic facial analysis using a recurrent neural network |
CN110046537A (en) * | 2017-12-08 | 2019-07-23 | 辉达公司 | The system and method for carrying out dynamic face analysis using recurrent neural network |
CN110046537B (en) * | 2017-12-08 | 2023-12-29 | 辉达公司 | System and method for dynamic facial analysis using recurrent neural networks |
CN108062538A (en) * | 2017-12-29 | 2018-05-22 | 成都智宝大数据科技有限公司 | Face identification method and device |
WO2019136806A1 (en) * | 2018-01-12 | 2019-07-18 | 平安科技(深圳)有限公司 | Medical model training method and apparatus, medical identification method and apparatus, device, and medium |
CN108461152A (en) * | 2018-01-12 | 2018-08-28 | 平安科技(深圳)有限公司 | Medical model training method, medical recognition methods, device, equipment and medium |
CN108304823A (en) * | 2018-02-24 | 2018-07-20 | 重庆邮电大学 | A kind of expression recognition method based on two-fold product CNN and long memory network in short-term |
CN108304823B (en) * | 2018-02-24 | 2022-03-22 | 重庆邮电大学 | An expression recognition method based on double convolution CNN and long short-term memory network |
CN108509041A (en) * | 2018-03-29 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | Method and apparatus for executing operation |
CN108510194A (en) * | 2018-03-30 | 2018-09-07 | 平安科技(深圳)有限公司 | Air control model training method, Risk Identification Method, device, equipment and medium |
CN110348271A (en) * | 2018-04-04 | 2019-10-18 | 山东大学 | A kind of micro- expression recognition method based on long memory network in short-term |
CN108921024A (en) * | 2018-05-31 | 2018-11-30 | 东南大学 | Expression recognition method based on human face characteristic point information Yu dual network joint training |
CN110688874A (en) * | 2018-07-04 | 2020-01-14 | 杭州海康威视数字技术股份有限公司 | Facial expression recognition method and device, readable storage medium and electronic equipment |
CN110688874B (en) * | 2018-07-04 | 2022-09-30 | 杭州海康威视数字技术股份有限公司 | Facial expression recognition method and device, readable storage medium and electronic equipment |
US11954591B2 (en) | 2018-07-05 | 2024-04-09 | Tencent Technology (Shenzhen) Company Limited | Picture set description generation method and apparatus, and computer device and storage medium |
CN109087380B (en) * | 2018-08-02 | 2023-10-20 | 咪咕文化科技有限公司 | Cartoon drawing generation method, device and storage medium |
CN109087380A (en) * | 2018-08-02 | 2018-12-25 | 咪咕文化科技有限公司 | Cartoon motion picture generation method and device and storage medium |
CN109190487A (en) * | 2018-08-07 | 2019-01-11 | 平安科技(深圳)有限公司 | Face Emotion identification method, apparatus, computer equipment and storage medium |
CN109344726A (en) * | 2018-09-05 | 2019-02-15 | 顺丰科技有限公司 | A kind of advertisement placement method and device |
CN110887336A (en) * | 2018-09-11 | 2020-03-17 | 东芝生活电器株式会社 | Article taking and placing management system of refrigerator and refrigerator |
CN110887336B (en) * | 2018-09-11 | 2021-10-29 | 东芝生活电器株式会社 | Article taking and placing management system of refrigerator and refrigerator |
CN109325452A (en) * | 2018-09-26 | 2019-02-12 | 广州大学 | A method for detecting smiling expressions of human faces in videos |
CN109034143A (en) * | 2018-11-01 | 2018-12-18 | 云南大学 | The micro- expression recognition method of face based on video amplifier and deep learning |
CN109784144A (en) * | 2018-11-29 | 2019-05-21 | 北京邮电大学 | A kind of kinship recognition methods and system |
CN111259697A (en) * | 2018-11-30 | 2020-06-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN109635727A (en) * | 2018-12-11 | 2019-04-16 | 昆山优尼电能运动科技有限公司 | A kind of facial expression recognizing method and device |
CN109766765A (en) * | 2018-12-18 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Audio data push method, device, computer equipment and storage medium |
WO2020125217A1 (en) * | 2018-12-18 | 2020-06-25 | 深圳云天励飞技术有限公司 | Expression recognition method and apparatus and recommendation method and apparatus |
CN109829364A (en) * | 2018-12-18 | 2019-05-31 | 深圳云天励飞技术有限公司 | A kind of expression recognition method, device and recommended method, device |
CN109711310A (en) * | 2018-12-20 | 2019-05-03 | 北京大学 | An automatic prediction system for infant attachment type and its prediction method |
CN111353354A (en) * | 2018-12-24 | 2020-06-30 | 杭州海康威视数字技术股份有限公司 | Human body stress information identification method and device and electronic equipment |
CN111353354B (en) * | 2018-12-24 | 2024-01-23 | 杭州海康威视数字技术股份有限公司 | Human body stress information identification method and device and electronic equipment |
CN109815846A (en) * | 2018-12-29 | 2019-05-28 | 腾讯科技(深圳)有限公司 | Image processing method, device, storage medium and electronic device |
CN110674770A (en) * | 2019-09-29 | 2020-01-10 | 上海依图网络科技有限公司 | System and method for facial expression detection |
CN113030060A (en) * | 2019-12-25 | 2021-06-25 | 同方威视技术股份有限公司 | Drug Raman spectrum identification method based on convolutional neural network |
CN111539767A (en) * | 2020-04-24 | 2020-08-14 | 上海极链网络科技有限公司 | Advertisement conversion rate determination method and system based on face recognition |
CN111738160A (en) * | 2020-06-23 | 2020-10-02 | 平安科技(深圳)有限公司 | Video micro-expression recognition method and device, computer equipment and storage medium |
CN111738160B (en) * | 2020-06-23 | 2024-03-26 | 平安科技(深圳)有限公司 | Video micro-expression recognition method and device, computer equipment and storage medium |
CN111967382A (en) * | 2020-08-14 | 2020-11-20 | 北京金山云网络技术有限公司 | Age estimation method, and training method and device of age estimation model |
CN112364736A (en) * | 2020-10-30 | 2021-02-12 | 深圳点猫科技有限公司 | Dynamic facial expression recognition method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106980811A (en) | Facial expression recognition method and facial expression recognition device | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN110569795B (en) | Image identification method and device and related equipment | |
Moung et al. | Ensemble-based face expression recognition approach for image sentiment analysis | |
Haq et al. | Enhanced real-time facial expression recognition using deep learning | |
Kaluri et al. | An enhanced framework for sign gesture recognition using hidden Markov model and adaptive histogram technique. | |
CN110598587B (en) | Expression recognition network training method, system, medium and terminal combined with weak supervision | |
CN114724224B (en) | A multimodal emotion recognition method for medical care robots | |
CN113076905A (en) | Emotion recognition method based on context interaction relationship | |
CN107622261A (en) | Face age estimation method and device based on deep learning | |
CN109063626A (en) | Dynamic human face recognition methods and device | |
Julina et al. | Facial emotion recognition in videos using hog and lbp | |
KR20210067815A (en) | Method for measuring health condition of user and apparatus therefor | |
Dhimar et al. | Feature extraction for facial age estimation: A survey | |
CN114202792A (en) | Face dynamic expression recognition method based on end-to-end convolutional neural network | |
Tautkutė et al. | Classifying and visualizing emotions with emotional DAN | |
CN114743249B (en) | Recognition model training method, micro-expression recognition method, device, equipment and medium | |
Kumar et al. | Emotion detection through facial expression using deeplearning | |
Sun et al. | Using backpropagation neural network for face recognition with 2D+ 3D hybrid information | |
CN113642446A (en) | Detection method and device based on face dynamic emotion recognition | |
Jaffar et al. | Facial expression recognition in static images for autism children using CNN approaches | |
Ariza et al. | Recognition system for facial expression by processing images with deep learning neural network | |
Jain et al. | Face emotion detection using deep learning | |
Mishra et al. | Real time expression detection of multiple faces using deep learning | |
Kim | Implementation of Access Control System based on Face Prediction and Face Tracking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170725 |