
CN114529969A - Expression recognition method and system


Info

Publication number: CN114529969A
Application number: CN202210061916.2A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 米建勋, 张美欣
Original/current assignee: Chongqing University of Posts and Telecommunications
Priority/filing date: 2022-01-19
Publication date: 2022-05-24
Legal status: Pending


Classifications

    • G06F18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/02 Computing arrangements based on biological models; neural networks
    • G06N3/08 Neural networks; learning methods
    • G06T2207/20081 Image analysis; special algorithmic details; training, learning
    • G06T2207/20084 Image analysis; special algorithmic details; artificial neural networks [ANN]


Abstract

The invention claims an expression recognition method and system, belonging to the technical field of biometric recognition. The method comprises the steps of: obtaining facial expression image samples and the corresponding category labels; performing face detection and face alignment on the facial expression samples; building a deep neural network model and feeding the facial expression images into it to extract features, obtaining expression features; selecting multiple triplets according to the expression category labels, each triplet containing three samples from different categories; computing a first loss value from the expression features and the triplets; computing a cross-entropy loss from the true category labels and the obtained expression features as a second loss value; and feeding the expression features into a classifier for classification and outputting the classification result. The main purpose of the invention is to readjust the inter-class distances between negative expressions and thereby remedy their poor recognition accuracy.

Description

A method and system for facial expression recognition

Technical Field

The invention belongs to the technical field of biometric recognition, and in particular relates to an expression recognition method and system.

Background Art

Facial expression recognition analyzes an input face image to determine which category of expression it belongs to. Common expression recognition methods usually classify seven basic expressions: happy, surprised, neutral, fearful, disgusted, angry, and sad.

Most existing expression recognition methods use deep convolutional neural networks to extract facial expression features and classify them. An obvious flaw shared by current methods is that accuracy on negative expressions is significantly lower than on positive ones. Positive expressions are those carrying positive emotions, namely happiness and surprise; negative expressions carry negative emotions, namely fear, disgust, anger, and sadness.

There are two reasons for the poor performance on negative expressions:

(1) Negative expression samples are relatively scarce. Almost all expression datasets suffer from this problem: the number of negative expression samples is smaller than that of positive ones, and for large datasets the gap is especially wide. As a result, the neural network cannot learn robust features from the limited negative samples and ultimately fails to judge negative expressions accurately at test time.

(2) Facial features are similar across negative expressions. An expression change is composed of movements at multiple local positions of the face, and the local movements corresponding to negative expressions overlap; for example, both anger and disgust involve frowning. Similar facial movements produce similar features, so the network cannot accurately determine which expression category the current features belong to.

In reality, people are more likely to behave aggressively and impulsively when experiencing negative emotions. In practical applications such as driver emotion monitoring or patient state monitoring, intervention by others is required precisely when the observed person shows a negative expression, so accurately recognizing negative expressions is particularly important.

A search of the prior art finds application publication CN107358169A, a facial expression recognition method and device, comprising: constructing and training an emotion recognition model based on a convolutional neural network; inputting the face image to be recognized into the emotion recognition model to output its emotion category, the category being one of positive, negative, and neutral; obtaining the expression recognition model corresponding to that emotion category; and inputting the face image into that expression recognition model to output its expression category. That invention recognizes facial expressions hierarchically, selecting a different expression recognition model for each emotion category, which reduces what each model must memorize, lowers the computational complexity of the whole recognition process, and improves efficiency.

The method disclosed in CN107358169A first discriminates among the coarse categories of positive, negative, and neutral emotion, and then matches a different recognition model to each coarse category. Since the cause of poor negative expression recognition is the similarity of features across those expressions, and that patent performs no operation targeted at negative expressions but merely splits off an independent recognition model, the negative expression features remain similar inside that model and recognition of negative expressions does not improve. The present invention instead designs a new triplet loss function and, by controlling the category sources of the triplets, concentrates its adjustment on the inter-class distances of negative expressions; this loss function enlarges those distances and thereby makes negative expression features more discriminative.

Application publication CN111353390A describes a deep-learning-based micro-expression recognition method comprising the following steps: 1: crop video data containing expressions, split the video into frames, and extract them; 2: perform face alignment, face cropping, normalization, and other preprocessing on the extracted expression sequences; 3: apply data augmentation to the resulting dataset; 4: build a neural network model; 5: divide all facial expression data proportionally into a training set and a test set; 6: test the model on the test set, outputting recognition accuracy, recognition time, error, and other information, and select the current model once the recognition rate meets the requirement.

The method of CN111353390A improves the neural network architecture, but its focus is on overall recognition performance and it does not address the existing pain point. The present invention, by contrast, targets the poor performance on negative expressions with a loss function specifically suited to separating similar classes. Moreover, the invention does not require changing the network architecture: a classical structure can be used directly, and only the feature values before the fully connected layer need to be recorded, so the method is simple and easy to implement.

Summary of the Invention

The present invention aims to solve the above problems of the prior art by proposing an expression recognition method and system. The technical solution of the present invention is as follows:

An expression recognition method, comprising the following steps:

obtaining facial expression image samples and the corresponding category labels; performing face detection and face alignment on the facial expression samples;

building a deep neural network model and feeding the facial expression images into it to extract features, obtaining expression features;

selecting multiple triplets according to the expression category labels, each triplet containing three samples from different categories;

computing a first loss value from the expression features and the triplets;

computing a cross-entropy loss from the true category labels and the obtained expression features as a second loss value;

feeding the expression features into a classifier for classification and outputting the classification result.

Further, obtaining the facial expression image samples and the corresponding category labels specifically comprises:

using a camera to record videos of faces producing expressions;

observing the time at which an expression appears in the video, extracting the frame corresponding to that time and saving it as an image to obtain a facial expression image sample, and labeling the category of the expression.
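As a rough illustration of this acquisition step, the sketch below pulls the annotated frames out of a recorded video with OpenCV. The function name, the annotation format, and the example frame indices are illustrative assumptions, not details from the patent.

```python
# Hypothetical sketch: extract the frames marked by an observer as showing
# an expression, and pair each frame with its expression label.
import cv2

def extract_expression_frames(video_path, annotations):
    """annotations: list of (frame_index, label) pairs chosen by an observer."""
    cap = cv2.VideoCapture(video_path)
    samples = []
    for frame_idx, label in annotations:
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)  # jump to the annotated frame
        ok, frame = cap.read()
        if ok:
            samples.append((frame, label))  # BGR image plus its category label
    cap.release()
    return samples

# e.g. frames 120 and 480 were marked as expression categories 0 and 5:
# samples = extract_expression_frames("subject01.mp4", [(120, 0), (480, 5)])
```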

Further, building the deep neural network model and feeding the facial expression images into it to extract features, obtaining expression features, specifically comprises:

creating an 18-layer residual network model and initializing the parameters it contains as well as hyperparameters such as the learning rate and the optimizer;

inputting the expression samples into the deep neural network to obtain the feature values before its last layer;

feeding the features before the last layer into the model's final fully connected layer to obtain the final output values.
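A minimal PyTorch sketch of this two-stage extraction follows, taking torchvision's ResNet-18 as the 18-layer residual network; the wrapper class, the 7-way head, and the way the backbone is sliced are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ExpressionNet(nn.Module):
    """ResNet-18 backbone exposing both feature taps described above."""
    def __init__(self, num_classes=7):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # All layers up to, but not including, the final fully connected layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(512, num_classes)  # replaces the original 1000-way head

    def forward(self, x):
        f = self.features(x).flatten(1)  # 512-d feature before the last layer
        logits = self.fc(f)              # final-layer output used for classification
        return f, logits

# feats, logits = ExpressionNet()(torch.randn(8, 3, 224, 224))
```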

Further, selecting multiple triplets according to the expression category labels, each triplet containing three samples from different categories, specifically comprises:

traversing the entire label set and, whenever the three traversed samples all belong to different categories, storing the indices of those three samples, yielding a set of index triplets;

To ensure that the distances among the three sample features in each triplet are unbalanced, combinations whose three samples all belong to positive expressions or all belong to negative expressions are filtered out of the triplet set obtained in the previous step. Negative expressions have intrinsically similar features, and feature similarity is measured by the distance between features: the closer the distance, the greater the similarity, so distances between negative expressions are smaller. The requirement of unbalanced distances means the following: if the selected triplet contains two negative expressions and one positive expression, the distances between the positive expression and each negative expression are relatively large, while the distance between the two negative expressions is relatively small; that is the stated imbalance. In this case the subsequent loss function can push the two negative expressions apart until their distance numerically approaches the positive-negative distances, achieving the goal of readjusting and enlarging the distances between negative expressions so that the positive-negative and negative-negative inter-class distances tend to become equal. If, however, all samples in the selected triplet were negative expressions, there would be no clear initial difference among the sample distances, making it hard to raise the negative-negative inter-class distances to the size of the positive-negative ones; the same holds when the triplet consists solely of positive expressions. This is why triplets belonging entirely to negative or entirely to positive expressions are not selected.
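The selection rule might be sketched as follows, assuming a label coding of 0-1 for the positive expressions, 2 for neutral, and 3-6 for the negative ones; the coding and the function name are assumptions, since the patent only fixes the constraints that the three classes differ and are not all positive or all negative.

```python
# Hypothetical label coding: 0 happy, 1 surprised (positive); 2 neutral;
# 3 fearful, 4 disgusted, 5 angry, 6 sad (negative).
from itertools import combinations

POSITIVE = {0, 1}
NEGATIVE = {3, 4, 5, 6}

def select_triplets(labels):
    triplets = []
    for i, j, k in combinations(range(len(labels)), 3):  # O(n^3), per batch
        trio = {labels[i], labels[j], labels[k]}
        if len(trio) != 3:                 # the three samples must span three classes
            continue
        if trio <= POSITIVE or trio <= NEGATIVE:
            continue                       # drop all-positive / all-negative combinations
        triplets.append((i, j, k))         # store sample indices, as the patent does
    return triplets
```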

Further, computing the first loss value from the expression features and the triplets specifically comprises:

computing one loss value per triplet in the set: according to the indices stored in the triplet, the corresponding three features are extracted from the expression features obtained before the last layer of the network; the three features form three pairwise Euclidean distances, two of which are selected at random each time; driving the ratio of the two distances toward 1, the squared value of the ratio minus 1 is taken as the loss, and the final value of the first loss is the average of the loss values computed over all triplets.

Further, according to the indices stored in each triplet of the resulting set, the corresponding three 512-dimensional feature vectors x1, x2, x3 are extracted from the features obtained before the last layer of the network; three Euclidean distances can be formed from them, and two, d1 and d2, are selected at random each time. The Euclidean distance is computed as:

$$d(x_a, x_b) = \lVert x_a - x_b \rVert_2 = \sqrt{\sum_{k=1}^{n} (x_{ak} - x_{bk})^2}, \qquad n = 512$$

Driving the ratio of d1 to d2 toward 1 serves as the loss function, so that the two inter-class distances d1 and d2 tend to become equal; one loss value is computed per triplet and the average is finally taken. The loss function is as follows, where T is the number of triplets:

$$\mathrm{Loss1} = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{d_1^{(t)}}{d_2^{(t)}} - 1 \right)^2$$

Within any single triplet, this loss function can only drive two of the three pairwise distances toward equality; but because a different pair of distances is selected at random in different triplets, all three distances tend toward equality overall, with the result that all inter-class distances tend to become equal.
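A minimal PyTorch sketch of this first loss could read as follows, assuming `features` holds the pre-last-layer features of the current batch and `triplets` the stored index triples; the small epsilon guarding the denominator is an added assumption for numerical safety.

```python
import random
import torch

def ratio_triplet_loss(features, triplets, eps=1e-8):
    """Loss1: mean over triplets of (d1/d2 - 1)^2 for a random pair of distances."""
    losses = []
    for i, j, k in triplets:
        xi, xj, xk = features[i], features[j], features[k]
        dists = [torch.norm(xi - xj, p=2),   # the three pairwise Euclidean distances
                 torch.norm(xi - xk, p=2),
                 torch.norm(xj - xk, p=2)]
        d1, d2 = random.sample(dists, 2)     # a different random pair per triplet
        losses.append((d1 / (d2 + eps) - 1.0) ** 2)  # push the ratio toward 1
    return torch.stack(losses).mean()        # average over the T triplets
```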

Further, computing the cross-entropy loss from the true category labels and the expression features output by the last layer of the network as the second loss value specifically comprises:

Suppose the output of the last network layer for the i-th expression sample is Pi = (p_i1, p_i2, ..., p_iM) and its true label is Yi = (y_i1, y_i2, ..., y_iM), where M is the number of expression categories; in the true-label representation, if the sample belongs to class j, then y_ij in Yi equals 1 and the remaining entries are 0. The cross-entropy loss is computed as follows, where N is the number of samples:

$$\mathrm{Loss2} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$

According to the orders of magnitude of the two loss values, a coefficient λ scales them to the same order of magnitude before they are added:

Loss = λ*Loss1 + Loss2

The predicted category is computed from the network output Pi: the index corresponding to the maximum value is the predicted category, which is finally output.
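Putting the pieces together, a sketch of the combined objective and the prediction rule might look like this; the value of λ shown is illustrative, since the patent only requires it to bring the two losses to the same order of magnitude, and `ratio_triplet_loss` refers to the sketch above.

```python
import torch
import torch.nn.functional as F

def total_loss(features, logits, labels, triplets, lam=0.1):
    loss1 = ratio_triplet_loss(features, triplets)  # first loss (triplet ratio)
    loss2 = F.cross_entropy(logits, labels)         # second loss (cross entropy)
    return lam * loss1 + loss2                      # Loss = lambda * Loss1 + Loss2

def predict(logits):
    return logits.argmax(dim=1)  # index of the maximum value is the predicted class
```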

An expression recognition system employing any one of the above methods, comprising:

an input module, for inputting facial expression images into the expression recognition system;

an expression acquisition module, for processing the multiple input facial expression images, performing face detection and face alignment on the input image samples, and obtaining facial expression image samples;

a standardization module, for standardizing the facial expression image samples so that they all have the same size, obtaining standardized expression image samples;

a format conversion module, for converting the standardized facial expression images into the tensor format required by the neural network model;

a model management module, for creating, managing, and saving neural network models;

a feature extraction module, for extracting the features corresponding to the expression samples through the neural network model;

a combination generation module, for generating expression triplets that meet the requirements;

a loss calculation module, for computing two loss values, the first from the triplets and the second being the cross-entropy loss computed from the true labels, then combining the two losses and back-propagating gradients according to the loss to update the network parameters;

a prediction module, for predicting the classification result from the features extracted by the feature extraction module and outputting the prediction.

Further, the combination generation module selects triplets that meet the requirements according to the sample labels; each sample in a batch corresponds to a unique index, and a selected triplet consists of the indices of three samples from different categories, the three samples neither all belonging to negative expressions nor all to positive expressions. The indices are stored so that the subsequent loss computation can extract the corresponding features by index;

the loss calculation module computes the two loss functions and adds the two values after an adjustment whose main purpose is to bring the two loss values to the same order of magnitude, ensuring their weights are roughly equal so that both take effect.

The advantages and beneficial effects of the present invention are as follows:

The present invention aims to remedy the poor recognition of negative expressions in existing facial expression recognition methods by providing an expression recognition method and system that enhances feature discriminability by controlling the distances between negative expressions in the feature space, narrowing the accuracy gap between negative and positive expressions.

In traditional neural-network-based expression recognition, the distances between positive expressions, and between positive and negative expressions, are relatively large, while the distances between negative expressions are relatively small; this imbalance of inter-class distances leads to comparatively low accuracy on negative expressions, and the present invention aims to improve it.

The main innovation of the paper Learning Informative and Discriminative Features for Facial Expression Recognition in the Wild is an improved center loss, designed to minimize each sample's distance to its own class center while maximizing its distance to the other class centers. The loss function designed in patent CN113887325A is each sample's distance to its own class center. The present invention differs from both in that their loss functions impose the same constraint on all classes, so after adjustment the inter-class distances remain unbalanced. The main advantage of the present invention is that it adjusts the distances among the three samples of one triplet at a time and controls the category sources of the three selected samples; rather than applying the same operation to all classes, it concentrates on classes that start out close together, namely the negative expressions, and can therefore improve poor negative expression recognition in a targeted way.

Brief Description of the Drawings

Fig. 1 is a flowchart of the expression recognition method of a preferred embodiment of the present invention;

Fig. 2 is a structural diagram of the expression recognition system.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention.

The technical solution of the present invention for solving the above technical problems is as follows:

Embodiment 1:

As shown in Fig. 1, the expression recognition method of the present invention comprises the following steps:

Step 1: Use a camera to collect clear facial expression videos of multiple people, each person covering the seven basic expressions. Extract frames from the collected videos, keeping the frames that carry expressions, to obtain valid facial expression image samples;

Step 2: Perform face detection and facial landmark detection on the samples obtained in step 1. Specifically, the algorithms in the Dlib library are used to obtain a preliminary face region and facial landmarks, after which the faces are aligned by a similarity transformation matrix so that the nose of every face image lies at the center of the picture;
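A rough sketch of this step with the Dlib library follows. The 68-point shape predictor file and the nose-centering translation are assumptions: they stand in for the similarity-transform alignment, whose exact parameters the patent does not specify.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(image, size=224):
    rects = detector(image, 1)                   # preliminary face region
    if not rects:
        return None
    shape = predictor(image, rects[0])           # 68 facial landmarks
    nose = (shape.part(30).x, shape.part(30).y)  # landmark 30: nose tip
    # Translate so the nose tip lands at the image center, a simplified
    # stand-in for the similarity-transform alignment in the patent.
    M = np.float32([[1, 0, size / 2 - nose[0]],
                    [0, 1, size / 2 - nose[1]]])
    return cv2.warpAffine(image, M, (size, size))
```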

Step 3: Label the category of each facial expression image sample obtained in step 2, using 0-6 to represent the seven expressions, yielding a complete expression dataset for subsequently training the network model;

Step 4: Create the deep neural network model and feed all expression image samples into it. The specific steps are as follows:

(1) Standardize the facial expression image samples obtained in step 2, setting the size of all samples to 224×224 to obtain standardized expression image samples;

(2) Divide the standardized expression image samples into multiple sets, each containing different kinds of expressions, and convert the samples into the tensor format required by the neural network, obtaining multiple tensor sets;
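The tensor conversion in (2) might be sketched with torchvision transforms as below; the normalization statistics are the common ImageNet values and are an assumption, as the patent does not specify them.

```python
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),                    # matches the standardized size
    transforms.ToTensor(),                            # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed statistics
                         std=[0.229, 0.224, 0.225]),
])

# batch = torch.stack([to_tensor(Image.open(p).convert("RGB")) for p in paths])
```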

(3) Create an 18-layer residual network model and load network parameters trained on a large face dataset into the current model, obtaining an initialized neural network model;

(4) Feed the tensor sets into the neural network model in turn for feature extraction to obtain the corresponding expression features; before the last layer of the network, the i-th expression sample yields a feature vector xi = (x_i1, x_i2, ..., x_in), where n = 512, i.e., each sample yields a 512-dimensional feature;

(5) Feed the expression features obtained in the previous step into the network's final fully connected layer to obtain the classification information Pi = (p_i1, p_i2, ..., p_iM), where M is the number of categories; each sample yields seven values, representing the likelihood that the sample belongs to each of the seven categories;

Step 5: Select triplets according to the labels corresponding to the current tensor set. The specific operations are as follows:

(1) Traverse the entire label set; whenever the three traversed samples all belong to different categories, store the indices of those three samples, yielding a set of index triplets;

(2) To ensure that the distances among the three sample features in each triplet are unbalanced, filter out of the triplet set obtained in the previous step any combination whose three samples all belong to positive expressions or all belong to negative expressions;

Step 6: According to the indices stored in each triplet of the set obtained in step 5, take the corresponding three 512-dimensional feature vectors from the pre-last-layer features obtained in substep (4) of step 4; three Euclidean distances can be formed from them, and two, d1 and d2, are selected at random each time. The Euclidean distance is computed as:

$$d(x_a, x_b) = \lVert x_a - x_b \rVert_2 = \sqrt{\sum_{k=1}^{n} (x_{ak} - x_{bk})^2}, \qquad n = 512$$

Taking the ratio of d1 to d2 tending to 1 as the loss function, one loss value is computed per triplet and the average is finally taken. The loss function is as follows, where T is the number of triplets:

$$\mathrm{Loss1} = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{d_1^{(t)}}{d_2^{(t)}} - 1 \right)^2$$

Within any single triplet, this loss function only controls two of the three pairwise distances, but a different pair is selected at random in different triplets, so overall all three distances are controlled, i.e., the distances between all categories are controlled;

Step 7: Use the final classification information Pi obtained in substep (5) of step 4, together with the true label Yi = (y_i1, y_i2, ..., y_iM) of the expression sample, to compute the cross-entropy loss; in the true-label representation, if the sample belongs to class j, then y_ij in Yi equals 1 and the remaining entries are 0. The computation is as follows, where N is the number of samples:

$$\mathrm{Loss2} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$

Step 8: According to the orders of magnitude of the two loss values, scale them to the same order via the coefficient λ and add them:

Loss = λ*Loss1 + Loss2

Step 9: Compute the predicted category from the network output Pi: the index corresponding to the maximum value is the predicted category, which is finally output.

Embodiment 2:

As shown in Fig. 2, an expression recognition system comprises:

an input module, for inputting facial expression images and the corresponding category labels into the expression recognition system;

an expression acquisition module, for processing the multiple input facial expression images, performing face detection and face alignment on the input image samples, and obtaining facial expression image samples;

a standardization module, for standardizing the facial expression image samples so that they all have the same size, obtaining standardized expression image samples;

a format conversion module, for converting the standardized facial expression images into the tensor format required by the neural network model;

a model management module, for creating, managing, and saving neural network models;

a feature extraction module, for extracting the features corresponding to the expression samples through the neural network model;

a combination generation module, for generating expression triplets that meet the requirements;

a loss calculation module, for computing two loss values, the first from the triplets and the second being the cross-entropy loss computed from the true labels, and finally combining the two losses;

a prediction module, for predicting the classification result from the features extracted by the feature extraction module and outputting the prediction.

The expression acquisition module obtains the facial expression region from the input image: given the expression image samples, the Dlib library is used to perform face detection and face alignment, yielding normalized facial expression samples. The standardization module standardizes the facial expression samples, normalizing the sample data and setting the size of all samples to 224×224;

The model management module comprises a model creation unit, a model parameter loading unit, and a model storage unit. Specifically, the model creation unit first creates the 18-layer residual network structure; second, the model parameter loading unit obtains network parameters from a model trained on a large face dataset and loads these pretrained parameters into the created model; finally, the model storage unit saves the model parameters that satisfy the conditions across multiple rounds of prediction and training;
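A sketch of the loading-and-saving flow handled by this module follows; the checkpoint file names and the best-accuracy saving criterion are illustrative assumptions.

```python
import torch
from torchvision import models

model = models.resnet18(weights=None)
state = torch.load("face_pretrained_resnet18.pth", map_location="cpu")  # assumed file
model.load_state_dict(state, strict=False)  # strict=False tolerates head mismatches

best_acc = 0.0
def maybe_save(acc, path="best_expression_model.pth"):
    """Save the parameters whenever the model beats its best accuracy so far."""
    global best_acc
    if acc > best_acc:
        best_acc = acc
        torch.save(model.state_dict(), path)
```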

The feature extraction module obtains the features corresponding to the input expression samples: feeding a sample into the neural network model yields its features, which consist of two parts. The first is the output before the last layer of the network, used to compute the first loss value; the second is the output of the last layer, used to compute the cross-entropy loss and the classification result;

The combination generation module selects triplets that meet the requirements. First, according to the sample labels, all triplets that can be formed from samples of different categories are selected; combinations whose three members all belong to positive expressions or all belong to negative expressions are then deleted, yielding a triplet set in which each triplet stores the indices of its selected samples;

The loss calculation module computes the two loss functions. The first is computed over the triplet set produced by the combination generation module: one loss is computed per triplet, and the average over all triplets is taken as the first loss value. The second is the cross-entropy loss, computed from the output of the last network layer and the true labels. Finally, the two values are added after an adjustment whose main purpose is to bring them to the same order of magnitude;

The prediction module predicts the final category using the second part of the features produced by the feature extraction module, and outputs the final prediction.

The systems, devices, modules, or units set forth in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer, which may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

It should also be noted that the terms "comprising," "including," or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The above embodiments should be understood as serving only to illustrate the present invention and not to limit its scope of protection. After reading the contents of the present description, a skilled person may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.

Claims (9)

1. An expression recognition method, characterized by comprising the following steps:

obtaining facial expression image samples and the corresponding category labels; performing face detection and face alignment on the facial expression samples;

building a deep neural network model and feeding the facial expression images into the deep neural network model to extract features, obtaining expression features;

selecting multiple triplets according to the expression category labels, each triplet containing three samples from different categories;

computing a first loss value from the expression features and the triplets;

computing a cross-entropy loss from the true category labels and the obtained expression features as a second loss value;

feeding the expression features into a classifier for classification and outputting the classification result.

2. The expression recognition method according to claim 1, characterized in that obtaining the facial expression image samples and the corresponding category labels specifically comprises:

using a camera to record videos of faces producing expressions;

observing the time at which an expression appears in the video, extracting the frame corresponding to that time and saving it as an image to obtain a facial expression image sample, and labeling the category of the expression.

3. The expression recognition method according to claim 2, characterized in that building the deep neural network model and feeding the facial expression images into it to extract features, obtaining expression features, specifically comprises:

creating an 18-layer residual network model and initializing the parameters it contains as well as hyperparameters such as the learning rate and the optimizer;

inputting the expression samples into the deep neural network to obtain the feature values before its last layer;

feeding the features before the last layer into the model's final fully connected layer to obtain the final output values.

4. The expression recognition method according to claim 3, characterized in that selecting multiple triplets according to the expression category labels, each triplet containing three samples from different categories, specifically comprises:

traversing the entire label set and, whenever the three traversed samples all belong to different categories, storing the indices of those three samples, yielding a set of index triplets;

to ensure that the distances among the three sample features in each triplet are unbalanced, filtering out of the triplet set obtained in the previous step any combination whose three samples all belong to positive expressions or all belong to negative expressions. Negative expressions have intrinsically similar features, and feature similarity is measured by the distance between features: the closer the distance, the greater the similarity, so distances between negative expressions are smaller. The requirement of unbalanced distances means that if the selected triplet contains two negative expressions and one positive expression, the distances between the positive expression and each negative expression are relatively large while the distance between the two negative expressions is relatively small. In this case, the subsequent loss function pushes the two negative expressions apart until their distance numerically approaches the positive-negative distances, readjusting and enlarging the distances between negative expressions so that the positive-negative and negative-negative inter-class distances tend to become equal. If all samples of the selected triplet were negative expressions, there would be no clear initial difference among the sample distances, making it hard to raise the negative-negative inter-class distances to the size of the positive-negative ones; the same holds for triplets consisting solely of positive expressions. This is why such triplets are not selected.

5. The expression recognition method according to claim 4, characterized in that computing the first loss value from the expression features and the triplets specifically comprises:

computing one loss value per triplet in the set: according to the indices stored in the triplet, the corresponding three features are extracted from the expression features obtained before the last layer of the network; the three features form three pairwise Euclidean distances, two of which are selected at random each time; driving the ratio of the two distances toward 1, the squared value of the ratio minus 1 is taken as the loss, and the final value of the first loss is the average of the loss values computed over all triplets.

6. The expression recognition method according to claim 5, characterized in that, according to the indices stored in each triplet of the resulting set, the corresponding three 512-dimensional feature vectors x1, x2, x3 are extracted from the features obtained before the last layer of the network; three Euclidean distances can be formed from them, and two, d1 and d2, are selected at random each time, the Euclidean distance being computed as:

$$d(x_a, x_b) = \lVert x_a - x_b \rVert_2 = \sqrt{\sum_{k=1}^{n} (x_{ak} - x_{bk})^2}, \qquad n = 512$$

Driving the ratio of d1 to d2 toward 1 serves as the loss function, so that the two inter-class distances d1 and d2 tend to become equal; one loss value is computed per triplet and the average is finally taken. The loss function is as follows, where T is the number of triplets:

$$\mathrm{Loss1} = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{d_1^{(t)}}{d_2^{(t)}} - 1 \right)^2$$

Within any single triplet, this loss function can only drive two of the three pairwise distances toward equality; but because a different pair is selected at random in different triplets, all three distances tend toward equality overall, with the result that all inter-class distances tend to become equal.

7. The expression recognition method according to claim 6, characterized in that computing the cross-entropy loss from the true category labels and the expression features output by the last layer of the network as the second loss value specifically comprises:

supposing the output of the last network layer for the i-th expression sample is Pi = (p_i1, p_i2, ..., p_iM) and its true label is Yi = (y_i1, y_i2, ..., y_iM), where M is the number of expression categories and, if the sample belongs to class j, y_ij in Yi equals 1 with the remaining entries 0, computing the cross-entropy loss as follows, where N is the number of samples:

$$\mathrm{Loss2} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$

According to the orders of magnitude of the two loss values, scaling them to the same order via the coefficient λ and adding them:

Loss = λ*Loss1 + Loss2

The predicted category is computed from the network output Pi: the index corresponding to the maximum value is the predicted category, which is finally output.

8. An expression recognition system employing the method of any one of claims 1-7, characterized by comprising:

an input module, for inputting facial expression images into the expression recognition system;

an expression acquisition module, for processing the multiple input facial expression images, performing face detection and face alignment on the input image samples, and obtaining facial expression image samples;

a standardization module, for standardizing the facial expression image samples so that they all have the same size, obtaining standardized expression image samples;

a format conversion module, for converting the standardized facial expression images into the tensor format required by the neural network model;

a model management module, for creating, managing, and saving neural network models;

a feature extraction module, for extracting the features corresponding to the expression samples through the neural network model;

a combination generation module, for generating expression triplets that meet the requirements;

a loss calculation module, for computing two loss values, the first from the triplets and the second being the cross-entropy loss computed from the true labels, then combining the two losses and back-propagating gradients according to the loss to update the network parameters;

a prediction module, for predicting the classification result from the features extracted by the feature extraction module and outputting the prediction.

9. The expression recognition system according to claim 8, characterized in that the combination generation module selects triplets that meet the requirements according to the sample labels; each sample in a batch corresponds to a unique index, and a selected triplet consists of the indices of three samples from different categories, the three samples neither all belonging to negative expressions nor all to positive expressions; the indices are stored so that the subsequent loss computation can extract the corresponding features by index;

the loss calculation module computes the two loss functions and adds the two values after an adjustment whose main purpose is to bring the two loss values to the same order of magnitude, ensuring their weights are roughly equal so that both take effect.



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination