CN115273143A - A head pose estimation method and system

A head pose estimation method and system

Info

Publication number
CN115273143A
Authority
CN
China
Prior art keywords
angle
features
image
network
loss
Prior art date
Legal status
Pending
Application number
CN202210794371.6A
Other languages
Chinese (zh)
Inventor
朱晓亮
杨巧来
杨宗凯
赵亮
戴志诚
荣文婷
何自力
Current Assignee
Central China Normal University
Original Assignee
Central China Normal University
Priority date
Filing date
Publication date
Application filed by Central China Normal University
Priority to CN202210794371.6A
Publication of CN115273143A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/771 Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a head pose estimation method and system. The method comprises: determining an image containing a human face; and inputting the image into a pre-trained hierarchical prediction network to predict the pitch, yaw, and roll angles of the face pose orientation. The network comprises a backbone network, a feature pyramid network, a dimensionality reduction module, and a hierarchical prediction module. The backbone network extracts image spatial features of different sizes; the feature pyramid network fuses these features of different sizes to obtain a fused feature; and the dimensionality reduction module reduces the fused feature to three different dimensionalities, yielding spatial features of three dimensionalities. The hierarchical prediction module comprises three fully connected layers, each of which predicts one angle of the face pose orientation from one of the three spatial features, so that the three angles predicted by the hierarchical prediction network attend to different image regions and mutual interference among the three angle predictions is reduced.

Description

A head pose estimation method and system

Technical Field

The present invention belongs to the field of head pose detection, and more particularly relates to a head pose estimation method and system.

Background

Head pose detection technology has a wide range of applications, such as fatigue detection and autonomous driving. Although depth images enable very good head pose estimation results, estimation methods for RGB images still suffer from discontinuous angle prediction, which limits the application of head pose estimation.

In real environments, subjects often exhibit large facial occlusions and large-angle head deflections, and the background and lighting conditions vary widely. Traditional machine learning methods struggle to detect the head under such conditions, are not robust across different identities, and therefore cannot accomplish the head pose estimation task reliably.

Compared with general machine learning methods, deep learning methods perform better on image tasks and are more suitable for head pose estimation in real scenes, specifically in three respects: (1) they are robust across people of different identities; (2) they are insensitive to changes in the background behind a person; and (3) they can estimate head pose from a single image, making real-time detection possible.

At the same time, deep learning methods for head pose estimation still have certain defects: (1) in existing methods, the parameter adjustments for the three angles interfere with one another, making the model's prediction performance hard to balance; and (2) the simple addition of a cross-entropy loss and a mean squared error loss leads to discontinuities in the angle predictions.

Summary of the Invention

In view of the defects of the prior art, the purpose of the present invention is to provide a head pose estimation method and system, aiming to solve the problems that the parameter adjustments of the three angles in existing head pose estimation interfere with one another and that the angle predictions are discontinuous.

To achieve the above object, in a first aspect, the present invention provides a head pose estimation method comprising the following steps:

determining an image containing a human face;

将所述图像输入到预先训练好的分层预测网络,预测得到人脸姿态朝向的俯仰角、偏航角以及翻滚角,以估计人脸头部姿态;所述分层预测网络包括:骨干网络、特征金字塔网络、降维模块以及分层预测模块;所述骨干网络用于提取不同尺寸的图像空间特征,所述特征金字塔网络用于将不同尺寸图像空间特征融合,得到融合特征,所述降维模块用于对所述融合特征进行三种不同维度的降维,得到图像三种维度的空间特征,不同维度对应不同的图像通道数;所述分层预测模块包括:三个全连接层;所述三个全连接层分别对所述三种维度的空间特征进行预测,每个全连接层预测得到人脸姿态朝向的一个角度,以使所述分层预测网络预测人脸姿态朝向三个角度各自关注的图像区域不同,减少三个角度预测之间的相互干扰;所述尺寸以像素为单位。The image is input to the pre-trained layered prediction network, and the pitch angle, yaw angle and roll angle of the attitude of the face are predicted to estimate the head pose of the face; the layered prediction network includes: a backbone network , a feature pyramid network, a dimensionality reduction module, and a hierarchical prediction module; the backbone network is used to extract image space features of different sizes, and the feature pyramid network is used to fuse image space features of different sizes to obtain fusion features, and the reduction The dimension module is used to perform dimensionality reduction in three different dimensions on the fusion feature to obtain spatial features of the image in three dimensions, and different dimensions correspond to different image channel numbers; the layered prediction module includes: three fully connected layers; The three fully connected layers predict the spatial features of the three dimensions respectively, and each fully connected layer predicts an angle of the orientation of the face pose, so that the layered prediction network predicts the orientation of the face pose toward the three dimensions. Each angle focuses on different image areas, reducing the mutual interference between the three angle predictions; the size is in pixels.
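To make this architecture concrete, the following is a minimal PyTorch sketch of the hierarchical prediction head described above. All module names and the channel counts below the first reduction are illustrative assumptions based on the dimensions given later in this description (the text only specifies 2048 to 1024 for the first reduction layer), not the patented implementation itself.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Sketch: fused FPN feature -> three stepwise channel reductions -> three FC heads."""
    def __init__(self, num_bins=66):
        super().__init__()
        self.dw1 = nn.Conv2d(2048, 1024, kernel_size=1)  # dimensionality reduction layer 1
        self.dw2 = nn.Conv2d(1024, 512, kernel_size=1)   # layer 2 (channel count assumed)
        self.dw3 = nn.Conv2d(512, 256, kernel_size=1)    # layer 3 (channel count assumed)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One fully connected layer per angle (pitch, yaw, roll),
        # each predicting a distribution over num_bins angle classes.
        self.fc1 = nn.Linear(1024, num_bins)
        self.fc2 = nn.Linear(512, num_bins)
        self.fc3 = nn.Linear(256, num_bins)

    def forward(self, fused):            # fused: (B, 2048, 7, 7) from the FPN
        g1 = self.dw1(fused)             # (B, 1024, 7, 7)
        g2 = self.dw2(g1)                # (B, 512, 7, 7)
        g3 = self.dw3(g2)                # (B, 256, 7, 7)
        flat = lambda g: self.pool(g).flatten(1)
        return self.fc1(flat(g1)), self.fc2(flat(g2)), self.fc3(flat(g3))

head = HierarchicalHead()
logits = head(torch.randn(2, 2048, 7, 7))
print([t.shape for t in logits])  # three (2, 66) angle-class predictions
```

Note how each head reads a different reduction stage, which is what gives the three angles separate adjustment room.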

In an optional example, the loss function used in training the hierarchical prediction network adopts a self-adjusting loss-limit coefficient. When the mean absolute error of the angles predicted by the three fully connected layers is below a threshold, the coefficient corrects the loss-magnitude inversion problem that arises when the cross-entropy loss term of a predicted angle exceeds its mean squared error loss term; when the mean absolute error is not below the threshold, the coefficient increases the error penalty contributed by these two loss terms so that the hierarchical prediction network converges faster during training. Let the error penalty of the network be a first penalty when the mean absolute error of the predicted angles is below the threshold, and a second penalty otherwise; correcting the loss-magnitude inversion problem means keeping the first penalty smaller than the second penalty, which ensures that the hierarchical prediction network can train and learn normally.

In an optional example, the backbone network comprises four residual blocks; the face image is processed by the four residual blocks in sequence, yielding four image spatial features of successively decreasing size;

The four spatial features of decreasing size are fused by the feature pyramid network. The fusion strategy is: first fuse the first-size feature with the second-size feature, and the first-size feature with the third-size feature, obtaining a new second-size feature and a new third-size feature respectively; then fuse the new second-size feature with the new third-size feature, and the new second-size feature with the fourth-size feature, obtaining a further-updated third-size feature and a new fourth-size feature respectively; finally, fuse the further-updated third-size feature with the new fourth-size feature, obtaining a fused feature of the fourth size. The first through fourth sizes decrease step by step;

The dimensionality reduction module comprises three convolution kernels. The fused feature of the fourth size is processed by the three convolution kernels in sequence; each kernel performs one dimensionality reduction on the input feature, successively yielding spatial features of three dimensionalities whose spatial size is unchanged and whose channel count decreases step by step.

In an optional example, the hierarchical prediction network adjusts the three predicted angles according to:

$$\hat{\phi} = K_1\Gamma_1 + b_1,\qquad \hat{\theta} = K_2\Gamma_2 + b_2,\qquad \hat{\psi} = K_3\Gamma_3 + b_3$$

where $\hat{\phi}$, $\hat{\theta}$, and $\hat{\psi}$ denote the predicted values of the pitch, yaw, and roll angles respectively; $K_1$, $K_2$, and $K_3$ are the weight factors of the three convolution kernels; $b_1$, $b_2$, and $b_3$ are bias terms; and $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ are the spatial features of three dimensionalities obtained by passing the fused feature through the three convolution kernels of the dimensionality reduction module.

The relationship among $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ satisfies:

$$\Gamma_2 = W_1\Gamma_1 + b_4,\qquad \Gamma_3 = W_2\Gamma_2 + b_5$$

where $W_1$ is the feedback parameter from the first convolution kernel of the dimensionality reduction module to the second, $W_2$ is the feedback parameter from the second convolution kernel to the third, $b_4$ is the new bias term introduced by the reduction from the first kernel to the second, and $b_5$ is the new bias term introduced by the reduction from the second kernel to the third.

In an optional example, the loss function $\mathcal{L}$ of the hierarchical prediction network is:

$$\mathcal{L} = \beta \cdot L_{ce} + L_{mse}$$

$$L_{ce} = -\sum_{c=1}^{k} Y_{ic}\,\log\!\big(\sigma(\hat{Y}_{ic})\big),\qquad L_{mse} = \big(y - \hat{y}\big)^2$$

where $\hat{y}$ is the head pose value predicted by the hierarchical prediction network; $y$ is the true head pose value of the face in the image; $\beta$ is the loss-limit coefficient, constructed from the co-monotonic ("larger together, smaller together") relationship between the mean squared error loss and the cross-entropy loss; $k$ is the number of angle classes; $\sigma$ is the sigmoid function; $L_{mse}$ denotes the mean squared error loss; $Y_{ic}$ denotes the one-hot encoding formed from the angle class; and $\hat{Y}_{ic}$ is the class to which the predicted angle belongs.

In a second aspect, the present invention provides a head pose estimation system, comprising:

a face image determination unit, configured to determine an image containing a human face;

a head pose estimation unit, configured to input the image into a pre-trained hierarchical prediction network and predict the pitch, yaw, and roll angles of the face pose orientation, thereby estimating the head pose. The hierarchical prediction network comprises a backbone network, a feature pyramid network, a dimensionality reduction module, and a hierarchical prediction module. The backbone network extracts image spatial features of different sizes; the feature pyramid network fuses the spatial features of different sizes to obtain a fused feature; the dimensionality reduction module reduces the fused feature to three different dimensionalities, yielding spatial features of three dimensionalities, where different dimensionalities correspond to different numbers of image channels. The hierarchical prediction module comprises three fully connected layers, each of which predicts one angle of the face pose orientation from one of the three spatial features, so that the three angles predicted by the hierarchical prediction network attend to different image regions and mutual interference among the three angle predictions is reduced. Sizes are measured in pixels.

In an optional example, the loss function used in training the hierarchical prediction network employed by the head pose estimation unit adopts a self-adjusting loss-limit coefficient. When the mean absolute error of the angles predicted by the three fully connected layers is below a threshold, the coefficient corrects the loss-magnitude inversion problem that arises when the cross-entropy loss term of a predicted angle exceeds its mean squared error loss term; when the mean absolute error is not below the threshold, the coefficient increases the error penalty contributed by these two loss terms so that the hierarchical prediction network converges faster during training. Let the error penalty of the network be a first penalty when the mean absolute error of the predicted angles is below the threshold, and a second penalty otherwise; correcting the loss-magnitude inversion problem means keeping the first penalty smaller than the second penalty, which ensures that the hierarchical prediction network can train and learn normally.

In an optional example, the backbone network used by the head pose estimation unit comprises four residual blocks; the face image is processed by the four residual blocks in sequence, yielding four image spatial features of successively decreasing size. The four spatial features of decreasing size are fused by the feature pyramid network. The fusion strategy is: first fuse the first-size feature with the second-size feature, and the first-size feature with the third-size feature, obtaining a new second-size feature and a new third-size feature respectively; then fuse the new second-size feature with the new third-size feature, and the new second-size feature with the fourth-size feature, obtaining a further-updated third-size feature and a new fourth-size feature respectively; finally, fuse the further-updated third-size feature with the new fourth-size feature, obtaining a fused feature of the fourth size, where the first through fourth sizes decrease step by step. The dimensionality reduction module comprises three convolution kernels; the fused feature of the fourth size is processed by the three convolution kernels in sequence, each kernel performing one dimensionality reduction on the input feature, successively yielding spatial features of three dimensionalities whose spatial size is unchanged and whose channel count decreases step by step.

In an optional example, the hierarchical prediction network used by the head pose estimation unit adjusts the three predicted angles according to:

$$\hat{\phi} = K_1\Gamma_1 + b_1,\qquad \hat{\theta} = K_2\Gamma_2 + b_2,\qquad \hat{\psi} = K_3\Gamma_3 + b_3$$

where $\hat{\phi}$, $\hat{\theta}$, and $\hat{\psi}$ denote the predicted values of the pitch, yaw, and roll angles respectively; $K_1$, $K_2$, and $K_3$ are the weight factors of the three convolution kernels; and $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ are the spatial features of three dimensionalities obtained by passing the fused feature through the three convolution kernels of the dimensionality reduction module. The relationship among $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ satisfies:

$$\Gamma_2 = W_1\Gamma_1 + b_4,\qquad \Gamma_3 = W_2\Gamma_2 + b_5$$

where $W_1$ is the feedback parameter from the first convolution kernel of the dimensionality reduction module to the second, $W_2$ is the feedback parameter from the second convolution kernel to the third, $b_4$ is the new bias term introduced by the reduction from the first kernel to the second, and $b_5$ is the new bias term introduced by the reduction from the second kernel to the third.

In an optional example, the loss function $\mathcal{L}$ of the hierarchical prediction network used by the head pose estimation unit is:

$$\mathcal{L} = \beta \cdot L_{ce} + L_{mse}$$

$$L_{ce} = -\sum_{c=1}^{k} Y_{ic}\,\log\!\big(\sigma(\hat{Y}_{ic})\big),\qquad L_{mse} = \big(y - \hat{y}\big)^2$$

where $\hat{y}$ is the head pose value predicted by the hierarchical prediction network; $y$ is the true head pose value of the face in the image; $\beta$ is the loss-limit coefficient, constructed from the co-monotonic relationship between the mean squared error loss and the cross-entropy loss; $k$ is the number of angle classes; $\sigma$ is the sigmoid function; $L_{mse}$ denotes the mean squared error loss; $Y_{ic}$ denotes the one-hot encoding formed from the angle class; and $\hat{Y}_{ic}$ is the class to which the predicted angle belongs.

Here, $k$ is the number of angle classes. Specifically, in the embodiment of the present invention, the range from -99° to 99° is divided into intervals of 3° each, giving 66 angle classes. Those skilled in the art may divide the angles into a different number of classes according to actual needs, which is not further limited by the present invention.
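As a hedged illustration of this binning, the mapping from a continuous angle to its class index might look as follows; the clamping behavior at the range boundaries is an assumption, since the patent does not specify it:

```python
def angle_to_class(angle_deg: float, low: float = -99.0, width: float = 3.0,
                   num_classes: int = 66) -> int:
    """Map a continuous angle in [-99, 99) degrees to one of 66 classes of 3 degrees each."""
    idx = int((angle_deg - low) // width)
    return max(0, min(num_classes - 1, idx))  # clamp out-of-range angles (assumption)

assert angle_to_class(-99.0) == 0
assert angle_to_class(0.0) == 33
assert angle_to_class(98.9) == 65
```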

In general, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

The present invention provides a head pose estimation method and system that treats head pose estimation as three branches of the same task and incorporates the ideas of feature pyramids and multi-task convolution. Compared with traditional head pose estimation methods, it reduces the mutual interference among the adjustments of the three angles, so that the head pose estimation results have smaller deviations. Traditional methods train head pose estimation with a simple sum of cross-entropy and mean squared error losses; after analyzing the drawbacks of this traditional loss function, the present invention optimizes the loss and eliminates the intermittent angle estimation loss caused by the discontinuity of the loss function itself, further improving the head pose estimation results. The method is also compatible with the latest rotation-matrix-based head pose estimation methods, which opens the possibility of further improving head pose estimation accuracy in the future.

Brief Description of the Drawings

Fig. 1 is a flowchart of the head pose estimation method provided by an embodiment of the present invention;

Fig. 2 is an implementation block diagram of the head pose estimation method provided by an embodiment of the present invention;

Fig. 3 is a framework diagram of the head pose estimation model proposed by an embodiment of the present invention;

Fig. 4 is an explanatory diagram of the discontinuous angle prediction problem provided by an embodiment of the present invention;

Fig. 5 is a heat map of attention differences across the angles proposed by an embodiment of the present invention;

Fig. 6 is a schematic diagram of the external attention mechanism provided by an embodiment of the present invention;

Fig. 7 is an effect diagram of head pose estimation provided by an embodiment of the present invention;

Fig. 8 is an architecture diagram of the head pose estimation system provided by an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

Fig. 1 is a flowchart of the head pose estimation method provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:

S101: determining an image containing a human face;

S102: inputting the image into a pre-trained hierarchical prediction network to predict the pitch, yaw, and roll angles of the face pose orientation, thereby estimating the head pose. The hierarchical prediction network comprises a backbone network, a feature pyramid network, a dimensionality reduction module, and a hierarchical prediction module. The backbone network extracts image spatial features of different sizes; the feature pyramid network fuses the spatial features of different sizes to obtain a fused feature; the dimensionality reduction module reduces the fused feature to three different dimensionalities, yielding spatial features of three dimensionalities, where different dimensionalities correspond to different numbers of image channels. The hierarchical prediction module comprises three fully connected layers, each of which predicts one angle of the face pose orientation from one of the three spatial features, so that the three angles attend to different image regions and mutual interference among the three angle predictions is reduced. Sizes are measured in pixels.

Specifically, the present invention adopts the following technical solution: a head pose estimation method with loss self-adjustment under hierarchical prediction, where hierarchical prediction refers to separating the prediction tasks of the three angles across different network layers, and loss self-adjustment refers to adding constraints to the loss function itself so that the discontinuous angle prediction problem is solved from the perspective of the loss function. The method comprises the following steps:

(1) Crop the face from a single RGB image of a person, or from an image in a standard dataset; the cropped image size is 224×224. Since data preprocessing methods are common and their principles simple, they are not described in detail here;

(2) In the model, use a feature pyramid to fuse image features of different scales extracted by different network layers, and adopt a dynamic adaptive spatial feature fusion strategy so that the fusion weights are assigned automatically. ResNet-50 is used as the backbone network in combination with the feature pyramid strategy; during fusion, only features within two hierarchy levels of each other are fused, and features beyond this range are not fused. After fusing the extracted image features of different sizes, a fused summary feature is obtained; this feature is reduced three times to form three features with the same spatial resolution but different channel counts, which are used for the head pose estimation tasks of the three angles respectively.

(3) Connect each of the three features obtained in step (2) to its own fully connected layer, for a total of three fully connected layers, forming a one-to-one correspondence between features and fully connected layers. The model now has room to adjust the yaw, pitch, and roll angles separately, and the mutual influence among the three angle adjustments is reduced. Yaw denotes the angle of rotation about the y-axis, called the yaw angle; pitch denotes the angle of rotation about the x-axis, called the pitch angle; and roll denotes the angle of rotation about the z-axis, called the roll angle.
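For reference, these three angles fully determine a head rotation matrix, which is what the compatibility with rotation-matrix-based methods mentioned in the summary rests on. The sketch below assumes the common Z-Y-X (roll, yaw, pitch) composition order, since the patent does not specify one:

```python
import numpy as np

def euler_to_rotation(pitch: float, yaw: float, roll: float) -> np.ndarray:
    """Compose a rotation matrix from pitch (about x), yaw (about y), roll (about z).
    Angles in radians; the R = Rz @ Ry @ Rx composition order is an assumption."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

# Using the example pose [6.1, -3.2, -15] degrees quoted later in this description.
R = euler_to_rotation(np.deg2rad(6.1), np.deg2rad(-3.2), np.deg2rad(-15.0))
print(np.allclose(R @ R.T, np.eye(3)))  # True: rotation matrices are orthogonal
```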

(4) Equip each of the three branches formed in step (3) with an external attention mechanism, so that the features the model attends to during parameter adjustment are more concentrated and identity-robust. External attention concentrates the model's attention on features shared by people of different identities, so the predictions of the three angles each form their own distinctive feature focus.

(5) Train on the three features obtained after external attention extraction in step (4) with a cross-entropy loss and a mean squared error loss. In particular, the present invention proposes a dynamically self-adjusting loss constraint term, which solves the discontinuous angle prediction problem of the traditional training process. Specifically, the co-monotonic relationship between the mean squared error and cross-entropy losses is exploited: the mean squared error constrains the cross-entropy loss so that the model's prediction loss during training follows the same increase-and-decrease trend as the true angle loss.

The ideas involved in each step of the present invention are summarized as follows. First, the face image is obtained by cropping, which removes the interference of irrelevant background factors; the reduced image size also lightens the model's computational burden. Second, the feature pyramid fusion strategy lets features of different scales all play a role: the fused feature captures both detail and the whole, eliminating the drawback that high-level features from very deep convolutional layers tend to attend only to the whole. Furthermore, the stepwise head pose prediction formed by feature dimensionality reduction combines the advantages of traditional feature fusion and multi-task prediction, so that the three angles can be coordinated well within one model. Meanwhile, the attention mechanism makes the features extracted by the model broadly applicable. Finally, the optimized loss function gives head pose estimation overall continuity, and its accuracy is greatly improved.

After the above five steps, the head pose estimation method proposed by the present invention solves the problems of mutual angle interference and discontinuous angle prediction in the traditional head pose prediction process, while the extra computation introduced by the feature pyramid is mitigated by the dimensionality reduction operations. Validation on standard datasets shows that the proposed method is robust to people of different identities and to different poses of the same person.

The present invention provides a head pose estimation method with loss self-adjustment under hierarchical prediction; its specific implementation steps are as follows:

Fig. 2 is an implementation block diagram of the head pose estimation method with loss self-adjustment under hierarchical prediction. As shown in Fig. 2, the proposed method comprises the following modules: (1) an image input module; (2) a feature fusion module; (3) a hierarchical prediction module; and (4) a loss limitation module. The specific operation steps of each module are as follows:

1. Image input module: First, obtain the image or video of the person whose head pose is to be detected and perform preprocessing. The specific operations include, but are not limited to: cropping the face to obtain a head image with the background removed; resizing the image to 224×224 pixels; and arranging video frames in frame order (fps = 60) and the images in an image set in sequence. Note that the resizing and ordering steps have no required order between them. After preprocessing, all images additionally undergo random occlusion; the purpose is to prevent the model from focusing too much on local features during training, which would reduce its generality. The output of the image input module is a batch of processed, standardized head images.
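A minimal preprocessing sketch consistent with this module is shown below; the occlusion patch size and the use of torchvision transforms are illustrative assumptions, not details taken from the patent:

```python
import random
import torch
from torchvision import transforms

def random_occlusion(img: torch.Tensor, max_frac: float = 0.3) -> torch.Tensor:
    """Zero out a random rectangle to discourage over-reliance on local features.
    Patch sizes up to max_frac of each side are an assumption."""
    _, h, w = img.shape
    ph, pw = random.randint(1, int(h * max_frac)), random.randint(1, int(w * max_frac))
    y, x = random.randint(0, h - ph), random.randint(0, w - pw)
    img = img.clone()
    img[:, y:y + ph, x:x + pw] = 0.0
    return img

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),        # standardize the cropped face image
    transforms.ToTensor(),
    transforms.Lambda(random_occlusion),  # applied to every training image
])
```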

2. Feature fusion module: As shown in Fig. 2, the feature fusion module sits below the image input module; it receives the standardized images output by the image input module and feeds the fused features into the hierarchical prediction module. A detailed illustration of the feature fusion module is shown in Fig. 3. Its operations are as follows: extract features from the standard image via the backbone network ResNet-50, using only a downsampling strategy for the features extracted by different blocks. For feature fusion, when the spatial scale ratio is 2:1, the present invention uses a convolutional layer with stride 2 and kernel size 3×3 to make the spatial scales consistent; when the ratio is 4:1, it first applies max pooling with stride 2 and then a convolutional layer with stride 2 and kernel size 3×3. For an 8:1 spatial scale ratio, the features differ too much, so no spatial fusion strategy is applied. Using S to denote each stage, the feature fusion process can be described as:

$$S_j = \gamma_1 \cdot S_{1\to j} + \gamma_2 \cdot S_{2\to j} + \gamma_3 \cdot S_{3\to j}$$

where $S_j\ (j = 3, 4)$ denotes the last two block stages, $\to j$ means the fusion is performed at the feature spatial scale of the current block, and $\gamma$ is the fusion weight. When $j = 1$ or $2$, the value corresponding to $\gamma_2$ or $\gamma_3$ is 0, i.e., only two stages of features are fused. At the same time, the present invention enforces $\gamma_1 + \gamma_2 + \gamma_3 = 1$ with $\gamma_1, \gamma_2, \gamma_3 \in [0, 1]$. To achieve this, three 1×1 convolutional layers are used to compute the weights, according to:

$$\gamma_1 = \frac{e^{\lambda_{\gamma_1}}}{e^{\lambda_{\gamma_1}} + e^{\lambda_{\gamma_2}} + e^{\lambda_{\gamma_3}}}$$

where $\lambda_{\gamma_1}$ is the weight corresponding to the first-scale feature, $\lambda_{\gamma_2}$ is the weight corresponding to the second-scale feature, $\lambda_{\gamma_3}$ is the weight corresponding to the third-scale feature, and $\gamma_1$ is the proportion obtained after this sigmoid-like (softmax-style) weighting, which finally ensures $\gamma_1 + \gamma_2 + \gamma_3 = 1$.
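A hedged sketch of this adaptive weighting follows. Computing the three weight maps with 1×1 convolutions and normalizing them jointly with a softmax matches the description above; the input channel count is an assumption:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch: fuse three same-scale feature maps with learned, softmax-normalized weights."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # One 1x1 conv per input produces a single-channel weight map (lambda).
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)
        )

    def forward(self, f1, f2, f3):
        lams = torch.cat([conv(f) for conv, f in
                          zip(self.weight_convs, (f1, f2, f3))], dim=1)  # (B, 3, H, W)
        gammas = torch.softmax(lams, dim=1)  # gamma_1 + gamma_2 + gamma_3 = 1 per pixel
        return gammas[:, 0:1] * f1 + gammas[:, 1:2] * f2 + gammas[:, 2:3] * f3

fuse = AdaptiveFusion()
out = fuse(*(torch.randn(2, 256, 14, 14) for _ in range(3)))
print(out.shape)  # torch.Size([2, 256, 14, 14])
```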

After feature fusion, the model both retains fine features and attends to the overall parts of the image. In the next stage, the fused feature is further reduced in dimensionality to form stepwise hierarchical prediction.

In a specific embodiment, an image of size 224×224 is input into the convolutional neural network; after one convolution with a 3×3 kernel, the spatial size becomes 112×112. Then, passing through the four blocks of the ResNet-50 backbone produces image spatial features of sizes 56×56, 28×28, 14×14, and 7×7. These four scales of spatial features are fused with the feature pyramid; the spatial sizes must be kept consistent during fusion. The fusion strategy is: for the 56×56 feature, apply one max-pooling layer to bring it to 28×28, then a 3×3 convolution to bring it to 14×14, and fuse it with those two scales; the remaining fusion operations follow analogously. Features more than two scales apart, such as the 112×112 and 7×7 features, are not fused. After the feature pyramid, the fused image feature has size 7×7×2048. A 1×1 convolution kernel then reduces the channel count to 7×7×1024, called dimensionality reduction layer 1 (dw1); dw1 is reduced again to obtain dimensionality reduction layer 2 (dw2), which is reduced once more to obtain dimensionality reduction layer 3 (dw3). The features of the three dimensionalities thus share the same spatial resolution, 7×7, but have different channel counts.
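The spatial sizes quoted above can be checked against a stock ResNet-50 from torchvision, as in the sketch below; this is illustrative only, since the patent's backbone may differ in stem details (it describes a single 3×3 stem convolution):

```python
import torch
from torchvision.models import resnet50

model = resnet50(weights=None)
x = torch.randn(1, 3, 224, 224)
x = model.conv1(x)  # stem: (1, 64, 112, 112)
x = model.maxpool(model.relu(model.bn1(x)))
for name in ("layer1", "layer2", "layer3", "layer4"):
    x = getattr(model, name)(x)
    print(name, tuple(x.shape))  # 56x56, 28x28, 14x14, then 7x7 with 2048 channels
```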

3. Hierarchical prediction module: As shown in Fig. 2, the main role of the hierarchical prediction module is to separate the predictions of the three angles into three branches via dimensionality reduction, with the branch predictions not interfering with one another. Before explaining the hierarchical prediction proposed by the present invention, it is necessary to describe the method in common use today: traditional methods treat the predictions of the three head pose angles as three branches of the same task that fully share the same network layers, which increases the model's burden. As shown in Fig. 4 (where AP denotes average pooling), when the model adjusts its parameters according to loss feedback from the other angles, the prediction for the angle that currently has a small loss may get worse, because the model must keep a balance among the adjustments of the three angles. In Fig. 4, MAE denotes mean absolute error.

In short, compared with head pose estimation for a single angle, estimating three angles simultaneously limits the model's performance. In the traditional head pose estimation task, the predictions of the three angles can be described by the following formula:

$$\hat{\phi} = K_1\Gamma + b_1,\qquad \hat{\theta} = K_2\Gamma + b_2,\qquad \hat{\psi} = K_3\Gamma + b_3$$

where $K$ denotes the different weights, $\Gamma$ is the shared feature extracted by the convolutional layers, $b$ is a bias factor, and $\hat{\phi}$, $\hat{\theta}$, and $\hat{\psi}$ denote the predicted values of yaw, pitch, and roll respectively. Suppose an image incurs a given prediction loss over the three angles; because the network layers share gradient backpropagation, the adjusted total prediction loss decreases, yet the result is no longer the best model for yaw. After the hierarchical prediction structure proposed by the present invention, the adjustment formulas of the three angles become:

$$\hat{\phi} = K_1\Gamma_1 + b_1,\qquad \hat{\theta} = K_2\Gamma_2 + b_2,\qquad \hat{\psi} = K_3\Gamma_3 + b_3$$

where $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ are the features obtained from the three dimensionality reduction layers dw1, dw2, and dw3 applied to the fused feature. The relationship among $\Gamma_1$, $\Gamma_2$, and $\Gamma_3$ is as follows, where $W_1$ and $W_2$ are new convolution parameters, and $b_4$ and $b_5$ are new bias terms introduced by the dimensionality reduction:

$$\Gamma_2 = W_1\Gamma_1 + b_4,\qquad \Gamma_3 = W_2\Gamma_2 + b_5$$

The present invention treats head pose estimation as three tasks, providing extra room for model parameter adjustment; the prediction order of the angles is determined by the distribution of sample counts in the dataset. As shown in Fig. 5, panel (a) shows the feature regions attended to by each angle under the traditional method, and panel (b) shows the feature regions attended to by the three angles under the hierarchical prediction method proposed by the present invention. It can be seen that after layering, the regions attended to by the three angles are no longer the same, which means the hierarchical prediction strategy is taking effect.

Thereafter, the present invention adds a layer of external attention to the prediction of each angle. The working principle of the external attention mechanism is shown in Fig. 6: by continually extracting features common to the images participating in training, the weights corresponding to these common features keep increasing while the weights of the other parts are relatively weakened. The hierarchical prediction module takes a batch of features obtained from the backbone as input, then uses a 1×1 convolutional layer to scale down the channels and reduce the computational burden; after one external attention layer, a 1×1 convolution restores the channel count, and the result is finally output to the loss limitation module.
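The following is a minimal sketch of such a branch, assuming the memory-unit formulation of external attention (two small linear layers shared across samples) together with the 1×1 channel scaling described above; the memory size and the channel reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class ExternalAttentionBranch(nn.Module):
    """Sketch: 1x1 conv down -> external attention over spatial positions -> 1x1 conv up."""
    def __init__(self, channels: int = 1024, reduced: int = 256, memory: int = 64):
        super().__init__()
        self.down = nn.Conv2d(channels, reduced, 1)       # shrink channels, cut compute
        self.mk = nn.Linear(reduced, memory, bias=False)  # shared external key memory
        self.mv = nn.Linear(memory, reduced, bias=False)  # shared external value memory
        self.up = nn.Conv2d(reduced, channels, 1)         # restore channel count

    def forward(self, x):                                 # x: (B, C, H, W)
        b, _, h, w = x.shape
        f = self.down(x).flatten(2).transpose(1, 2)       # (B, H*W, reduced)
        attn = torch.softmax(self.mk(f), dim=1)           # normalize over positions
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        f = self.mv(attn)                                 # (B, H*W, reduced)
        f = f.transpose(1, 2).reshape(b, -1, h, w)
        return self.up(f)

branch = ExternalAttentionBranch()
print(branch(torch.randn(2, 1024, 7, 7)).shape)  # torch.Size([2, 1024, 7, 7])
```

Because the key and value memories are learned parameters shared across the whole dataset rather than computed per image, the attended features gravitate toward patterns common to different identities, which is the identity robustness the text describes.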

4. Loss limitation module: Before explaining the problem solved by the present method, it is necessary to describe the discontinuous loss prediction problem caused by traditional methods. As shown in Fig. 4, when the true head pose is [6.1°, -3.2°, -15°] and the predicted angles are [5.9°, -1.9°, -9.9°], the classification loss exceeds the regression loss, so the traditional loss function incorrectly reverses the relative magnitudes of the actual yaw and pitch losses. In addition, the traditional loss function causes unbalanced losses on the two sides of an angle-class boundary; the intermittent loss and the wrongly inverted loss make the model hard to train. Another simple example illustrates the loss imbalance at the two ends of a class. Set the true angles to [0°, 3°, 5°] and the predicted angles to [1°, 3.5°, 7°], and divide the head angles in (-99°, 99°) into 66 classes at 3° intervals. When the prediction error is within 1°, there are two cases of angle prediction: inter-class loss and intra-class loss. For an intra-class loss, the cross-entropy loss is small and the total loss follows the same trend as the true loss. But for an inter-class loss, because the exponent of the mean squared error term is 2, the cross-entropy loss exceeds the mean squared loss, causing the total loss to move opposite to the true loss and making the model hard to train. The traditional loss function can be described as:

$$L = L_{ce} + L_{mse},\qquad L_{ce} = -\sum_{c=1}^{k} Y_{ic}\,\log\!\big(\sigma(\hat{Y}_{ic})\big),\qquad L_{mse} = \big(y - \hat{y}\big)^2$$

where $k$ is the number of classes; $Y_{ic}$ is the one-hot encoding formed from the angle class, 0 or 1, indicating whether the classification is correct; $\hat{Y}_{ic}$ is the class to which the predicted angle belongs; $\sigma$ denotes softmax; $L_{ce}$ denotes the cross-entropy loss; and $L_{mse}$ denotes the mean squared error loss.
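A hedged numeric check of the intra-class versus inter-class cases above follows; the bin mapping mirrors the 3° intervals, and the floor-based boundary convention is an assumption:

```python
def angle_class(a: float, low: float = -99.0, width: float = 3.0) -> int:
    return int((a - low) // width)

pairs = [(0.0, 1.0), (3.0, 3.5), (5.0, 7.0)]  # (true, predicted) from the example
for y, y_hat in pairs:
    same = angle_class(y) == angle_class(y_hat)
    print(f"true={y}, pred={y_hat}, |err|={abs(y - y_hat)}, "
          f"{'intra-class' if same else 'inter-class'}")
# The 5 -> 7 pair crosses a class boundary: its cross-entropy term jumps even
# though the angular error (2 degrees) stays small, inverting the loss trend.
```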

Considering the synergy between the two losses, the present invention places an additional constraint, the coefficient $\beta \in [0, 1]$, on the classification loss; $\beta$ is constructed from the mean squared error so that the two losses rise and fall together. The updated head pose estimation loss function is:

$$\mathcal{L} = \beta \cdot L_{ce} + L_{mse}$$

After the loss limitation, the loss term $\beta \in [0, 1]$ also enters the backpropagated gradient; when the true loss is small, the resulting penalty is small. In the example above, $\beta$ reduces the cross-entropy loss of the pitch angle to 1/5 of its original value, which resets the model's total loss to the same trend as the true loss, thereby resolving the inconsistent angle prediction caused by the traditional loss function. The present invention also increases the error penalty for angle losses above 1° to speed up the model's convergence. The input image passes through the four modules above to complete one round of training; through backpropagation, the model adjusts its parameters so that the angle predictions keep improving.
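A hedged sketch of this self-adjusting loss follows. The text gives the structure (an MSE-derived coefficient β in [0, 1] scaling the cross-entropy term, plus a stronger penalty when the error exceeds 1°) but not the exact formula for β, so the scaled-sigmoid construction, the extra-penalty weight, and the expectation-based angle regression below are assumptions:

```python
import torch
import torch.nn.functional as F

def self_adjusting_loss(logits, target_angle, bin_centers, extra_weight=2.0):
    """Sketch: beta-limited cross-entropy + MSE for one angle branch.
    logits: (B, 66) class scores; target_angle: (B,) degrees;
    bin_centers: (66,) center of each 3-degree class."""
    target_class = ((target_angle + 99.0) // 3.0).clamp(0, 65).long()
    ce = F.cross_entropy(logits, target_class, reduction="none")        # (B,)
    expected = (torch.softmax(logits, dim=1) * bin_centers).sum(dim=1)  # regressed angle
    mse = (expected - target_angle) ** 2
    # beta in [0, 1) grows with the true regression error, so a small true error
    # shrinks the cross-entropy term instead of letting it dominate
    # (assumption: scaled sigmoid; the text only states beta's range and monotonicity).
    beta = 2.0 * torch.sigmoid(mse.detach()) - 1.0
    loss = beta * ce + mse
    # Assumed form of the stated extra penalty for errors above 1 degree.
    loss = torch.where((expected - target_angle).abs() > 1.0, extra_weight * loss, loss)
    return loss.mean()

bin_centers = torch.arange(66) * 3.0 - 97.5  # centers of [-99, 99) in 3-degree bins
loss = self_adjusting_loss(torch.randn(4, 66),
                           torch.tensor([6.1, -3.2, -15.0, 0.0]), bin_centers)
print(loss)
```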

To further explain the head pose estimation method and system provided by the present invention, a specific description is given below in conjunction with an embodiment.

Fig. 7 is a schematic diagram of head pose estimation in complex environments provided by an embodiment of the present invention; it covers both large-angle deflection and occlusion of the head. Predictions were made with both the method of the present invention and the traditional method, and the results were compared. As shown in Fig. 7, when the head pose undergoes large-angle deflection or the head is occluded, the method proposed by the present invention reduces the average angle prediction loss by more than 10° compared with the traditional head pose estimation method, and its predictions for each angle are close to the true values. This shows that, compared with traditional methods, the head pose estimation method proposed by the present invention is robust in complex scenes, demonstrating that the hierarchical angle prediction strategy and the self-adjusting loss function strategy of the present invention are effective.

In Fig. 7, Ground truth is the true head pose angle data; HopeNet is an existing method, whose full name is fine-grained head pose estimation without keypoints; TPL-net is the head pose estimation method of the present invention, and the full name of the hierarchical prediction network used in the present invention is Tiered Prediction with Loss limit network.

Fig. 8 is an architecture diagram of the head pose estimation system provided by an embodiment of the present invention. As shown in Fig. 8, the system comprises:

a face image determination unit 810, configured to determine an image containing a human face;

a head pose estimation unit 820, configured to input the image into a pre-trained hierarchical prediction network and predict the pitch, yaw, and roll angles of the face pose orientation, thereby estimating the head pose. The hierarchical prediction network comprises a backbone network, a feature pyramid network, a dimensionality reduction module, and a hierarchical prediction module. The backbone network extracts image spatial features of different sizes; the feature pyramid network fuses the spatial features of different sizes to obtain a fused feature; the dimensionality reduction module reduces the fused feature to three different dimensionalities, yielding spatial features of three dimensionalities, where different dimensionalities correspond to different numbers of image channels. The hierarchical prediction module comprises three fully connected layers, each of which predicts one angle of the face pose orientation from one of the three spatial features, so that the three angles attend to different image regions and mutual interference among the three angle predictions is reduced. Sizes are measured in pixels.

It can be understood that the detailed functional implementation of each unit in Fig. 8 is described in the foregoing method embodiments and is not repeated here.

Those skilled in the art will readily understand that the above description covers only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (10)

1. A head pose estimation method, characterized in that it comprises the following steps:
determining an image containing a human face;
inputting the image into a pre-trained hierarchical prediction network to predict the pitch, yaw and roll angles of the face orientation, so as to estimate the head pose of the human face; the hierarchical prediction network comprises a backbone network, a feature pyramid network, a dimensionality-reduction module and a hierarchical prediction module; the backbone network is used to extract image spatial features of different sizes; the feature pyramid network is used to fuse the image spatial features of different sizes to obtain fused features; the dimensionality-reduction module is used to reduce the fused features along three different dimensions to obtain spatial features of the image in three dimensions, different dimensions corresponding to different numbers of image channels; the hierarchical prediction module comprises three fully connected layers which predict from the spatial features of the three dimensions respectively, each fully connected layer predicting one angle of the face orientation, so that the three angle predictions of the hierarchical prediction network attend to different image regions and the mutual interference among the three angle predictions is reduced; the sizes are measured in pixels.

2. The method according to claim 1, characterized in that the loss function used in training the hierarchical prediction network adopts a self-adjusting loss limit coefficient, so that when the mean absolute error of the angles predicted by the three fully connected layers is smaller than a threshold, the loss-magnitude reversal caused by the cross-entropy loss term of the predicted angle exceeding the mean-square-error loss term is corrected, and when the mean absolute error of the angles predicted by the three fully connected layers is not smaller than the threshold, the error penalty contributed by the two loss terms is enlarged so that the hierarchical prediction network converges faster during training; letting the error penalty of the network be a first penalty when the mean absolute error of the angles predicted by the three fully connected layers is smaller than the threshold, and a second penalty when it is not smaller than the threshold, correcting the loss-magnitude reversal means keeping the first penalty smaller than the second penalty, ensuring that the hierarchical prediction network can train and learn normally.

3. The method according to claim 1, characterized in that the backbone network comprises four residual blocks; a face image is processed by the four residual blocks in turn, yielding four image spatial features of successively decreasing size;
the four image spatial features of decreasing size are fused by the feature pyramid network, the fusion strategy being to first fuse the first-size spatial feature with the second-size spatial feature and fuse the first-size spatial feature with the third-size spatial feature, obtaining a new second-size spatial feature and a new third-size spatial feature respectively; then fuse the new second-size spatial feature with the new third-size spatial feature and fuse the new second-size spatial feature with the fourth-size spatial feature, obtaining an updated third-size spatial feature and a new fourth-size spatial feature respectively; and finally fuse the updated third-size spatial feature with the new fourth-size spatial feature, obtaining a fused feature of the fourth size; wherein the first size to the fourth size decrease step by step;
the dimensionality-reduction module comprises three convolution kernels; the fused feature of the fourth size is processed by the three convolution kernels in turn, each convolution kernel reducing the dimensionality of its input feature once, yielding spatial features of three dimensions whose spatial size is unchanged and whose channel counts decrease step by step.
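To make the fusion schedule of claim 3 concrete, the following is a schematic sketch of the pairwise merging it describes, assuming (since the claim does not specify the operator) that fusing one feature into another means resizing it to the target's spatial size and adding the two maps, with all four inputs already at a common channel width:

```python
# Sketch of the fusion schedule in claim 3, under the stated assumptions.
import torch.nn.functional as F

def fuse(a, b):
    """Resize feature map a to the spatial size of b and add the two maps."""
    return b + F.interpolate(a, size=b.shape[-2:], mode="nearest")

def pyramid_fuse(f1, f2, f3, f4):
    # Step 1: fuse the first-size feature into the second and the third.
    new_f2 = fuse(f1, f2)
    new_f3 = fuse(f1, f3)
    # Step 2: fuse the new second-size feature into the third and the fourth.
    upd_f3 = fuse(new_f2, new_f3)
    new_f4 = fuse(new_f2, f4)
    # Step 3: the final fused feature lives at the fourth (smallest) size.
    return fuse(upd_f3, new_f4)

# Example with four same-channel maps of decreasing size:
# import torch
# f = [torch.randn(1, 256, s, s) for s in (56, 28, 14, 7)]
# out = pyramid_fuse(*f)   # shape (1, 256, 7, 7)
```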
4. The method according to claim 3, characterized in that the formulas by which the hierarchical prediction network adjusts the three predicted angles are [the formula images of the original publication are not recoverable; the formulas below are reconstructed from the definitions given in this claim]:

ŷ_pitch = K1 · Γ1,  ŷ_yaw = K2 · Γ2,  ŷ_roll = K3 · Γ3

where ŷ_pitch, ŷ_yaw and ŷ_roll denote the predicted values of the pitch, yaw and roll angles respectively; K1, K2 and K3 are the weight factors of the three convolution kernels; and Γ1, Γ2 and Γ3 are the spatial features of three dimensions obtained by passing the fused feature through the three convolution kernels of the dimensionality-reduction module;

the relationship among Γ1, Γ2 and Γ3 satisfies the following formulas:

Γ2 = W1 · Γ1 + b4,  Γ3 = W2 · Γ2 + b5

where W1 is the feedback parameter from the first convolution kernel of the dimensionality-reduction module to the second, W2 is the feedback parameter from the second convolution kernel to the third, b4 is the new bias term introduced by the dimensionality reduction from the first convolution kernel to the second, and b5 is the new bias term introduced by the dimensionality reduction from the second convolution kernel to the third.
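Read literally, the relation above chains the three reduced features through learned linear maps. A minimal sketch, assuming W1 and W2 are realized as 1x1 convolutions whose bias vectors play the role of b4 and b5 (an illustrative reading, not the patented implementation):

```python
# Illustrative reading of the recursive relation among the reduced features.
import torch
import torch.nn as nn

gamma1 = torch.randn(1, 128, 7, 7)       # Γ1: output of the first reduction kernel
W1 = nn.Conv2d(128, 64, kernel_size=1)   # weight W1; its bias plays the role of b4
W2 = nn.Conv2d(64, 32, kernel_size=1)    # weight W2; its bias plays the role of b5
gamma2 = W1(gamma1)                      # Γ2 = W1·Γ1 + b4
gamma3 = W2(gamma2)                      # Γ3 = W2·Γ2 + b5
```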
5. The method according to any one of claims 1 to 4, characterized in that the loss function L̂ of the hierarchical prediction network is [the formula images of the original publication are not recoverable; a reconstruction consistent with the definitions given in this claim is]:

L̂ = β · Lce + Lmse,  with Lce = −Σ_{c=1..k} Yic · log(Ŷic)

β = σ(Lmse − Lce)

where ŷ is the head pose value predicted by the hierarchical prediction network, y is the true value of the head pose of the face in the image, β is the loss limit coefficient, constructed from the co-varying (both-large or both-small) relationship between the mean-square-error loss and the cross-entropy loss, k is the number of angle categories, σ is the sigmoid function, Lmse denotes the mean-square-error loss, Yic denotes the one-hot code formed from the angle category, and Ŷic is the category to which the predicted angle belongs.
6. A head pose estimation system, characterized in that it comprises:
a face image determination unit, configured to determine an image containing a human face;
a head pose estimation unit, configured to input the image into a pre-trained hierarchical prediction network and predict the pitch, yaw and roll angles of the face orientation, so as to estimate the head pose of the human face; the hierarchical prediction network comprises a backbone network, a feature pyramid network, a dimensionality-reduction module and a hierarchical prediction module; the backbone network is used to extract image spatial features of different sizes; the feature pyramid network is used to fuse the image spatial features of different sizes to obtain fused features; the dimensionality-reduction module is used to reduce the fused features along three different dimensions to obtain spatial features of the image in three dimensions, different dimensions corresponding to different numbers of image channels; the hierarchical prediction module comprises three fully connected layers which predict from the spatial features of the three dimensions respectively, each fully connected layer predicting one angle of the face orientation, so that the three angle predictions of the hierarchical prediction network attend to different image regions and the mutual interference among the three angle predictions is reduced; the sizes are measured in pixels.

7. The system according to claim 6, characterized in that the loss function used in training the hierarchical prediction network of the head pose estimation unit adopts a self-adjusting loss limit coefficient, so that when the mean absolute error of the angles predicted by the three fully connected layers is smaller than a threshold, the loss-magnitude reversal caused by the cross-entropy loss term of the predicted angle exceeding the mean-square-error loss term is corrected, and when the mean absolute error of the angles predicted by the three fully connected layers is not smaller than the threshold, the error penalty contributed by the two loss terms is enlarged so that the hierarchical prediction network converges faster during training; letting the error penalty of the network be a first penalty when the mean absolute error of the angles predicted by the three fully connected layers is smaller than the threshold, and a second penalty when it is not smaller than the threshold, correcting the loss-magnitude reversal means keeping the first penalty smaller than the second penalty, ensuring that the hierarchical prediction network can train and learn normally.

8. The system according to claim 6, characterized in that the backbone network used by the head pose estimation unit comprises four residual blocks; a face image is processed by the four residual blocks in turn, yielding four image spatial features of successively decreasing size; the four image spatial features of decreasing size are fused by the feature pyramid network, the fusion strategy being to first fuse the first-size spatial feature with the second-size spatial feature and fuse the first-size spatial feature with the third-size spatial feature, obtaining a new second-size spatial feature and a new third-size spatial feature respectively; then fuse the new second-size spatial feature with the new third-size spatial feature and fuse the new second-size spatial feature with the fourth-size spatial feature, obtaining an updated third-size spatial feature and a new fourth-size spatial feature respectively; and finally fuse the updated third-size spatial feature with the new fourth-size spatial feature, obtaining a fused feature of the fourth size; wherein the first size to the fourth size decrease step by step; the dimensionality-reduction module comprises three convolution kernels; the fused feature of the fourth size is processed by the three convolution kernels in turn, each convolution kernel reducing the dimensionality of its input feature once, yielding spatial features of three dimensions whose spatial size is unchanged and whose channel counts decrease step by step.

9. The system according to claim 8, characterized in that the formulas by which the hierarchical prediction network used by the head pose estimation unit adjusts the three predicted angles are [formula images reconstructed as in claim 4]:

ŷ_pitch = K1 · Γ1,  ŷ_yaw = K2 · Γ2,  ŷ_roll = K3 · Γ3

where ŷ_pitch, ŷ_yaw and ŷ_roll denote the predicted values of the pitch, yaw and roll angles respectively; K1, K2 and K3 are the weight factors of the three convolution kernels; and Γ1, Γ2 and Γ3 are the spatial features of three dimensions obtained by passing the fused feature through the three convolution kernels of the dimensionality-reduction module; the relationship among Γ1, Γ2 and Γ3 satisfies the following formulas:

Γ2 = W1 · Γ1 + b4,  Γ3 = W2 · Γ2 + b5

where W1 is the feedback parameter from the first convolution kernel of the dimensionality-reduction module to the second, W2 is the feedback parameter from the second convolution kernel to the third, b4 is the new bias term introduced by the dimensionality reduction from the first convolution kernel to the second, and b5 is the new bias term introduced by the dimensionality reduction from the second convolution kernel to the third.
10. The system according to any one of claims 6 to 9, characterized in that the loss function L̂ of the hierarchical prediction network used by the head pose estimation unit is [formula images reconstructed as in claim 5]:

L̂ = β · Lce + Lmse,  with Lce = −Σ_{c=1..k} Yic · log(Ŷic)

β = σ(Lmse − Lce)

where ŷ is the head pose value predicted by the hierarchical prediction network, y is the true value of the head pose of the face in the image, β is the loss limit coefficient, constructed from the co-varying (both-large or both-small) relationship between the mean-square-error loss and the cross-entropy loss, k is the number of angle categories, σ is the sigmoid function, Lmse denotes the mean-square-error loss, Yic denotes the one-hot code formed from the angle category, and Ŷic is the category to which the predicted angle belongs.
CN202210794371.6A 2022-07-07 2022-07-07 A head pose estimation method and system Pending CN115273143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210794371.6A CN115273143A (en) 2022-07-07 2022-07-07 A head pose estimation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210794371.6A CN115273143A (en) 2022-07-07 2022-07-07 A head pose estimation method and system

Publications (1)

Publication Number Publication Date
CN115273143A true CN115273143A (en) 2022-11-01

Family

ID=83763392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210794371.6A Pending CN115273143A (en) 2022-07-07 2022-07-07 A head pose estimation method and system

Country Status (1)

Country Link
CN (1) CN115273143A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI847507B (en) * 2023-01-18 2024-07-01 瑞昱半導體股份有限公司 Detection system and detection method
CN119919500A (en) * 2025-04-02 2025-05-02 湘江实验室 Multi-person head posture estimation method and device based on multi-task loss balance


Similar Documents

Publication Publication Date Title
CN111798400B (en) Reference-free low-light image enhancement method and system based on generative adversarial network
CN112766158B (en) Face occlusion expression recognition method based on multi-task cascade
US20220188999A1 (en) Image enhancement method and apparatus
US11232286B2 (en) Method and apparatus for generating face rotation image
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN112950477B (en) A High Resolution Salient Object Detection Method Based on Dual Path Processing
WO2018000752A1 (en) Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN103824089B (en) Cascade regression-based face 3D pose recognition method
CN110619319A (en) Improved MTCNN model-based face detection method and system
CN112766186B (en) Real-time face detection and head posture estimation method based on multitask learning
CN110458060A (en) A vehicle image optimization method and system based on confrontational learning
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
KR20180062647A (en) Metohd and apparatus for eye detection using depth information
CN108875655A (en) A kind of real-time target video tracing method and system based on multiple features
CN115273143A (en) A head pose estimation method and system
CN111814595B (en) Low-light pedestrian detection method and system based on multi-task learning
US20220215617A1 (en) Viewpoint image processing method and related device
CN113065645A (en) Twin attention network, image processing method and device
CN111723707A (en) A gaze point estimation method and device based on visual saliency
CN117333753A (en) Fire detection method based on PD-YOLO
CN111222459B (en) Visual angle independent video three-dimensional human body gesture recognition method
CN113962846A (en) Image alignment method and apparatus, computer readable storage medium and electronic device
CN115223032A (en) A method of water creature recognition and matching based on image processing and neural network fusion
CN114266900A (en) Monocular 3D target detection method based on dynamic convolution
CN114708173A (en) Image fusion method, computer program product, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination