CN108960178A - Method and system for estimating human hand pose

Info

Publication number: CN108960178A
Application number: CN201810771201.XA
Authority: CN
Inventors: 王贵锦; 陈醒濠; 季向阳
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2018-07-13
Filing date: 2018-07-13
Publication date: 2018-12-07

Abstract

The present invention provides a method and system for estimating human hand pose. The method includes: acquiring a depth image of a target hand, inputting the depth image to the convolutional layer of a first preset neural network, and outputting a feature map of the target hand; inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network, and outputting the pose estimation result of the current iteration; and, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand. The method and system can improve the accuracy of the pose estimation result to the greatest extent and solve the problem of inaccurate pose estimation caused by self-similarity between fingers.

Description

Method and system for estimating human hand pose

Technical Field

The present invention relates to the technical field of image recognition, and more specifically to a method and system for estimating human hand pose.

Background

Hand pose estimation is the problem of accurately estimating the three-dimensional coordinates of the skeleton joints of a human hand from an image. It is a key problem in computer vision and human-computer interaction, and is important in fields such as virtual reality, augmented reality, contactless interaction, and gesture recognition. With the rise and development of low-cost commercial depth cameras, hand pose estimation algorithms based on depth images have become a focus of attention.

Existing hand pose estimation methods generally fall into three categories: model-fitting methods, discriminative methods, and hybrid methods. Model-fitting methods use optimization to fit a predefined hand model to the input depth image. Discriminative methods are entirely data-driven: their goal is to learn, from labeled training data, a regressor that predicts the hand pose from the input depth image. Hybrid methods combine the two, usually obtaining an initial estimate with a discriminative method and then refining it with a model-fitting method.

However, because the hand occupies a small area, the depth image is noisy when the hand is far from the camera; the hand has many degrees of freedom, the relationships between its joints are complex, and self-occlusion occurs easily; in addition, the fingers are highly self-similar. These problems make it difficult for existing hand pose estimation methods to obtain high-precision hand pose estimation results.

In view of this, there is an urgent need for a method and system for estimating human hand pose that effectively improves the accuracy of the hand pose estimation result.

Summary of the Invention

To overcome the difficulty that existing hand pose estimation methods have in obtaining high-precision results, the present invention provides a method and system for estimating human hand pose.

In one aspect, the present invention provides a method for estimating human hand pose, comprising:

acquiring a depth image of a target hand, inputting the depth image to the convolutional layer of a first preset neural network, and outputting a feature map of the target hand;

inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network, and outputting the pose estimation result of the current iteration;

if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand.

Preferably, after outputting the pose estimation result of the current iteration, the method further comprises:

if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is greater than the preset threshold, inputting the pose estimation result of the current iteration to the decision layer as the pose estimation result of the previous iteration.

Preferably, the decision layer comprises a feature optimization layer and a fully connected layer;

correspondingly, inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network and outputting the pose estimation result of the current iteration specifically comprises:

inputting the feature map and the pose estimation result of the previous iteration to the feature optimization layer, and using the feature optimization layer to extract from the feature map, according to the pose estimation result of the previous iteration, the regional feature corresponding to each joint;

inputting the regional features corresponding to all joints to the fully connected layer, and outputting the pose estimation result of the current iteration.

Preferably, using the feature optimization layer to extract from the feature map, according to the pose estimation result of the previous iteration, the regional feature corresponding to each joint specifically comprises:

for any joint in the pose estimation result of the previous iteration, using the feature optimization layer to obtain the projection point of that joint in the feature map, and extracting a regional feature of a preset size centered on the projection point, thereby obtaining the regional feature corresponding to that joint.

Preferably, the fully connected layer comprises a first fully connected layer, a second fully connected layer, and a third fully connected layer;

correspondingly, inputting the regional features corresponding to all joints to the fully connected layer and outputting the pose estimation result of the current iteration specifically comprises:

inputting the regional features corresponding to all joints to the first fully connected layer, and using the first fully connected layer to concatenate the regional features of the joints belonging to the same finger, obtaining the local feature corresponding to each finger;

inputting the local features corresponding to all fingers to the second fully connected layer, and using the second fully connected layer to concatenate the local features of all fingers, obtaining the global feature corresponding to the target hand;

inputting the global feature to the third fully connected layer, and outputting the pose estimation result of the current iteration.
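The three-stage fully connected structure described above can be sketched as follows. This is only an illustration under stated assumptions: the joint-to-finger assignment (`FINGER_JOINTS`, five fingers with three joints each), the feature dimensions, the use of one dense block per finger, and the random untrained weights are all hypothetical and are not specified by the source.

```python
import numpy as np

class Dense:
    """Minimal dense layer; weights are random placeholders, not trained."""
    def __init__(self, in_dim, out_dim, rng, relu=True):
        self.w = rng.standard_normal((in_dim, out_dim)) * 0.01
        self.b = np.zeros(out_dim)
        self.relu = relu

    def __call__(self, x):
        y = x @ self.w + self.b
        return np.maximum(y, 0.0) if self.relu else y

# Hypothetical layout: 5 fingers x 3 joints = 15 joints.
FINGER_JOINTS = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14]]
NUM_JOINTS = 15
REGION_DIM, FINGER_DIM, HAND_DIM = 64, 32, 128  # assumed feature sizes

rng = np.random.default_rng(0)
# First stage: one block per finger, fed with that finger's joint features.
fc1 = [Dense(len(j) * REGION_DIM, FINGER_DIM, rng) for j in FINGER_JOINTS]
# Second stage: fed with the concatenation of all finger features.
fc2 = Dense(len(FINGER_JOINTS) * FINGER_DIM, HAND_DIM, rng)
# Third stage: regresses 3D coordinates for every joint (no activation).
fc3 = Dense(HAND_DIM, NUM_JOINTS * 3, rng, relu=False)

def estimate_pose(region_features):
    """region_features: (NUM_JOINTS, REGION_DIM) -> pose (NUM_JOINTS, 3)."""
    finger_features = [
        fc(np.concatenate([region_features[j] for j in joints]))
        for fc, joints in zip(fc1, FINGER_JOINTS)
    ]
    hand_feature = fc2(np.concatenate(finger_features))
    return fc3(hand_feature).reshape(NUM_JOINTS, 3)

pose = estimate_pose(rng.random((NUM_JOINTS, REGION_DIM)))
```

Grouping joints by finger before merging all fingers mirrors the hand's own structure: features of one finger are fused first, and only then combined into a feature for the whole hand.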

Preferably, before inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network, the method further comprises a step of obtaining an initial pose estimation result, specifically:

inputting the depth image to a second preset neural network, and obtaining the initial pose estimation result from the output of the second preset neural network.

Preferably, before inputting the depth image to the second preset neural network, the method further comprises:

acquiring depth image samples of a plurality of hands, labeling a preset number of joints in each depth image sample, and using each labeled depth image sample as a training sample;

training the second preset neural network with all the training samples.

In another aspect, the present invention provides a system for estimating human hand pose, comprising:

a feature extraction module, configured to acquire a depth image of a target hand, input the depth image to the convolutional layer of a first preset neural network, and output a feature map of the target hand;

a result iteration module, configured to input the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network, and output the pose estimation result of the current iteration;

a result determination module, configured to take the pose estimation result of the current iteration as the final pose estimation result of the target hand if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is less than a preset threshold.

In another aspect, the present invention provides an electronic device, comprising:

at least one processor; and

at least one memory communicatively connected to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, is able to perform any of the methods described above.

In another aspect, the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform any of the methods described above.

In the method and system for estimating human hand pose provided by the present invention, a depth image of the target hand is acquired and input to the convolutional layer of the first preset neural network, which outputs a feature map of the target hand; the feature map and the pose estimation result of the previous iteration are input to the decision layer of the first preset neural network, which outputs the pose estimation result of the current iteration; and if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, the pose estimation result of the current iteration is taken as the final pose estimation result of the target hand. The method and system use the pose estimation result of the previous iteration to guide feature extraction in the current iteration, so that further features are extracted from the feature map obtained by convolution and better-optimized features are obtained. Pose estimation is then performed on these features to obtain the pose estimation result of the current iteration, which can therefore be more accurate than that of the previous iteration. When the iterative process has essentially converged, the pose estimation result of the current iteration is taken as the final pose estimation result of the hand. This can improve the accuracy of the pose estimation result to the greatest extent and solves the problem of inaccurate pose estimation caused by self-similarity between fingers.

Brief Description of the Drawings

FIG. 1 is a schematic overall flowchart of a method for estimating human hand pose according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the iterative process of the first preset neural network according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the overall structure of a system for estimating human hand pose according to an embodiment of the present invention;

FIG. 4 is a schematic structural block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

Specific implementations of the present invention are described in further detail below with reference to the accompanying drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.

FIG. 1 is a schematic overall flowchart of a method for estimating human hand pose according to an embodiment of the present invention. As shown in FIG. 1, the present invention provides a method for estimating human hand pose, comprising:

S1: acquiring a depth image of the target hand, inputting the depth image to the convolutional layer of the first preset neural network, and outputting a feature map of the target hand;

Specifically, in this embodiment, the hand whose pose is to be estimated is taken as the target hand. First, a depth image of the target hand is acquired, in which the gray value of each pixel represents the distance from the camera of the corresponding point in the scene. Methods for acquiring depth images fall into two categories: passive ranging sensing and active depth sensing. The most common passive method is binocular stereo vision, in which two cameras a fixed distance apart simultaneously capture two images of the same scene, a stereo matching algorithm finds the corresponding pixels in the two images, and the disparity is then computed by triangulation; the disparity can be converted into depth information for the objects in the scene. In active depth sensing, the device itself emits energy to collect depth information. In this embodiment, the way the depth image of the target hand is acquired may be chosen according to actual needs and is not specifically limited here.
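The conversion from disparity to depth mentioned above can be illustrated with the standard relation Z = f * B / d for a rectified stereo pair. The sketch below is not part of the embodiment: the parameter names `focal_px` (focal length in pixels) and `baseline_m` (camera separation in meters), and the choice to mark unmatched pixels with depth 0, are assumptions made for illustration.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters) via Z = f * B / d.

    Assumes a rectified stereo pair. Pixels with non-positive disparity
    (no stereo match) are given depth 0 as an "invalid" marker.
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

depth = disparity_to_depth(
    [[50.0, 25.0], [0.0, 100.0]], focal_px=500.0, baseline_m=0.1
)
```

Larger disparity means a closer point, which is why a distant hand yields small, noise-sensitive disparities.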

Further, the depth image of the target hand is input to the convolutional layer of the first preset neural network, which convolves the depth image and outputs the feature map of the target hand. The first preset neural network is a convolutional neural network and may include multiple convolutional layers, each of which may be followed by a corresponding pooling layer. Typically, convolutional layers and pooling layers alternate, so that each pooling layer pools the output of the preceding convolutional layer. In this embodiment, the numbers of convolutional layers and pooling layers may be set according to actual needs and are not specifically limited here.

It should be noted that if the first preset neural network includes multiple convolutional layers, the output of the last convolutional layer is used as the feature map of the target hand.
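As a rough illustration of alternating convolution and pooling, with the last convolutional output taken as the feature map, consider the single-channel NumPy sketch below. The kernel sizes, the ReLU activation, the 2x2 max pooling, and the random kernels are assumptions chosen for brevity; the source does not fix any of them.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation followed by ReLU."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def max_pool2x2(x):
    """2x2 max pooling with stride 2 (an odd trailing row/column is dropped)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def extract_feature_map(depth_image, kernels):
    """Alternate convolution and pooling; return the LAST convolution's output."""
    x = np.asarray(depth_image, dtype=np.float64)
    for idx, kernel in enumerate(kernels):
        x = conv2d_valid(x, kernel)
        if idx < len(kernels) - 1:  # pool between conv layers, not after the last
            x = max_pool2x2(x)
    return x

rng = np.random.default_rng(0)
depth_image = rng.random((32, 32))                          # stand-in depth image
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]   # untrained kernels
feature_map = extract_feature_map(depth_image, kernels)
```

With a 32x32 input and two 3x3 kernels, the spatial size goes 32 to 30 (convolution), 15 (pooling), then 13 (last convolution).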

S2: inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network, and outputting the pose estimation result of the current iteration;

It should be noted that, after the feature map of the target hand has been obtained by the above steps, the fingers in the feature map are highly self-similar. For this reason, this embodiment uses pose-guided iteration to perform further feature extraction on the feature map and extract better-optimized features.

Specifically, in this embodiment, a custom decision layer is defined in the first preset neural network and placed after the convolutional layer. The decision layer has two inputs: one receives the feature map output by the convolutional layer, and the other receives the pose estimation result of the previous iteration. In this embodiment, each computation in which the first preset neural network produces an output is regarded as one iteration, and the pose estimation result of the previous iteration is the pose estimation result output by the first preset neural network in its previous computation.

After the feature map and the pose estimation result of the previous iteration are input to the decision layer of the first preset neural network, the pose estimation result of the previous iteration guides the decision layer in extracting better-optimized features from the feature map. The pose estimation result of the previous iteration already contains estimates of the three-dimensional coordinates of each joint of the target hand. Based on these estimates, the features corresponding to each joint in the current iteration can be taken from the feature map in a targeted way, so the extracted per-joint features are more accurate, that is, better optimized. Finally, the decision layer makes its decision on the basis of these features and outputs the pose estimation result of the current iteration.

It should be noted that in every iteration of the first preset neural network, the decision layer requires both a feature map and a pose estimation result as input. For the current iteration, the pose estimation result input to the decision layer is that of the previous iteration. The above steps therefore implicitly start from the second iteration. For the first iteration, a pose estimation result must be initialized so that the first preset neural network can input it to the decision layer during the first iteration. In practice, this initial pose estimation result can be obtained with an existing pose estimation method; the specific way of obtaining it may be chosen according to actual needs and is not specifically limited here.

S3: if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand.

Specifically, in the above iterative process, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than the preset threshold, that is, if the output of the first preset neural network in the current iteration differs from its output in the previous iteration by less than the preset threshold, the two results can be considered essentially the same. The iterative process of the first preset neural network has then essentially converged, and no further iterations are needed. In that case, the pose estimation result of the current iteration is taken as the final pose estimation result of the target hand. The preset threshold may be set according to actual needs and is not specifically limited here.

In the method for estimating human hand pose provided by the present invention, a depth image of the target hand is acquired and input to the convolutional layer of the first preset neural network, which outputs a feature map of the target hand; the feature map and the pose estimation result of the previous iteration are input to the decision layer of the first preset neural network, which outputs the pose estimation result of the current iteration; and if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, the pose estimation result of the current iteration is taken as the final pose estimation result of the target hand. The method uses the pose estimation result of the previous iteration to guide feature extraction in the current iteration, so that further features are extracted from the feature map obtained by convolution and better-optimized features are obtained. Pose estimation is then performed on these features to obtain the pose estimation result of the current iteration, which can therefore be more accurate than that of the previous iteration. When the iterative process has essentially converged, the pose estimation result of the current iteration is taken as the final pose estimation result of the hand. This can improve the accuracy of the pose estimation result to the greatest extent and solves the problem of inaccurate pose estimation caused by self-similarity between fingers.

Based on any of the above embodiments, a method for estimating human hand pose is provided in which, after the pose estimation result of the current iteration is output, the method further comprises: if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is greater than the preset threshold, inputting the pose estimation result of the current iteration to the decision layer as the pose estimation result of the previous iteration.

Specifically, in the above iterative process, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is greater than the preset threshold, that is, if the output of the first preset neural network in the current iteration differs from its output in the previous iteration by more than the preset threshold, the iterative process of the first preset neural network has not yet converged. In that case, the pose estimation result of the current iteration is input to the decision layer as the pose estimation result of the previous iteration. In other words, once the current iteration is complete, its pose estimation result and the feature map output by the convolutional layer of the first preset neural network are input to the decision layer of the first preset neural network, and the next iteration is carried out to obtain the next pose estimation result. The preset threshold may be set according to actual needs and is not specifically limited here.
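The stop-or-continue rule described above can be sketched as a simple loop. The sketch makes several assumptions that the source leaves open: the deviation is measured as the mean per-joint Euclidean distance, a hypothetical `max_iters` cap guards against non-convergence, and `decision_layer` is any callable that maps a feature map and the previous pose to a new pose. The toy decision layer used in the example is a stand-in, not a trained network.

```python
import numpy as np

def refine_pose(feature_map, initial_pose, decision_layer, threshold, max_iters=10):
    """Iterate the decision layer until successive pose estimates agree.

    decision_layer: callable (feature_map, previous_pose) -> current_pose,
                    where a pose is a (num_joints, 3) array.
    The deviation (mean per-joint Euclidean distance) and the max_iters
    safety cap are assumptions of this sketch.
    Returns the final pose and the number of iterations performed.
    """
    previous_pose = np.asarray(initial_pose, dtype=np.float64)
    for iteration in range(1, max_iters + 1):
        current_pose = decision_layer(feature_map, previous_pose)
        deviation = np.linalg.norm(current_pose - previous_pose, axis=1).mean()
        if deviation < threshold:
            return current_pose, iteration   # essentially converged
        previous_pose = current_pose          # feed back as "previous" result
    return previous_pose, max_iters

# Toy stand-in for the decision layer: move halfway toward a fixed target pose.
target = np.ones((21, 3))
toy_layer = lambda fmap, prev: prev + 0.5 * (target - prev)

final_pose, num_iters = refine_pose(
    feature_map=None, initial_pose=np.zeros((21, 3)),
    decision_layer=toy_layer, threshold=0.01,
)
```

Note that the feature map is computed once and reused in every iteration; only the pose estimate changes between iterations.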

In the method for estimating human hand pose provided by the present invention, after the pose estimation result of the current iteration is output, if the deviation between it and the pose estimation result of the previous iteration is greater than the preset threshold, the pose estimation result of the current iteration is input to the decision layer as the pose estimation result of the previous iteration. The method ensures that the first preset neural network keeps iterating until the iteration converges, so that the pose estimation result of the last iteration is taken as the final pose estimation result, which helps ensure the accuracy of the pose estimation result.

基于上述任一实施例，提供一种人手姿态的估计方法，决策层包括特征优化层和全连接层；相应地，将特征图和上一次迭代的姿态估计结果输入至第一预设神经网络的决策层，输出当前迭代的姿态估计结果，具体为：将特征图和上一次迭代的姿态估计结果输入至特征优化层，利用特征优化层根据上一次迭代的姿态估计结果从特征图中提取每个关节点对应的区域特征；将所有关节点对应的区域特征输入至全连接层，输出当前迭代的姿态估计结果。Based on any of the above-mentioned embodiments, a method for estimating the pose of a human hand is provided, the decision-making layer includes a feature optimization layer and a fully connected layer; correspondingly, the feature map and the pose estimation result of the last iteration are input to the first preset neural network The decision-making layer outputs the pose estimation result of the current iteration, specifically: input the feature map and the pose estimation result of the previous iteration to the feature optimization layer, and use the feature optimization layer to extract each The regional features corresponding to the joint points; input the regional features corresponding to all relevant nodes to the fully connected layer, and output the pose estimation result of the current iteration.

具体地，本实施例中，第一预设神经网络的决策层进一步包括特征优化层和全连接层。在此基础上，将特征图和上一次迭代的姿态估计结果输入至第一预设神经网络的决策层之后，特征图和上一次迭代的姿态估计结果将首先被输入至决策层中的特征优化层，特征优化层在获得特征图和上一次迭代的姿态估计结果之后，根据上一次迭代的姿态估计结果从特征图中提取目标手部的每个关节点对应的区域特征。可以理解的是，上一次迭代的姿态估计结果中已经得到了目标手部的各关节点三维坐标的估计值，在此基础上，特征优化层根据上一次迭代的姿态估计结果中各关节点三维坐标的估计值即可从特征图中有针对性地获取当前迭代中各关节点对应的特征，可以针对每个关节点提取一定区域的特征，作为每个关节点对应的区域特征。Specifically, in this embodiment, the decision-making layer of the first preset neural network further includes a feature optimization layer and a fully-connected layer. On this basis, after the feature map and the pose estimation result of the last iteration are input to the decision-making layer of the first preset neural network, the feature map and the pose estimation result of the last iteration will first be input to the feature optimization in the decision-making layer. After obtaining the feature map and the pose estimation result of the last iteration, the feature optimization layer extracts the region features corresponding to each joint point of the target hand from the feature map according to the pose estimation result of the last iteration. It can be understood that the estimated value of the three-dimensional coordinates of each joint point of the target hand has been obtained in the pose estimation result of the last iteration. On this basis, the feature optimization layer is based on the three-dimensional The estimated value of the coordinates can be targeted to obtain the features corresponding to each joint point in the current iteration from the feature map, and the features of a certain area can be extracted for each joint point as the area feature corresponding to each joint point.

通过上述方法步骤，即可获得目标手部的所有关节点对应的区域特征。之后，将所有关节点对应的区域特征输入至决策层中的全连接层，利用全连接层对所有关节点对应的区域特征进行拼接，最终根据拼接后的特征对当前迭代中目标手部的姿态进行估计，输出当前迭代的姿态估计结果。Through the above method steps, the regional features corresponding to all relevant nodes of the target hand can be obtained. Afterwards, the regional features corresponding to all related nodes are input to the fully connected layer in the decision-making layer, and the regional features corresponding to all related nodes are spliced by using the fully connected layer, and finally the pose of the target hand in the current iteration is calculated according to the spliced features. Estimate and output the pose estimation result of the current iteration.

In the hand pose estimation method provided by the present invention, the feature map and the pose estimation result of the previous iteration are input to the feature optimization layer, which extracts the region feature of each joint from the feature map according to that result; the region features of all joints are then input to the fully connected layer, which outputs the pose estimation result of the current iteration. Because the result of the previous iteration guides the extraction of each joint's features in the current iteration, and the extracted joint features are concatenated to produce the current result, the pose estimation result of the current iteration can be more accurate than that of the previous iteration.

Based on any of the above embodiments, a hand pose estimation method is provided in which using the feature optimization layer to extract the region feature of each joint from the feature map according to the pose estimation result of the previous iteration is specifically: for any joint in the pose estimation result of the previous iteration, using the feature optimization layer to obtain the projection point of that joint in the feature map, and extracting a region feature of a preset size centered on the projection point to obtain the region feature of that joint.

Specifically, once the feature map and the pose estimation result of the previous iteration have been input to the feature optimization layer of the first preset neural network, the feature optimization layer has the three-dimensional coordinates of every joint in that result. For any joint in the result of the previous iteration, the projection point of the joint in the feature map is obtained from the joint's three-dimensional coordinates. In other words, the pose estimation result of the previous iteration is projected onto the feature map, giving a projection point for each of its joints.
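The projection step can be sketched as follows. The source does not state how a three-dimensional joint is mapped onto the feature map, so this sketch assumes a pinhole depth camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and a hypothetical `stride` relating image pixels to feature-map cells; the function name `project_joint` is likewise illustrative only.

```python
def project_joint(joint_xyz, fx, fy, cx, cy, stride):
    """Project a 3D joint (camera coordinates, z > 0) to feature-map indices.

    Assumption: pinhole camera model followed by a uniform downsampling
    stride; the patent text does not specify the projection.
    """
    x, y, z = joint_xyz
    if z <= 0:
        raise ValueError("joint must lie in front of the camera (z > 0)")
    # Pinhole projection to image pixel coordinates.
    u = fx * x / z + cx
    v = fy * y / z + cy
    # Divide by the stride to move from image to feature-map resolution.
    col = int(round(u / stride))
    row = int(round(v / stride))
    return row, col


# Example: a joint on the optical axis lands at the principal point.
row, col = project_joint((0.0, 0.0, 500.0), 500.0, 500.0, 64.0, 64.0, 4)
```

A network that regresses joints directly in normalized image coordinates would skip the pinhole step and only rescale to the feature-map size.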

The above steps yield the projection points of all joints in the pose estimation result of the previous iteration. For any projection point, a region feature of a preset size is extracted from the feature map centered on that point; this is the region feature of the projection point, and therefore of the joint to which it corresponds. The region features of all joints obtained in this way serve as the region features of all joints for the current iteration. The preset size may be 7×7, that is, a 7×7 region feature is extracted around each projection point; in practice the preset size can be set as required and is not specifically limited here.
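As an illustration of the region extraction described above, the following minimal sketch crops a 7×7 patch around a projection point from a single-channel feature map stored as a list of rows. The zero fill for cells outside the map is an assumption (the source does not say how borders are handled), and the name `crop_region` is hypothetical; a real feature map would have many channels, each cropped the same way.

```python
def crop_region(feature_map, center, size=7):
    """Return a size x size patch of a 2D feature map centered on `center`.

    Assumption: cells outside the map are filled with 0.0.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the patch has a center cell")
    rows, cols = len(feature_map), len(feature_map[0])
    half = size // 2
    r0, c0 = center
    patch = []
    for r in range(r0 - half, r0 + half + 1):
        line = []
        for c in range(c0 - half, c0 + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            line.append(feature_map[r][c] if inside else 0.0)
        patch.append(line)
    return patch


# Example: a 10 x 10 map whose cell (r, c) holds the value 10 * r + c.
fmap = [[float(10 * r + c) for c in range(10)] for r in range(10)]
patch = crop_region(fmap, (5, 5))  # 7 x 7 patch around row 5, column 5
```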

In the hand pose estimation method provided by the present invention, for any joint in the pose estimation result of the previous iteration, the feature optimization layer obtains the projection point of that joint in the feature map and extracts a region feature of a preset size centered on the projection point, giving the region feature of that joint. Because the result of the previous iteration guides the extraction of each joint's features in the current iteration, better joint features can be extracted, so that the pose estimation result of the current iteration is more accurate than that of the previous iteration.

Based on any of the above embodiments, a hand pose estimation method is provided in which the fully connected layer comprises a first fully connected layer, a second fully connected layer, and a third fully connected layer. Correspondingly, inputting the region features of all joints to the fully connected layer and outputting the pose estimation result of the current iteration is specifically: inputting the region features of all joints to the first fully connected layer, which concatenates the region features of joints belonging to the same finger to obtain a local feature for each finger; inputting the local features of all fingers to the second fully connected layer, which concatenates them to obtain an overall feature of the target hand; and inputting the overall feature to the third fully connected layer, which outputs the pose estimation result of the current iteration.

Specifically, in this embodiment the fully connected layer of the first preset neural network comprises a first fully connected layer, a second fully connected layer, and a third fully connected layer. When the region features of all joints are input to the fully connected layer, they are first passed to the first fully connected layer, which concatenates the region features of joints belonging to the same finger to obtain a local feature for each finger. That is, the region features of all joints of the thumb are concatenated to obtain the local feature of the thumb; the region features of all joints of the index finger are concatenated to obtain the local feature of the index finger; and the local features of the middle finger, the ring finger, and the little finger are obtained in the same way from the region features of their respective joints.

Next, the local features of all fingers are input to the second fully connected layer, which concatenates them to obtain the overall feature of the target hand. Finally, the overall feature is input to the third fully connected layer, which performs a regression on it and outputs the pose estimation result of the current iteration.
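The joint-to-finger-to-hand grouping described above can be sketched as below. This shows only the order of concatenation; the learned weights of the three fully connected layers are omitted, and the joint-to-finger assignment in the example is hypothetical because the source does not list which joints belong to which finger. The names `concat` and `fuse_hierarchically` are illustrative.

```python
def concat(vectors):
    """Concatenate a list of feature vectors into one flat vector."""
    return [value for vec in vectors for value in vec]


def fuse_hierarchically(joint_features, finger_joints):
    """Joint features -> per-finger local features -> whole-hand feature.

    joint_features: dict mapping joint index to its flattened region feature.
    finger_joints:  dict mapping finger name to the joint indices it owns
                    (a hypothetical grouping supplied by the caller).
    """
    # Stage 1: concatenate the features of joints on the same finger.
    finger_features = {
        finger: concat([joint_features[j] for j in joints])
        for finger, joints in finger_joints.items()
    }
    # Stage 2: concatenate all finger features into the hand feature.
    hand_feature = concat(list(finger_features.values()))
    return finger_features, hand_feature


# Toy example with four joints and one scalar feature per joint.
features = {j: [float(j)] for j in range(4)}
groups = {"thumb": [0, 1], "index": [2, 3]}
per_finger, whole_hand = fuse_hierarchically(features, groups)
```

In a trained network each stage would also apply a learned linear map and non-linearity before the next concatenation, and the final stage would regress the three-dimensional joint coordinates from the hand feature.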

In the hand pose estimation method provided by the present invention, the region features of all joints are input to the first fully connected layer, which concatenates the region features of joints belonging to the same finger to obtain a local feature for each finger; the local features of all fingers are input to the second fully connected layer, which concatenates them to obtain the overall feature of the target hand; and the overall feature is input to the third fully connected layer, which outputs the pose estimation result of the current iteration. Because the region features are concatenated first within each finger and only then across fingers, the resulting overall feature reflects the constraints between joints, which helps improve the accuracy of the pose estimation result.

Based on any of the above embodiments, a hand pose estimation method is provided which, before the feature map and the pose estimation result of the previous iteration are input to the decision layer of the first preset neural network, further comprises a step of obtaining an initial pose estimation result, specifically: inputting the depth image to a second preset neural network and obtaining the initial pose estimation result from the output of the second preset neural network.

Specifically, in every iteration of the first preset neural network the decision layer requires two inputs at once: a feature map and a pose estimation result. For the current iteration, the pose estimation result supplied to the decision layer is that of the previous iteration, so the steps above implicitly start from the second iteration. For the first iteration, a pose estimation result must therefore be initialized so that the first preset neural network can input it to the decision layer.

In this embodiment, before iteration begins, the depth image of the target hand is input to the second preset neural network, which makes a preliminary estimate of the pose of the target hand; the output of the second preset neural network is taken as the initial pose estimation result of the target hand. The second preset neural network may be a convolutional neural network and is trained in advance. In practice, the initial pose estimation result may also be obtained by other pose estimation methods; the choice can be made as required and is not specifically limited here.

In the hand pose estimation method provided by the present invention, before the feature map and the pose estimation result of the previous iteration are input to the decision layer of the first preset neural network, the depth image is input to the second preset neural network and the initial pose estimation result is obtained from its output. Obtaining the initial pose estimation result of the target hand in this way allows the first preset neural network to carry out its first iteration.

Based on any of the above embodiments, a hand pose estimation method is provided which, before the depth image is input to the second preset neural network, further comprises: acquiring a plurality of hand depth image samples, labeling a preset number of joints in each depth image sample, and taking each labeled depth image sample as a training sample; and training the second preset neural network with all the training samples.

Specifically, the second preset neural network must be trained before the depth image is input to it. The training process is as follows:

First, hand depth image samples are acquired and a preset number of joints is labeled in each sample. In this embodiment, 14 joints are labeled per depth image sample, distributed over the fingers and the palm. Each labeled depth image is taken as one training sample, and all training samples are input to the second preset neural network to train it. A target error may be set in advance; training of the second preset neural network is complete when the error between the pose estimation result it outputs and the actual (labeled) pose is less than the target error.
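The stopping rule above can be illustrated with a short sketch. The source does not define how the error is measured, so the mean Euclidean distance per joint is assumed here; the names `mean_joint_error` and `training_done` are hypothetical.

```python
import math


def mean_joint_error(predicted, labeled):
    """Mean Euclidean distance between predicted and labeled 3D joints.

    Assumption: the patent does not name the error metric; the mean
    per-joint distance is a common choice and is used for illustration.
    """
    if len(predicted) != len(labeled):
        raise ValueError("joint counts differ")
    total = sum(math.dist(p, q) for p, q in zip(predicted, labeled))
    return total / len(predicted)


def training_done(predicted, labeled, target_error):
    """Return True once the error drops below the preset target error."""
    return mean_joint_error(predicted, labeled) < target_error


# Example with two joints: one exact, one off by 5 units.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
label = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mean_joint_error(pred, label)
```

In practice the check would be evaluated over a held-out set of samples rather than a single one, but the comparison against the target error is the same.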

It follows that if 14 joints are labeled in the training samples used to train the second preset neural network, the initial pose estimation result that the second preset neural network outputs for the depth image of the target hand also contains the three-dimensional coordinates of 14 joints. The first iteration of the first preset neural network extracts features according to the initial pose estimation result and produces the pose estimation result of the first iteration, and every later iteration extracts features according to the result of the iteration before it. The pose estimation result output by every iteration of the first preset neural network therefore has the same number of joints as the initial pose estimation result, namely 14.

In the hand pose estimation method provided by the present invention, before the depth image is input to the second preset neural network, a preset number of joints is labeled in each depth image sample, each labeled sample is taken as a training sample, and the second preset neural network is trained with all the training samples. Once trained, the second preset neural network can be used to obtain the initial pose estimation result of the target hand, which in turn allows the first preset neural network to carry out its first iteration.

To make the iterative steps of the first preset neural network easier to follow, an example is described below:

FIG. 2 is a schematic diagram of the iterative process of the first preset neural network according to an embodiment of the present invention. As shown in FIG. 2, the depth image of the target hand is first input to the first preset neural network, which is a convolutional neural network (CNN); the feature map of the target hand is obtained after the convolution processing of the first preset neural network.

For the t-th iteration, the pose estimation result of the (t-1)-th iteration, pose_{t-1}, and the feature map are input to the feature optimization layer of the first preset neural network (the individual layers of the first preset neural network are not shown in the figure). In the feature optimization layer, the pose estimation result of the (t-1)-th iteration is projected onto the feature map, giving the projection point of each of its joints; a 7×7 region feature is then extracted around each projection point, which gives the region feature of each joint for the t-th iteration. As shown in FIG. 2, the region features in the t-th iteration cover 14 joints, located at the fingertips, the finger bases, the palm center, and so on.

The region features of all joints are then input to the first fully connected layer of the first preset neural network, where the region features belonging to the same finger are concatenated to obtain the local feature of each finger. The local features of all fingers are input to the second fully connected layer, where they are concatenated to obtain the overall feature of the target hand. Finally, the overall feature is input to the third fully connected layer, where a regression on the overall feature produces the pose estimation result of the t-th iteration, pose_t.

For the (t+1)-th iteration, the pose estimation result of the t-th iteration, pose_t, and the feature map are input to the feature optimization layer of the first preset neural network, and the (t+1)-th iteration proceeds in the same way.
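The iteration from pose_{t-1} to pose_t, together with the stopping test on the deviation between consecutive results, can be sketched as a loop. This is a sketch under stated assumptions rather than the patented implementation: the deviation is taken to be the largest per-joint displacement (the source does not define it), the `max_iters` cap is added for safety and is not in the source, and `decision_layer` is a placeholder callable standing in for the feature optimization layer plus the three fully connected layers.

```python
import math


def max_joint_shift(pose_a, pose_b):
    """Largest Euclidean displacement of any joint between two poses."""
    return max(math.dist(p, q) for p, q in zip(pose_a, pose_b))


def refine_pose(feature_map, initial_pose, decision_layer, threshold,
                max_iters=10):
    """Iterate the decision layer until consecutive poses nearly agree.

    Returns the final pose and the number of iterations performed.
    """
    pose_prev = initial_pose
    for t in range(1, max_iters + 1):
        # The feature map is computed once; only the pose changes per step.
        pose_t = decision_layer(feature_map, pose_prev)
        if max_joint_shift(pose_t, pose_prev) < threshold:
            return pose_t, t
        # Otherwise the current result becomes the "previous" result.
        pose_prev = pose_t
    return pose_prev, max_iters
```

A caller would pass the output of the second preset neural network as `initial_pose` and the trained decision layer as `decision_layer`; the convolutional feature map is reused unchanged across iterations, as in the description above.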

FIG. 3 is a schematic diagram of the overall structure of a hand pose estimation system according to an embodiment of the present invention. As shown in FIG. 3, based on any of the above embodiments, a hand pose estimation system is provided, comprising:

a feature extraction module 1, configured to acquire a depth image of the target hand, input the depth image to the convolutional layer of the first preset neural network, and output a feature map of the target hand;

a result iteration module 2, configured to input the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network and output the pose estimation result of the current iteration;

a result determination module 3, configured to take the pose estimation result of the current iteration as the final pose estimation result of the target hand if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold.

Specifically, the present invention provides a hand pose estimation system comprising a feature extraction module 1, a result iteration module 2, and a result determination module 3, which together carry out the method steps of any of the above method embodiments. For the implementation details, refer to the method embodiments above; they are not repeated here.

The hand pose estimation system provided by the present invention acquires a depth image of the target hand, inputs the depth image to the convolutional layer of the first preset neural network, and outputs a feature map of the target hand; inputs the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network and outputs the pose estimation result of the current iteration; and, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, takes the pose estimation result of the current iteration as the final pose estimation result of the target hand. The system uses the pose estimation result of the previous iteration to guide feature extraction in the current iteration, extracting further features on top of the convolutional feature map to obtain better features, and estimates the pose from those features to obtain the result of the current iteration. The result of the current iteration can therefore be more accurate than that of the previous iteration. Once the iterative process has essentially converged, the result of the current iteration is taken as the final pose estimation result of the hand, which improves the accuracy of the pose estimation result as far as possible and addresses the inaccuracy caused by the self-similarity between fingers.

FIG. 4 shows a structural block diagram of an electronic device according to an embodiment of the present invention. Referring to FIG. 4, the electronic device comprises a processor 41, a memory 42, and a bus 43, wherein the processor 41 and the memory 42 communicate with each other through the bus 43. The processor 41 is configured to call program instructions in the memory 42 to execute the method provided by any of the above method embodiments, for example: acquiring a depth image of the target hand, inputting the depth image to the convolutional layer of the first preset neural network, and outputting a feature map of the target hand; inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network and outputting the pose estimation result of the current iteration; and, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand.

This embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to execute the method provided by any of the above method embodiments, for example: acquiring a depth image of the target hand, inputting the depth image to the convolutional layer of the first preset neural network, and outputting a feature map of the target hand; inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network and outputting the pose estimation result of the current iteration; and, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand.

This embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the method provided by any of the above method embodiments, for example: acquiring a depth image of the target hand, inputting the depth image to the convolutional layer of the first preset neural network, and outputting a feature map of the target hand; inputting the feature map and the pose estimation result of the previous iteration to the decision layer of the first preset neural network and outputting the pose estimation result of the current iteration; and, if the deviation between the pose estimation result of the current iteration and that of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.

The embodiments described above, such as the electronic device, are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected as needed to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above implementations, those skilled in the art will clearly understand that each implementation can be realized by software together with the necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the essence of the above technical solution, or the part of it that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.

Finally, the method of the present application is only a preferred embodiment and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for estimating a human hand pose, characterized by comprising:

acquiring a depth image of a target hand, inputting the depth image into the convolutional layer of a first preset neural network, and outputting a feature map of the target hand;

inputting the feature map and the pose estimation result of the previous iteration into the decision layer of the first preset neural network, and outputting the pose estimation result of the current iteration;

if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is less than a preset threshold, taking the pose estimation result of the current iteration as the final pose estimation result of the target hand.

2. The method according to claim 1, characterized in that, after outputting the pose estimation result of the current iteration, the method further comprises:

if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is greater than the preset threshold, inputting the pose estimation result of the current iteration into the decision layer as the pose estimation result of the previous iteration.
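
As an illustration only, the iteration recited in claims 1 and 2 can be sketched as a simple loop. The names `refine_pose`, `pose_deviation`, and `decision_layer` are hypothetical, and several details are assumptions not stated in the claims: the deviation is taken as the mean Euclidean distance between corresponding joints, a deviation exactly equal to the threshold is treated as not yet converged, and a `max_iters` cap is added so the loop always terminates.

```python
import math
from typing import Callable, List, Sequence, Tuple

Joint = Tuple[float, float, float]   # 3D coordinates of one joint
Pose = List[Joint]                   # one pose estimation result


def pose_deviation(a: Sequence[Joint], b: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding joints (assumed metric)."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def refine_pose(
    feature_map: object,
    initial_pose: Pose,
    decision_layer: Callable[[object, Pose], Pose],
    threshold: float,
    max_iters: int = 10,             # assumption: safety cap, not in the claims
) -> Pose:
    """Iterate the decision layer until successive estimates agree."""
    previous = initial_pose
    for _ in range(max_iters):
        # Feature map plus previous estimate in, current estimate out.
        current = decision_layer(feature_map, previous)
        if pose_deviation(current, previous) < threshold:
            return current           # converged: final pose estimation result
        previous = current           # otherwise feed the estimate back in
    return previous                  # cap reached: return the latest estimate
```

The feature map is computed once and reused in every iteration; only the pose estimate changes between passes through the decision layer.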

3. The method according to claim 1, characterized in that the decision layer comprises a feature optimization layer and a fully connected layer;

correspondingly, inputting the feature map and the pose estimation result of the previous iteration into the decision layer of the first preset neural network and outputting the pose estimation result of the current iteration specifically comprises:

inputting the feature map and the pose estimation result of the previous iteration into the feature optimization layer, and using the feature optimization layer to extract, from the feature map, the region feature corresponding to each joint according to the pose estimation result of the previous iteration;

inputting the region features corresponding to all joints into the fully connected layer, and outputting the pose estimation result of the current iteration.

4. The method according to claim 3, characterized in that using the feature optimization layer to extract, from the feature map, the region feature corresponding to each joint according to the pose estimation result of the previous iteration specifically comprises:

for any joint in the pose estimation result of the previous iteration, using the feature optimization layer to obtain the projection point of the joint in the feature map, and extracting a region feature of a preset size centered on the projection point to obtain the region feature corresponding to the joint.
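
A minimal sketch of the region extraction in claim 4 is given below. It rests on assumptions that the claim does not state: the joint is already expressed in image pixel coordinates, the feature map is a single-channel grid downsampled from the image by an integer `stride`, and cells outside the map are zero-padded. The function names `project_to_feature_map` and `crop_region` are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # single-channel H x W grid, for simplicity


def project_to_feature_map(u: float, v: float, stride: int) -> Tuple[int, int]:
    """Map image pixel (u = column, v = row) to a feature-map cell (row, col).

    Assumes the feature map is the image downsampled by `stride`.
    """
    return int(v // stride), int(u // stride)


def crop_region(feature_map: FeatureMap, row: int, col: int, size: int) -> FeatureMap:
    """Extract a size x size region centered on (row, col).

    Cells that fall outside the feature map are filled with 0.0
    (an assumed padding rule).
    """
    half = size // 2
    height, width = len(feature_map), len(feature_map[0])
    region = []
    for r in range(row - half, row - half + size):
        out_row = []
        for c in range(col - half, col - half + size):
            inside = 0 <= r < height and 0 <= c < width
            out_row.append(feature_map[r][c] if inside else 0.0)
        region.append(out_row)
    return region
```

Calling `crop_region` once per joint of the previous estimate yields one region feature per joint, which is what the fully connected layer of claim 3 consumes.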

5. The method according to claim 3, characterized in that the fully connected layer comprises a first fully connected layer, a second fully connected layer, and a third fully connected layer;

correspondingly, inputting the region features corresponding to all joints into the fully connected layer and outputting the pose estimation result of the current iteration specifically comprises:

inputting the region features corresponding to all joints into the first fully connected layer, and using the first fully connected layer to concatenate the region features of the joints belonging to the same finger, obtaining the local feature corresponding to each finger;

inputting the local features corresponding to all fingers into the second fully connected layer, and using the second fully connected layer to concatenate the local features of all fingers, obtaining the global feature corresponding to the target hand;

inputting the global feature into the third fully connected layer, and outputting the pose estimation result of the current iteration.
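
The three-stage fusion in claim 5 can be sketched as follows. This is only an outline of the data flow: the fully connected layers are passed in as plain callables (`fc1`, `fc2`, `fc3`, all hypothetical), the grouping of joints into fingers is supplied by the caller, and sharing one `fc1` across all fingers is an assumption, since the claim does not say whether the weights are shared.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]   # stand-in for a fully connected layer


def concat(vectors: List[Vector]) -> Vector:
    """Concatenate several feature vectors into one."""
    return [x for v in vectors for x in v]


def hierarchical_pose(
    joint_features: Dict[str, Vector],      # region feature per joint
    finger_joints: Dict[str, List[str]],    # joints belonging to each finger
    fc1: Layer,
    fc2: Layer,
    fc3: Layer,
) -> Vector:
    """Fuse joint features into finger features, then a hand feature, then a pose."""
    finger_features = []
    for joints in finger_joints.values():
        # Stage 1: concatenate the joints of one finger, then apply fc1.
        local_input = concat([joint_features[j] for j in joints])
        finger_features.append(fc1(local_input))
    # Stage 2: concatenate all finger features, then apply fc2.
    global_feature = fc2(concat(finger_features))
    # Stage 3: regress the pose estimate from the global feature.
    return fc3(global_feature)
```

Grouping by finger before merging keeps the features of each finger separate in the early stage, which fits the stated aim of reducing confusion caused by the self-similarity between fingers.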

6. The method according to claim 1, characterized in that, before inputting the feature map and the pose estimation result of the previous iteration into the decision layer of the first preset neural network, the method further comprises a step of obtaining an initial pose estimation result, specifically:

inputting the depth image into a second preset neural network, and obtaining the initial pose estimation result according to the output of the second preset neural network.

7. The method according to claim 6, characterized in that, before inputting the depth image into the second preset neural network, the method further comprises:

acquiring depth image samples of a plurality of hands, marking a preset number of joints in each depth image sample, and taking each marked depth image sample as a training sample;

training the second preset neural network using all of the training samples.

8. A system for estimating a human hand pose, characterized by comprising:

a feature extraction module, configured to acquire a depth image of a target hand, input the depth image into the convolutional layer of a first preset neural network, and output a feature map of the target hand;

a result iteration module, configured to input the feature map and the pose estimation result of the previous iteration into the decision layer of the first preset neural network, and output the pose estimation result of the current iteration;

a result determination module, configured to take the pose estimation result of the current iteration as the final pose estimation result of the target hand if the deviation between the pose estimation result of the current iteration and the pose estimation result of the previous iteration is less than a preset threshold.

9. An electronic device, characterized by comprising:

at least one processor; and

at least one memory communicatively connected to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the method according to any one of claims 1 to 7.