CN106485750A

CN106485750A - A Human Pose Estimation Method Based on Supervised Local Subspace

Info

Publication number: CN106485750A
Application number: CN201610819942.1A
Authority: CN
Inventors: 邱雨; 潘力立; 王正宁
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-09-13
Filing date: 2016-09-13
Publication date: 2017-03-08

Abstract

The invention discloses a human pose estimation method based on supervised local subspaces. It belongs to the technical field of computer vision and relates to human pose estimation. The method builds local linear models from a sparse, non-uniformly sampled training set, which addresses the generality and robustness problems of earlier learning algorithms and reduces the influence of sparse and non-uniform training samples on the estimate. The base algorithm is also modified in the training stage so that computational efficiency improves substantially while accuracy is preserved, which makes the method better suited to real-time human pose estimation.

Description

A Human Pose Estimation Method Based on Supervised Local Subspace

Technical Field

The invention belongs to the technical field of computer vision and relates to a human pose estimation method.

Background

With the rapid development of human-computer interaction technology, natural, multi-modal interaction has become the main way in which humans and machines interact. The first problem to be solved is that machines must be able to correctly recognize and understand human behavior, and pose estimation was proposed against this background. It is currently one of the key technologies of human-computer interaction and can be applied to human motion analysis, virtual reality, intelligent surveillance, game production and similar fields. Its large potential application value has attracted wide attention from academia and industry. Existing work on human pose estimation falls into two categories: model-free methods and model-based methods.

Model-free human pose estimation methods can be further divided into learning-based methods and sample-based methods.

(1) Learning-based methods: training samples are used to learn a mapping from the image feature space to the human pose space. Image features extracted from a new observation are fed into the learned mapping to estimate the corresponding pose. For example, in Ankur Agarwal, Bill Triggs, 3D human pose from silhouettes by relevance vector regression, Computer Vision and Pattern Recognition, vol. 2, no. 2, pp. 882-888, 2004, the authors use the shape context of the human silhouette as the feature and a relevance vector machine as the regressor; sparse Bayesian nonlinear regression learns a compact mapping from feature space to pose space that directly outputs the pose parameters for an input feature. In Romer Rosales, Vassilis Athitsos, Leonid Sigal, et al., 3D Hand Pose Reconstruction Using Specialized Mappings, ICCV, vol. 1, pp. 378-385, 2001, the input space is divided into many simple small regions, each with its own mapping function, and a feedback matching mechanism is used to reconstruct the pose. Because each region covers only a small range of training data, the mapping functions fit well, so this approach can considerably improve estimation accuracy. Learning-based methods run fast, need no special initialization, have a low storage cost and do not need to keep a sample database, but their estimates are often strongly affected by the size of the training set.

(2) Sample-based methods: a template library is built first, storing a large number of training samples made up of features and their human poses. When a test image is input, its features are extracted and compared with the samples in the template library under some metric to find the training samples most similar to the test image, and the pose of the test image is then estimated with a nearest-neighbor algorithm. Human poses are very complex, and the image feature descriptors produced by different poses can be very similar; that is, the relationship between feature descriptors and the pose space is one-to-many. For example, in Nicholas R. Howe, Silhouette Lookup for Automatic Pose Tracking, Computer Vision and Pattern Recognition Workshop, pp. 15-22, 2004, the author retrieves several close candidate samples from the template database and then uses a temporal similarity constraint to pick the best match among them. Sample-based methods need enough samples to cover all possible human poses, but because human pose is so complex a limited sample set can hardly cover the whole pose space, so these methods are only suitable for estimating specific poses.

Model-based methods divide the human body into interconnected parts, represent the body structure with a graphical model, and optimize the pose with graph inference. In other words, a prior human body model is used during pose estimation, and the model parameters are updated as the current state changes. Model-based human pose estimation consists of three main components: the graphical model, the optimization algorithm, and the observation model of the parts. The graphical model represents the constraints between connected parts. The tree model is the most commonly used; it is defined by the connections between parts and is therefore relatively intuitive. The observation model describes the appearance of the body parts and is used to measure their image similarity, which determines where each part is located.

The optimization algorithm uses the graphical model and the part observation model to estimate the human pose. Belief propagation is a commonly used algorithm, but because the state vectors of the body parts are relatively high-dimensional in human pose estimation, applying belief propagation directly is impractical. In Deva Ramanan, Learning to parse images of articulated bodies, Neural Information Processing Systems, pp. 1129-1136, 2006, the author proposed the sum-product algorithm. It inherits the message-passing mechanism but introduces a factor graph to decompose the global probability density function into a product of local probability density functions, which extends the algorithm to undirected graphs (such as conditional random fields). The sum-product algorithm still has one limitation: convergence is guaranteed only on acyclic factor graphs.

Model-based human pose estimation methods are highly general and also reduce the storage cost of training samples. With a human body model, prior knowledge can conveniently be used to handle self-occlusion and other occlusion problems, and the estimation accuracy is relatively high, which suits applications such as human pose analysis. The drawbacks are also clear: (1) optimization is relatively slow and generally cannot meet real-time requirements; (2) the quality of the initialization has a large influence on the result of the pose optimization.

Summary of the Invention

The task of the invention is to provide a human pose estimation method based on supervised local subspaces. It is a learning-based method within the model-free category. The method builds local linear models from a sparse, non-uniformly sampled training set, which addresses the generality and robustness problems of earlier learning algorithms and reduces the influence of sparse and non-uniform training samples on the estimate. The base algorithm is also modified in the training stage so that computational efficiency improves substantially while accuracy is preserved, which makes the method better suited to real-time human pose estimation.

To describe the invention conveniently, some terms are defined first.

Definition 1: Human-computer interaction. Human-computer interaction is the discipline that studies the interaction between systems and users. The system may be any kind of machine, or a computer system and its software. The human-computer interface usually refers to the part visible to the user, through which the user communicates with and operates the system.

Definition 2: Human pose. Human pose can be two-dimensional or three-dimensional. A two-dimensional pose describes how the joints of the body are distributed in the image plane. Line segments or rectangles are usually used to represent the projection of the body parts onto the image plane, and the lengths and angles of the line segments, or the sizes and orientations of the rectangles, represent the two-dimensional pose. Two-dimensional pose has no ambiguity problem. A three-dimensional pose is the position and angle information of the human target in real three-dimensional space. The estimated pose is usually expressed with a joint-tree model, although some researchers use more complex models, and the three-dimensional pose is usually obtained by model back-projection.

Definition 3: Overfitting. When a classification or regression model is trained on training samples, its outputs may agree closely with the true targets on the training set while differing greatly from the targets on the test set. This phenomenon, in which the hypothesis becomes overly complex in order to stay consistent with the training data, is called overfitting.

Definition 4: Foreground. The foreground is the person or object in the image that is close to the camera.

Definition 5: Background. The background is the scenery behind the subject in the image; it conveys the spatial and temporal setting of the people and objects.

Definition 6: Shape context feature. Shape context was proposed in 2002 and was originally used to find matching points between object shapes. The shape context descriptor makes full use of the context around each pixel, describes the shape of the interior region of an image well, and is very robust in shape matching problems. Its basic principle is as follows. For a given image, first detect the edges with an edge extraction algorithm (such as the Canny edge detector). Then sample the edge pixels to obtain a set of feature points; these may lie on inner or outer edges, need not be points of maximum curvature on the edge curve, and can be obtained by uniform sampling. For each feature point, set up a polar coordinate system with that point as the origin, divide the surrounding space into angular sectors and into concentric rings by radius, and count the feature points that fall into each region (see Figure 1(c)). Finally, build a vector from these counts; this vector is the shape context feature of the corresponding point.
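The descriptor in Definition 6 can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: the function name `shape_context`, the log-spaced radial edges (1/8 to 2 times the mean pairwise distance) and the row-per-point output layout are assumptions, while the 12 angular sectors, 5 radial bins and 200 sampled points follow the parameters given later in step 2.

```python
import numpy as np

def shape_context(points, n_angle_bins=12, n_radius_bins=5):
    """Return one polar histogram per point (n_points x 60 with the defaults)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]          # diff[i, j] = pts[j] - pts[i]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    off_diag = ~np.eye(n, dtype=bool)                 # a point does not count itself
    mean_dist = dist[off_diag].mean()                 # normalise for scale
    # Assumption: log-spaced ring edges between 1/8 and 2 mean distances.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radius_bins + 1) * mean_dist

    r_bin = np.digitize(dist, r_edges) - 1            # -1: too close, n_radius_bins: too far
    a_bin = np.minimum((angle / (2 * np.pi) * n_angle_bins).astype(int),
                       n_angle_bins - 1)

    valid = off_diag & (r_bin >= 0) & (r_bin < n_radius_bins)
    hist = np.zeros((n, n_radius_bins * n_angle_bins))
    rows = np.nonzero(valid)[0]
    cols = (r_bin * n_angle_bins + a_bin)[valid]
    np.add.at(hist, (rows, cols), 1)                  # count points per (ring, sector) cell
    return hist

# Example: 200 points sampled uniformly on a circle stand in for a silhouette contour.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
contour = np.c_[np.cos(t), np.sin(t)]
H = shape_context(contour)                            # shape (200, 60)
```

The patent stores the result as a 60×200 matrix (one column per sampled point); `H.T` gives that layout.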

Definition 7: K-means algorithm. K-means is a distance-based iterative algorithm that partitions m observations into k clusters so that each observation is closer to the center of its own cluster than to the center of any other cluster. The procedure is:

1) Let the sample space be $X=\{x^{(1)},\dots,x^{(m)}\}$ with $x^{(i)}\in\mathbb{R}^{n}$, where m is the number of samples and n is the dimension of each training sample. Randomly select k cluster centers from the training sample space, denoted $\mu_1,\mu_2,\dots,\mu_k\in\mathbb{R}^{n}$.

2) Repeat the following until convergence:

For each sample i∈[1,m], compute the cluster it belongs to:

$$c_i=\arg\min_{j}\left\|x^{(i)}-\mu_j\right\|^{2}$$

For each cluster j∈[1,k], recompute the center of the cluster:

$$\mu_j=\frac{\sum_{i=1}^{m}1\{c_i=j\}\,x^{(i)}}{\sum_{i=1}^{m}1\{c_i=j\}}$$

Here $1\{c_i=j\}$ is the indicator function, equal to 1 when the condition $c_i=j$ holds and 0 otherwise. After a number of iterations the algorithm converges, that is, the cluster centers no longer change or change very little, which yields the k cluster centers and the cluster to which each sample belongs.
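The two alternating updates of Definition 7 can be written directly in NumPy. The sketch below is illustrative only: the function name `kmeans`, the iteration cap, the fixed random seed and the rule of keeping the old center when a cluster becomes empty are assumptions not stated in the patent.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign each sample, then recompute each center."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1) pick k distinct training samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2a) c_i = argmin_j ||x_i - mu_j||^2
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2b) mu_j = mean of the samples assigned to cluster j
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):                  # assumption: keep old center if empty
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # final assignment against the returned centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

# Example: two well-separated groups of 2-D points.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers_demo, labels_demo = kmeans(X_demo, k=2)
```

In the method itself the same routine would be run with k = 100 on the pooled 60-dimensional shape context vectors of all training images.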

Definition 8: BVH format. A BVH file contains a character's skeleton and the rotation data of its limb joints. BVH is a general-purpose character animation file format that is widely supported by popular animation software. BVH data are typically obtained from motion-capture hardware that records human movement.
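To make the structure of a BVH file concrete, the sketch below writes a deliberately tiny one: a skeleton with a single root joint and one rotation triple per frame. The joint name `Hips`, the offsets, the channel order and the function name `single_joint_bvh` are illustrative assumptions; a real file for this method would declare the full joint hierarchy whose rotation channels make up the pose vector.

```python
def single_joint_bvh(frames, frame_time=1.0 / 30.0):
    """Build BVH text for one root joint; each frame is a (z, x, y) rotation in degrees."""
    header = [
        "HIERARCHY",
        "ROOT Hips",
        "{",
        "  OFFSET 0.0 0.0 0.0",
        "  CHANNELS 3 Zrotation Xrotation Yrotation",
        "  End Site",
        "  {",
        "    OFFSET 0.0 10.0 0.0",
        "  }",
        "}",
        "MOTION",
        f"Frames: {len(frames)}",
        f"Frame Time: {frame_time:.6f}",
    ]
    # one line of channel values per frame, in the order declared by CHANNELS
    motion = [" ".join(f"{v:.4f}" for v in frame) for frame in frames]
    return "\n".join(header + motion) + "\n"

bvh_text = single_joint_bvh([(10.0, 20.0, 30.0), (12.0, 20.0, 28.0)])
```

The HIERARCHY block defines the skeleton once, and every line after `Frame Time` is one pose, which is why an estimated angle vector can be serialized as a single MOTION row.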

A human pose estimation method based on supervised local subspaces according to the invention comprises the following steps:

Step 1: For the original image on which human pose estimation is to be performed, remove the background and obtain the human silhouette, so that the foreground is highlighted;

Step 2: Binarize the image obtained in step 1, then extract shape context features from the silhouette image. The parameters of the shape context extraction are 200 sampling points, a polar grid divided evenly into 12 angular sectors, and 5 radial bins. Each training sample therefore has a 60×200 shape context matrix, that is, 200 shape context vectors of 60 dimensions;

Step 3: Apply a dimensionality reduction to bring the shape context features of each image down to 100 dimensions, giving the image feature X;

Step 4: Reconstruct the image feature X obtained in step 3 from the pose angle Θ by means of local subspaces; the specific formula is [formula omitted in source]

where f is the mapping function from the human pose space to the image feature space, $M(c_i, G_i, \hat{\theta}_i)$ (notation adopted here) is the parameter set of the i-th local subspace, $c_i$ is the center of the subspace, $G_i$ holds the principal components of the tangent space, $\hat{\theta}_i$ is the pose angle corresponding to the center of the i-th subspace, m is the number of subspaces, and d is the dimension of the input feature of a sample;

Step 5: Take the first-order Taylor expansion of the approximating function f(Θ) from step 4, so that each training sample is reconstructed from a local subspace. If the training sample $x_p$ with pose angle $\theta_p$ lies in the subspace determined by the parameters $M(c_i, G_i, \hat{\theta}_i)$, then $x_p$ is approximated as:

$$x_p \approx c_i + G_i\,\Delta\theta_{pi}$$

where $\Delta\theta_{pi}=\theta_p-\hat{\theta}_i$ is the offset of $\theta_p$ from the pose angle $\hat{\theta}_i$ of the i-th subspace center. In addition, define $\Gamma_p=N(\theta_p)$, where $N(\theta_p)$ denotes the indices of the subspaces neighboring $\theta_p$;

Step 6: From step 5, the error function of the algorithm is [formula omitted in source]

The first term is the weighted sum of the reconstruction errors incurred when each training sample $(x_p, \theta_p)$ is reconstructed from its neighbor subspaces. Neighbor subspaces are selected by the Euclidean distance between $\theta_p$ and the pose angle $\hat{\theta}_i$ of each subspace center. The second term is a regularizer that reconstructs the mean of each subspace from its neighbor subspaces. It ensures that the subspace parameters vary smoothly and can be estimated from sparse, non-uniform data. Here $\lambda=(n/m)^2$ is a regularization parameter, the square of the ratio of the number of training samples n to the number of subspaces m, and $w_{pi}$ is the weight with which each neighbor subspace contributes to the reconstruction of a data sample; its formula is: [formula omitted in source]

where $\psi(\theta_p,\hat{\theta}_i)$ (notation adopted here) is a positive-valued function that measures the similarity between the angle $\theta_p$ and $\hat{\theta}_i$; its expression is: [formula omitted in source]

Step 7: Let $C=[c_1,\dots,c_m]\in\mathbb{R}^{d\times m}$ denote the centers of the subspaces and $G=[G_1,\dots,G_m]$ denote the bases of the subspaces, where d is the input dimension (d = 100 in this patent) and m is the number of subspaces. C and G are then computed with a closed-form solution algorithm;

Step 8: For the image feature $x_t$ of a new test sample, the algorithm improves efficiency by determining the neighbor subspaces of the test sample in the following two stages:

(1) First find $2|\Gamma_t|$ candidate subspaces, where $|\Gamma_t|$ is the chosen number of neighbor subspaces, typically 2 to 16. The candidates are the subspaces whose centers $c_i$ are closest to $x_t$ in the input space. Then compute $\theta_{ti}$ according to the formula [formula omitted in source], where $G_i$ is the basis of a selected neighbor subspace and $c_i$ is its center; $\theta_{ti}$ is the pose angle obtained by reconstructing the new test point $x_t$ from subspace i;

(2) Compare the reconstruction errors and select $|\Gamma_t|$ neighbor subspaces from the $2|\Gamma_t|$ candidates. The $\theta_{ti}$ with the smallest reconstruction error is recorded as the pose angle $\theta_{t0}$ of the test point $x_t$. The reconstruction error is given by: [formula omitted in source]

Finally, the optimal $\theta_t$ is obtained by minimizing [formula omitted in source], where the weights $w_{ti}$ are computed from the above $\theta_{t0}$ by [formula omitted in source]. This gives the optimal pose angle $\theta_t$ for the test sample point $x_t$: [formula omitted in source]

Step 9: Convert the resulting $\theta_t$ into text in BVH format and render it as the corresponding human pose image.
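The test-time procedure of step 8 can be sketched as follows. Because the formulas are not reproduced in the text, several choices here are assumptions rather than the patent's definitions: $\theta_{ti}$ is taken as the least-squares solution of $x_t \approx c_i + G_i(\theta - \hat{\theta}_i)$, the reconstruction error is the Euclidean norm of the residual, the weights come from a normalized Gaussian kernel with a hypothetical width `sigma`, and the final angle is a weighted least-squares fit over the selected neighbors. The function name `estimate_pose` and the array layouts are likewise illustrative.

```python
import numpy as np

def estimate_pose(x_t, C, G, theta_hat, k=8, sigma=1.0):
    """C: (m, d) centers, G: (m, d, q) bases, theta_hat: (m, q) center angles."""
    n_cand = min(2 * k, len(C))
    # (1) candidates: the 2k subspaces whose centers are closest to x_t
    cand = np.argsort(np.linalg.norm(C - x_t, axis=1))[:n_cand]

    thetas, errors = [], []
    for i in cand:
        # assumed: least-squares solution of x_t ~ c_i + G_i (theta - theta_hat_i)
        delta, *_ = np.linalg.lstsq(G[i], x_t - C[i], rcond=None)
        thetas.append(theta_hat[i] + delta)
        errors.append(np.linalg.norm(x_t - C[i] - G[i] @ delta))
    thetas, errors = np.array(thetas), np.array(errors)

    # (2) keep the k candidates with the smallest reconstruction error
    order = np.argsort(errors)[:k]
    neigh = cand[order]
    theta_0 = thetas[order[0]]                    # initial estimate theta_t0

    # assumed Gaussian similarity weights computed from theta_0, normalised to sum to 1
    w = np.exp(-np.sum((theta_0 - theta_hat[neigh]) ** 2, axis=1) / (2.0 * sigma ** 2))
    w = w / w.sum()

    # weighted least squares: min_theta sum_i w_i ||x_t - c_i - G_i (theta - theta_hat_i)||^2
    lhs = np.concatenate([np.sqrt(wi) * G[i] for wi, i in zip(w, neigh)])
    rhs = np.concatenate([np.sqrt(wi) * (x_t - C[i] + G[i] @ theta_hat[i])
                          for wi, i in zip(w, neigh)])
    theta_t, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return theta_t

# Synthetic check: if the true map is globally linear, every subspace is exact.
rng = np.random.default_rng(0)
d, q, m = 6, 2, 5
A_true, b_true = rng.normal(size=(d, q)), rng.normal(size=d)
theta_hat = rng.normal(size=(m, q))
C = theta_hat @ A_true.T + b_true                 # c_i = f(theta_hat_i)
G = np.stack([A_true] * m)                        # G_i = Jacobian of f
theta_true = np.array([0.3, -0.2])
theta_est = estimate_pose(A_true @ theta_true + b_true, C, G, theta_hat, k=2)
```

Pre-selecting candidates by center distance keeps the number of least-squares solves at $2|\Gamma_t|$ instead of m, which is the efficiency argument made in step 8.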

Further, the specific procedure of step 3 is:

First concatenate the shape context feature matrices of all training samples from left to right, then apply the K-means algorithm to obtain 100 vectors of 60 dimensions; these vectors are called the cluster centers of the data space. Each of the 200 shape context vectors of a sample then votes for the 100 cluster centers with a Gaussian weight: the smaller the Euclidean distance between a shape context vector and a cluster center, the closer the vote is to 1, and the larger the distance, the closer it is to 0. This yields the image feature X needed in the later steps, where the i-th column is the image feature of the i-th training sample and each sample's feature has 100 dimensions.
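The Gaussian-weighted voting above can be sketched as a soft histogram over the cluster centers. This is an illustration under assumptions: the function name `soft_vote` is hypothetical, a single shared width `sigma` is used for all centers, and the votes are summed without further normalization; the patent only states that a vote tends to 1 for near centers and to 0 for far ones.

```python
import numpy as np

def soft_vote(descriptors, centers, sigma=1.0):
    """Map (n_points, 60) shape contexts to one (n_centers,) image feature."""
    # squared Euclidean distance between every descriptor and every center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    votes = np.exp(-d2 / (2.0 * sigma ** 2))      # 1 at zero distance, near 0 far away
    return votes.sum(axis=0)                      # accumulate the votes per center

# Tiny example with 3 centers: two descriptors sit on center 0, one on center 1.
centers_demo = np.eye(3) * 10.0
descriptors_demo = centers_demo[[0, 0, 1]]
feature_demo = soft_vote(descriptors_demo, centers_demo)
```

With 100 centers of 60 dimensions and 200 descriptors per image, the same call returns the 100-dimensional feature that forms one column of X.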

Further, the specific procedure of step 7 is:

Replace C and G by U = [C, G]; the error function of step 6 can then be rewritten as: [formula omitted in source]

where $e\in\mathbf{1}^{1\times d}$, $W_p=\mathrm{diag}\{w_{pi},\ i\in\Gamma_p\}$ and $W_j=\mathrm{diag}\{w_{ji},\ i\in\Gamma_j\}$. In addition, $S_p$ is a 0-1 selection matrix whose non-zero diagonal entries sit at the indices of the neighbor subspaces of training sample $x_p$, all other entries being 0. For example, if the neighbor subspaces of $x_p$ have indices 2, 4, 5 and 6, then in $S_p$ only the entries $S_{22}, S_{44}, S_{55}, S_{66}$ are 1 and all others are 0. $S_j$ is likewise a 0-1 selection matrix whose non-zero diagonal entries sit at the indices of the neighbor subspaces of each subspace center $c_j$. s is an m×m identity matrix, $s_j$ is an m-dimensional column vector, I is an m×m identity matrix, and E(U) is a convex quadratic function of U. The solution for U is:

$$U=A\,\big(B^{(x)}+B^{(c)}\big)^{-1}$$

where $A$, $B^{(x)}$, $B^{(c)}$, C and G are given by: [formulas for $A$, $B^{(x)}$ and $B^{(c)}$ omitted in source]

$$C=U[1:d,\ 1:m]$$

$$G=U[1:d,\ 1+m:56\times m]$$

After a few optimization iterations C and G converge quickly, at which point they give approximately optimal solutions for the subspace centers and bases.
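The index ranges for C and G use 1-based, inclusive (MATLAB-style) notation. A short NumPy sketch shows the equivalent 0-based slices. The column range 1+m to 56×m implies 55 pose dimensions per subspace, which is an inference from the indices rather than a stated value; m = 61 is the subspace count reported in the experiments; the per-subspace grouping of G's columns and the placeholder contents of U are assumptions.

```python
import numpy as np

d, m, q = 100, 61, 55            # feature dim, number of subspaces, inferred pose dim
U = np.arange(d * (m + m * q), dtype=float).reshape(d, m + m * q)   # placeholder values

C = U[:, :m]                     # U[1:d, 1:m]       -> one center per column
G = U[:, m:(q + 1) * m]          # U[1:d, 1+m:56*m]  -> all bases side by side

# Assumption: the q columns of each subspace basis are stored contiguously,
# so G can be viewed as m blocks of shape (d, q).
G_blocks = G.reshape(d, m, q).transpose(1, 0, 2)
```

Under this layout, `G_blocks[i]` is the d×q basis $G_i$ that multiplies $\Delta\theta$ in the local model.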

The innovations of the invention are:

This patent introduces a supervised local subspace estimation method into the human pose estimation problem. It is a learning-based method within the model-free category. The algorithm builds local linear models from a sparse, non-uniformly sampled training set, which addresses the generality and robustness problems of earlier learning algorithms and reduces the influence of sparse and non-uniform training samples on the estimate. A closed-form solution algorithm is also proposed for optimizing the local subspace parameters. It converges faster than the earlier alternating optimization algorithm and substantially improves computational efficiency while preserving accuracy, so it is better suited to real-time human pose estimation. In addition, because the algorithm is a generative model, it also handles image noise well.

Description of the Drawings:

Figure 1: illustration of the shape context feature;

Figure 2: distribution of the 200 sampling points used when extracting shape context features from the human silhouette;

Figure 3: human silhouette after background removal;

Figure 4: pose estimation results.

Panel (c) of Figure 1 shows the shape information of the sample point at the origin of the polar coordinates. The neighboring points within the range of the polar grid fall into different cells, each cell representing a different relative vector, and these relative vectors form the shape context of the point. The diamond and square points in panels (a) and (b) have essentially the same shape context histograms, shown in panels (d) and (e), whereas the shape context of the triangle point is different, which agrees with direct observation. In Figure 4, the left side is the input human silhouette and the right side is the estimated human pose parsed into BVH format.

Detailed Description:

A human pose estimation method based on supervised local subspaces according to the invention is described with reference to the accompanying drawings. It comprises the following steps:

Step 1: For the original image on which human pose estimation is to be performed, remove the background and obtain the human silhouette, so that the foreground is highlighted (see Figure 3);

Step 2: Extract shape context features from the silhouette image. The parameters of the shape context extraction are 200 sampling points (see Figure 2), a polar grid divided evenly into 12 angular sectors, and 5 radial bins. Each training sample therefore has a 60×200 shape context matrix, that is, 200 shape context vectors of 60 dimensions;

Step 3: Apply a dimensionality reduction to bring the shape context features of each image down to 100 dimensions, giving the image feature X. Specifically, first concatenate the shape context feature matrices of all training samples from left to right, then apply the K-means algorithm to obtain 100 vectors of 60 dimensions; these vectors are called the cluster centers of the data space. Each of the 200 shape context vectors of a sample then votes for the 100 cluster centers with a Gaussian weight: the smaller the Euclidean distance between a shape context vector and a cluster center, the closer the vote is to 1, and the larger the distance, the closer it is to 0. This yields the image feature X needed in the later steps, where the i-th column is the image feature of the i-th training sample and each sample's feature has 100 dimensions.

Any clustering algorithm can be used to select the 100 center points in this step. K-means is used here, first because it is simple and fast, and second because it supplies the mean u and variance σ² for the Gaussian weights used when the shape context vectors vote.

x_p≈c_i+G_i△θ_pi x _p ≈ c _i +G _i △θ _pi

其中e∈1^1×d，W_p＝diag{w_pi,i∈Γ_p}，W_j＝diag{w_ji,i∈Γ_j}，另外，是一个0-1选择矩阵，该矩阵对角线上非零元素的序号为训练样本x_p对应的近邻子空间的序列，其余元素均为0，例如训练样本x_p对应的近邻子空间的序号分别为2，4，5，6，则在中有且仅有元素S₂₂,S₄₄,S₅₅,S₆₆为1，其他元素均为0；也是一个0-1选择矩阵，该矩阵对角线上非零元素的序号为每个子空间中心c_j对应的近邻子空间的序列；s为一个m×m的单位矩阵，s_j为一个m维列向量，I是一个m×m的单位矩阵，E(U)是关于U的一个凸二次函数。最后得到U的解的表达式为U＝A(B^(x)+B^(c))^-1 where e∈1 ^1×d , W _p = diag{w _pi , i∈Γ _p }, W _j ＝diag{w _ji ,i∈Γ _j }, in addition, is a 0-1 selection matrix, the number of non-zero elements on the diagonal of the matrix is the sequence of the adjacent subspace corresponding to the training sample x _p , and the rest of the elements are all 0, for example, the number of the adjacent subspace corresponding to the training sample x _p are 2, 4, 5, 6 respectively, then in There are and only elements S ₂₂ , S ₄₄ , S ₅₅ , and S ₆₆ are 1, and all other elements are 0; It is also a 0-1 selection matrix, the number of non-zero elements on the diagonal of the matrix is the sequence of the adjacent subspace corresponding to the center c _j of each subspace; s is an m×m identity matrix, and s _j is an m-dimensional Column vector, I is an m×m identity matrix, E(U) is a convex quadratic function about U. Finally, the expression of the solution of U is U=A(B ^(x) +B ^(c) ) ^-1

C = U[1:d, 1:m]

G = U[1:d, 1+m:56×m]
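The index ranges above use 1-based, inclusive (Matlab-style) notation. A rough Python equivalent is sketched below, assuming U is stored as a list of d rows with 56×m columns each. The helper name `split_u` is hypothetical, and the reading that the bases occupy 55×m columns is inferred from the column count 56×m = m + 55×m rather than stated in the text.

```python
def split_u(U, m):
    """Split U = [C, G] into centers C (d x m) and bases G (d x 55m).

    U[1:d, 1:m] in 1-based inclusive notation is columns 0..m-1 here;
    U[1:d, 1+m:56*m] is columns m..56*m-1.
    """
    if any(len(row) != 56 * m for row in U):
        raise ValueError("each row of U must have 56*m columns")
    C = [row[:m] for row in U]
    G = [row[m:56 * m] for row in U]
    return C, G

# Toy example with d = 2 rows and m = 3 subspaces.
m_toy = 3
U_toy = [list(range(56 * m_toy)), list(range(56 * m_toy))]
C_toy, G_toy = split_u(U_toy, m_toy)
```

How the columns of G are grouped per subspace is not specified in this section, so the sketch leaves G as a single block.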

Step 9: Parse the obtained θ_t into BVH-format text and render it as the corresponding human pose image (see Figure 3).

According to the method of the present invention, a program implementing the supervised local subspace learning algorithm is first written in Matlab; a number of images are then selected to train the algorithm and solve for the subspace centers and basis vectors; finally, an image containing a human body whose pose is to be estimated is input to the algorithm, which outputs the predicted human pose. The method can be used for human pose estimation in natural scenes, for example in behavior detection and human-computer interaction. Finally, we tested the algorithm on the Poser dataset, using 1691 images as training samples and 418 as test samples. With 61 subspaces and 8 neighboring subspaces, the best test error obtained on this dataset is 7.125°.

Claims

1. A human body pose estimation method based on supervised local subspace, which comprises the following steps:

Step 1: For the original image that requires human pose estimation, remove the background and obtain the human body contour information to achieve the effect of highlighting the foreground;

Step 2: Binarize the image obtained in step 1, and then extract shape context features from the resulting human contour image. The parameters of the shape context extraction algorithm are: 200 sampling points, 12 angular sectors in the polar coordinate system, and 5 radial divisions; therefore, for each training sample, the corresponding shape context feature is a 60×200 matrix, that is, 200 60-dimensional shape context vectors;

Step 3: Use the dimensionality reduction operation to reduce the shape context feature of each picture to 100 dimensions to obtain the image feature X;

Step 4: The image feature X obtained in step 3 is reconstructed in the local subspaces from the pose angle Θ, according to:

$$ X = f(\Theta) \approx f(\langle c_1, G_1, \hat{\theta}_1 \rangle, \ldots, \langle c_m, G_m, \hat{\theta}_m \rangle, \Theta) $$

where f is the mapping function from the human pose space to the image feature space; ⟨c_i, G_i, θ̂_i⟩ is the set of parameters of the i-th local subspace, in which c_i is the center of the subspace, G_i is the principal components of the tangent space, and θ̂_i is the human pose angle corresponding to the center of the i-th subspace; m is the number of subspaces, and d is the input feature dimension of the sample;

Step 5: Perform a first-order Taylor expansion of the approximating function f(Θ) from step 4 and reconstruct each training sample from a local subspace: if the training sample x_p with pose angle θ_p lies in the subspace determined by the parameters ⟨c_i, G_i, θ̂_i⟩, then x_p is approximated as:

$$ x_p \approx c_i + G_i \Delta\theta_{pi} $$

where Δθ_pi = θ_p − θ̂_i; the neighbor set Γ_p is defined at the same time, with N(θ_p) denoting the indices of the subspaces neighboring θ_p;
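One way to read the approximation in step 5 is sketched below. It is offered as an interpretation rather than as the patent's own derivation: it assumes f is differentiable, and the identification of c_i with f(θ̂_i) and of G_i with the Jacobian of f at θ̂_i is an assumption consistent with the description of c_i as the subspace center and G_i as the tangent-space basis.

```latex
f(\theta_p) \approx f(\hat{\theta}_i)
  + \left.\frac{\partial f}{\partial \theta}\right|_{\theta = \hat{\theta}_i}
    (\theta_p - \hat{\theta}_i)
\quad\Longrightarrow\quad
x_p \approx c_i + G_i \,\Delta\theta_{pi},
\qquad
c_i := f(\hat{\theta}_i),\;
G_i := \left.\frac{\partial f}{\partial \theta}\right|_{\theta = \hat{\theta}_i},\;
\Delta\theta_{pi} := \theta_p - \hat{\theta}_i
```

Under this reading, the approximation is only expected to be accurate when θ_p is close to θ̂_i, which is why each sample is reconstructed from neighboring subspaces only.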

Step 6: Determine the error function of the algorithm according to step 5 as

$$ E(\{c_i, G_i\}) = \sum_{p=1}^{n} \sum_{i \in \Gamma_p} w_{pi}^2 \, \| x_p - (c_i + G_i \Delta\theta_{pi}) \|_2^2 + \lambda \sum_{j=1}^{m} \sum_{i \in \Gamma_j} w_{ji}^2 \, \| c_j - (c_i + G_i \Delta\theta_{ji}) \|_2^2 $$

The first term is the weighted sum of the errors incurred when each training sample (x_p, θ_p) is reconstructed from its neighboring subspaces; the neighboring subspaces are selected by the Euclidean distance between the pose angle θ̂_i of each subspace center and θ_p. The second term is a regularization term, in which the center of each subspace is reconstructed from its nearest-neighbor subspaces; this ensures that the subspace parameters vary smoothly and can be estimated from sparse, non-uniformly sampled data. Here λ = (n/m)² is the regularization parameter, the square of the ratio of the number of training samples n to the number of subspaces m, and w_pi is the weight of each neighboring subspace in reconstructing the data sample, given by:

$$ w_{pi} = \frac{\psi(\theta_p, \hat{\theta}_i)}{\sum_{q=1}^{|\Gamma_p|} \psi(\theta_p, \hat{\theta}_q)} $$

where ψ(θ_p, θ̂_i) is a positive-valued function measuring the similarity between the angles θ_p and θ̂_i, given by:

$$ \psi(\theta_p, \hat{\theta}_i) = \frac{1}{\| \theta_p - \hat{\theta}_i \|_2^2} $$
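The weighting in step 6 can be illustrated with a short Python sketch. The function names `similarity` and `neighbor_weights` are hypothetical, and raising an error when a pose coincides exactly with a subspace center is an assumption: the text does not say how that case is handled.

```python
def similarity(theta_p, theta_hat_i):
    """psi: inverse squared Euclidean distance between two pose vectors."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(theta_p, theta_hat_i))
    if dist_sq == 0.0:
        # psi is undefined at zero distance; handling is not given in the text.
        raise ValueError("pose coincides with a subspace center")
    return 1.0 / dist_sq

def neighbor_weights(theta_p, neighbor_center_poses):
    """Normalized weights w_pi over the neighboring subspaces of theta_p."""
    psi = [similarity(theta_p, t) for t in neighbor_center_poses]
    total = sum(psi)
    return [value / total for value in psi]

# Toy example: a 2-D pose and the poses of three neighboring centers.
weights = neighbor_weights([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
```

In the toy example the squared distances are 1, 4, and 8, so the nearest center receives the largest weight and the weights sum to 1.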

Step 7: Let C = [c_1, ..., c_m] ∈ R^{d×m} denote the subspace centers and G = [G_1, ..., G_m] the subspace bases, where d is the input dimension (d = 100 in this patent) and m is the number of subspaces; C and G are then obtained by optimization with a closed-form solution algorithm;

Step 8: For the image feature x_t of a new test sample, the algorithm improves efficiency by determining the nearest-neighbor subspaces of the test sample in the following two steps:

(1) First find 2|Γ_t| candidate subspaces, where |Γ_t| is the preset number of neighboring subspaces, typically 2 to 16; these are the subspaces whose centers c_i are closest to x_t in the input space. Then compute θ_ti for each candidate, where G_i is the basis and c_i the center of the selected neighboring subspace; θ_ti is the pose angle obtained by reconstructing the new test data point x_t from subspace i;

(2) Compare the reconstruction errors, select |Γ_t| neighboring subspaces from the 2|Γ_t| candidates, and record the θ_ti with the minimum reconstruction error as the pose angle θ_t0 of the test data point x_t, where the reconstruction error is:

$$ \mathrm{error} = \| x_t - [c_i + G_i (\theta_{ti} - \hat{\theta}_i)] \|_2^2 $$

Finally, the optimal θ_t is obtained by minimization, where the weights w_ti are computed from the θ_t0 above. The best pose angle θ_t corresponding to the test sample x_t is then:

$$ \theta_t = \Big( \sum_{i \in \Gamma_t} w_{ti}^2 \, G_i^T G_i \Big)^{-1} \sum_{i \in \Gamma_t} w_{ti}^2 \, G_i^T (x_t - c_i + G_i \hat{\theta}_i) $$

Step 9: Parse the obtained θ_t into text in BVH format and render it as the corresponding human pose image.
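The reconstruction error and the final closed-form estimate of step 8 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patent's Matlab implementation: the selected neighboring subspaces and their weights w_ti are taken as given, the function names are hypothetical, and `solve_linear` is a generic Gaussian elimination helper added only so the example is self-contained.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def reconstruction_error(x_t, center, G, theta, theta_hat):
    """Squared error ||x_t - [c_i + G_i (theta - theta_hat_i)]||^2."""
    err = 0.0
    for r in range(len(center)):
        recon = center[r] + sum(
            G[r][j] * (theta[j] - theta_hat[j]) for j in range(len(theta)))
        err += (x_t[r] - recon) ** 2
    return err

def estimate_pose(x_t, subspaces, weights):
    """Closed-form theta_t over the selected neighboring subspaces.

    subspaces: list of (c_i, G_i, theta_hat_i); G_i is d rows of k columns
    weights:   list of w_ti, one per subspace (assumed already computed)
    """
    k = len(subspaces[0][2])
    lhs = [[0.0] * k for _ in range(k)]  # sum of w^2 * G^T G
    rhs = [0.0] * k                      # sum of w^2 * G^T (x - c + G theta_hat)
    for (center, G, theta_hat), w in zip(subspaces, weights):
        d = len(center)
        resid = [x_t[r] - center[r]
                 + sum(G[r][j] * theta_hat[j] for j in range(k))
                 for r in range(d)]
        for p in range(k):
            rhs[p] += w * w * sum(G[r][p] * resid[r] for r in range(d))
            for q in range(k):
                lhs[p][q] += w * w * sum(G[r][p] * G[r][q] for r in range(d))
    return solve_linear(lhs, rhs)

# Toy data built so that both subspaces agree on theta = [0.5, -0.25].
subspace_1 = ([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
subspace_2 = ([2.5, 4.25], [[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0])
x_toy = [1.5, 1.75]
theta_t = estimate_pose(x_toy, [subspace_1, subspace_2], [0.6, 0.4])
```

Solving the linear system directly, rather than forming an explicit inverse, is a design choice of this sketch; it computes the same θ_t whenever the summed matrix is invertible.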

2. The human body pose estimation method based on supervised local subspace as claimed in claim 1, characterized in that the specific steps of step 3 are:

First, the shape context feature matrices of all training samples are concatenated from left to right, and the K-means algorithm is then used to obtain 100 60-dimensional vectors, which we call the cluster centers of the data space. Next, the 200 shape context vectors of each sample vote for these 100 cluster centers with Gaussian weights: the smaller the Euclidean distance between a shape context vector and a cluster center, the closer the vote is to 1, and the larger the distance, the closer it is to 0. This yields the image feature X required in the following steps, where the i-th column is the image feature of the i-th training sample and each sample's feature has 100 dimensions.

3. The human body pose estimation method based on supervised local subspace as claimed in claim 1, characterized in that the specific steps of step 7 are:

The specific method is: replace C and G with U=[C,G], then the error function in step 6 can be rewritten as:

$$ E(U) = \sum_{p=1}^{n} \| (x_p e^T - U V_p S_{\Gamma_p}) W_p \|_F^2 + \lambda \sum_{j=1}^{m} \| U V_j W_j \|_F^2 $$

where e ∈ 1^{1×d}, W_p = diag{w_pi, i ∈ Γ_p}, and W_j = diag{w_ji, i ∈ Γ_j}. In addition, S_{Γ_p} is a 0-1 selection matrix: the indices of the nonzero elements on its diagonal are the indices of the neighboring subspaces of the training sample x_p, and all other elements are 0. For example, if the neighboring subspaces of x_p have indices 2, 4, 5, and 6, then only the elements S₂₂, S₄₄, S₅₅, and S₆₆ are 1 and all other elements are 0. Similarly, a second 0-1 selection matrix has nonzero diagonal elements at the indices of the neighboring subspaces of each subspace center c_j. s is an m×m identity matrix, s_j is an m-dimensional column vector, I is an m×m identity matrix, and E(U) is a convex quadratic function of U. The solution for U is: U = A(B^(x) + B^(c))^-1

Among them, A, B ^(x) , B ^(c) , C, G are respectively:

$$ A = \sum_{p=1}^{n} x_p e^T W_p W_p^T S_{\Gamma_p}^T V_p^T $$

$$ B^{(x)} = \sum_{p=1}^{n} V_p S_{\Gamma_p} W_p W_p^T S_{\Gamma_p}^T V_p^T $$

$$ B^{(c)} = \lambda \sum_{j=1}^{m} V_j W_j W_j^T V_j^T $$

C=U[1:d,1:m]

G=U[1:d,1+m:56×m]

After several iterations of optimization, C and G converge quickly, at which point they give an approximately optimal solution.
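The closed-form expression for U in claim 3 can be checked by setting the gradient of E(U) to zero. The derivation below is a sketch under the assumption that B^(x) + B^(c) is invertible; it uses only the symbols defined in the claim and the standard matrix-calculus identities for the Frobenius norm.

```latex
\frac{\partial E}{\partial U}
  = -2 \sum_{p=1}^{n} \left( x_p e^T - U V_p S_{\Gamma_p} \right)
       W_p W_p^T S_{\Gamma_p}^T V_p^T
    + 2 \lambda \sum_{j=1}^{m} U V_j W_j W_j^T V_j^T = 0

\Longrightarrow\quad
\underbrace{\sum_{p=1}^{n} x_p e^T W_p W_p^T S_{\Gamma_p}^T V_p^T}_{A}
  = U \Big(
      \underbrace{\sum_{p=1}^{n} V_p S_{\Gamma_p} W_p W_p^T S_{\Gamma_p}^T V_p^T}_{B^{(x)}}
    + \underbrace{\lambda \sum_{j=1}^{m} V_j W_j W_j^T V_j^T}_{B^{(c)}}
    \Big)

\Longrightarrow\quad
U = A \left( B^{(x)} + B^{(c)} \right)^{-1}
```

Because E(U) is a convex quadratic function of U, this stationary point is a global minimizer.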