CN109409231B - Multi-feature fusion sign language recognition method based on adaptive hidden Markov - Google Patents
Multi-feature fusion sign language recognition method based on adaptive hidden Markov
- Publication number
- CN109409231B (application CN201811131806.9A)
- Authority
- CN
- China
- Prior art keywords
- sign language
- feature
- video
- fusion
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention discloses a multi-feature fusion sign language recognition method based on an adaptive hidden Markov model, comprising: first, extracting multiple kinds of features from a sign language video database and performing front-end fusion, that is, constructing a feature pool set; then, building an adaptive hidden Markov model for each sign language video under each feature in the feature pool set, and applying a proposed feature selection strategy to obtain suitable back-end score fusion features; finally, after the back-end score fusion features are selected, computing the score vector under each of these features, assigning each vector a different weight, and performing back-end score fusion, thereby obtaining the optimal fusion result. The invention achieves accurate recognition of the sign language category of a sign language video and improves the robustness of recognition.
Description
Technical Field
The invention belongs to the technical field of computer vision and involves pattern recognition, artificial intelligence, and related technologies; specifically, it is a multi-feature fusion sign language recognition method based on an adaptive hidden Markov model.
Technical Background
Deaf people form a large group among people with disabilities, and because they cannot speak, they usually use sign language to communicate. When hearing people who have never learned sign language need to communicate with deaf people, a communication barrier arises, and most hearing people in society have received no sign language education. A sign language translation system, as an aid that helps deaf people integrate into society, is therefore of great significance to them. However, sign language translation remains a difficult problem in computer vision: signers vary widely in build, signing speed, and signing habits, so the recognition setting is highly complex and a satisfactory accuracy is often hard to achieve.
Sign language recognition is a sequence learning problem. Models proposed so far include dynamic time warping (DTW), the support vector machine (SVM), curve matching, and neural networks (NN). DTW is computationally expensive, while the SVM is designed for binary classification and is not directly applicable to multi-class problems. Neural networks presuppose a large amount of training data for model training and optimization; when training data is limited, they cannot reach an optimal model, which degrades sign language recognition accuracy.
As for multi-modal feature fusion, traditional feature fusion comprises front-end fusion, performed at the feature level, and back-end score fusion, performed at the level of classification probability scores. Back-end score fusion usually incurs a large time overhead, and across different models, poorly performing features may dominate the fusion and degrade the fused result.
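The two fusion levels can be illustrated with a small NumPy sketch. The feature vectors, class scores, and fixed weights below are made up for illustration only; the patent learns the late-fusion weights adaptively in step 7.

```python
import numpy as np

# Front-end (early) fusion: concatenate feature vectors before any model sees them.
sp_feat = np.array([0.2, 0.5])            # e.g. a skeleton-point descriptor
hog_feat = np.array([0.1, 0.9, 0.4])      # e.g. a HOG descriptor
early_fused = np.concatenate([sp_feat, hog_feat])
print(early_fused.shape)                  # (5,)

# Back-end (late) score fusion: each feature's model emits one score per
# class, and the per-class scores are combined with weights.
scores_sp = np.array([0.7, 0.2, 0.1])     # class scores from an SP model
scores_hog = np.array([0.5, 0.4, 0.1])    # class scores from a HOG model
w_sp, w_hog = 0.6, 0.4                    # illustrative fixed weights
late_fused = w_sp * scores_sp + w_hog * scores_hog
pred = int(np.argmax(late_fused))         # winning class index
print(pred)                               # 0
```

Early fusion lets one model see all modalities at once; late fusion keeps one model per feature and combines only their outputs, which is where a weak feature can dominate if the weights are chosen poorly.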
Summary of the Invention
To improve sign language recognition accuracy, the present invention provides a multi-feature fusion sign language recognition method based on an adaptive hidden Markov model, with the aim of accurately recognizing the sign language category of a sign language video and improving the robustness of recognition.
The present invention adopts the following technical scheme to solve the technical problem:
The multi-feature fusion sign language recognition method based on an adaptive hidden Markov model of the present invention proceeds by the following steps:
Step 1. Obtain a sign language video database and divide its sign language videos into a training data set and a test data set. The training data set contains videos for N sign language words, each word corresponding to multiple videos. Denote the N sign language words as C = {c1, …, cn, …, cN}, where cn is the n-th sign language word, 1 ≤ n ≤ N.
Take the multiple videos corresponding to each sign language word in the training data set as that word's sign language video set, thereby obtaining the video sets Set1, …, Setn, …, SetN of the N sign language words, where Setn is the video set of the n-th sign language word cn.
Step 2. Construct the feature type set F:
Extract M kinds of features from the sign language videos in the training data set, obtaining the feature type set F = {f1, f2, …, fM}, where fM denotes the M-th feature and M is the total number of feature types.
Step 3. Construct the feature pool set F′:
Step 3.1. Define a variable i and initialize i = 1.
Step 3.2. Define the i-th fusion feature set as Fi and initialize Fi = F.
Step 3.3. Let i = 2.
Step 3.4. Take any i distinct features from the feature type set F and concatenate them in order into one fused feature, thereby obtaining the i-th fusion feature set Fi, composed of all such fused features (there are M-choose-i of them).
Step 3.5. Assign i + 1 to i and check whether i ≤ M holds; if so, go to step 3.4; otherwise, the M fusion feature sets F1, …, Fi, …, FM have been obtained, and step 3.6 is executed.
Step 3.6. Gather all the features in the M fusion feature sets F1, …, Fi, …, FM into the feature pool set F′, denoted F′ = {f′1, …, f′m′, …, f′M′}, where f′m′ is the m′-th feature pool feature and M′ is the total number of feature pool features.
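Steps 3.1 to 3.6 amount to enumerating every non-empty subset of the M base features and treating each subset, concatenated in order, as one pooled feature. A minimal sketch, with placeholder feature names (the embodiment below uses SP and HOG):

```python
from itertools import combinations

def build_feature_pool(base_features):
    """Return the feature pool F': all single features plus every fused
    feature obtained by concatenating i = 2..M distinct base features
    in order (front-end fusion)."""
    pool = list(base_features)                    # step 3.2: F1 = F
    m = len(base_features)
    for i in range(2, m + 1):                     # steps 3.3-3.5
        for combo in combinations(base_features, i):
            pool.append("-".join(combo))          # one fused feature
    return pool

print(build_feature_pool(["SP", "HOG"]))          # ['SP', 'HOG', 'SP-HOG']
print(len(build_feature_pool(["a", "b", "c"])))   # 7
```

For M base features the pool size is M′ = 2**M - 1, consistent with step 3.6.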
Step 4. Using the Gaussian mixture-hidden Markov model (GMM-HMM), construct the set of adaptive hidden Markov models of the N sign language words under the M′ feature pool features:
Step 4.1. Initialize n = 1.
Step 4.2. Initialize m′ = 1.
Step 4.3. Use the affinity propagation (AP) clustering algorithm to cluster the video set Setn of the n-th sign language word cn, obtaining the feature cluster count of Setn under the m′-th feature pool feature f′m′.
Step 4.4. Define the adaptive hidden Markov model of the n-th sign language word cn under the m′-th feature pool feature f′m′, and compute the model's number of hidden states according to formula (1).
In formula (1), G is the number of Gaussian functions in the Gaussian mixture model.
Step 4.5. Given the number of hidden states and the number of Gaussian functions G, learn on the video set Setn of the n-th sign language word cn with the Baum-Welch algorithm, obtaining the adaptive hidden Markov model of cn under the m′-th feature pool feature f′m′.
Step 4.6. Assign m′ + 1 to m′ and check whether m′ ≤ M′ holds; if so, go to step 4.3; otherwise, go to step 4.7.
Step 4.7. Assign n + 1 to n and check whether n ≤ N holds; if so, go to step 4.2; otherwise, the set of adaptive hidden Markov models of the N sign language words under the M′ feature pool features has been obtained, and step 5 is executed.
Step 5. Construct the selected feature set F″ for back-end score fusion:
Step 5.1. Initialize m′ = 1.
Step 5.2. Take any training video A from the training data set and compute, according to formula (2), its score vector under the m′-th feature pool feature f′m′.
In formula (2), each entry is the sign language recognition probability score of the training video A on the adaptive hidden Markov model of the n-th sign language word cn under the m′-th feature pool feature f′m′.
Step 5.3. Repeat step 5.2 until the score vectors of all sign language videos in the training data set under the m′-th feature pool feature f′m′ have been obtained; then compute the sum of the average variances of these score vectors, recorded as the training variance Varm′ of the m′-th feature pool feature f′m′.
Step 5.4. Assign m′ + 1 to m′ and check whether m′ ≤ M′ holds; if so, go to step 5.2; otherwise, the training variances Var1, …, Varm′, …, VarM′ (1 ≤ m′ ≤ M′) of the M′ feature pool features have been obtained, and step 5.5 is executed.
Step 5.5. Sort the training variances Var1, …, Varm′, …, VarM′ in descending order to obtain the sorted training variances.
Set a parameter K with 1 ≤ K < M′, and select the feature pool features corresponding to the first K sorted training variances, forming the selected feature set F″ = {f″1, …, f″k, …, f″K}, where f″k is the k-th selected feature, 1 ≤ k ≤ K.
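The selection criterion of step 5 can be sketched as follows: for each pooled feature, sum the per-video variances of the training score vectors and keep the K features with the largest sums. The score values below are synthetic; in the method proper they come from the per-word HMMs via formula (2).

```python
import numpy as np

def training_variance(score_vectors):
    """Var_m' of step 5.3: sum over all training videos of the variance
    of each video's score vector under one pooled feature."""
    return float(sum(np.var(v) for v in score_vectors))

def select_features(scores_per_feature, k):
    """Step 5.5: rank pooled features by training variance (descending)
    and return the names of the top-k features."""
    ranked = sorted(scores_per_feature,
                    key=lambda name: training_variance(scores_per_feature[name]),
                    reverse=True)
    return ranked[:k]

# Synthetic score vectors for 2 training videos under 3 pooled features:
# peaked vectors discriminate between words, flat ones do not.
scores = {
    "SP":     [np.array([0.9, 0.1, 0.0]), np.array([0.8, 0.1, 0.1])],
    "HOG":    [np.array([0.4, 0.3, 0.3]), np.array([0.35, 0.35, 0.3])],
    "SP-HOG": [np.array([0.95, 0.05, 0.0]), np.array([0.9, 0.1, 0.0])],
}
print(select_features(scores, k=2))   # ['SP-HOG', 'SP']
```

The flat "HOG" score vectors have a small variance sum and are dropped, matching the rationale given in the embodiment below.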
Step 6. Take any test video B from the test data set and compute its score vectors under the selected feature set F″ = {f″1, …, f″k, …, f″K}:
Step 6.1. Initialize k = 1.
Step 6.2. Compute, according to formula (3), the score vector of the test video B under the k-th selected feature f″k.
In formula (3), each entry is the sign language recognition probability score of the test video B on the adaptive hidden Markov model of the n-th sign language word cn under the k-th selected feature f″k.
Step 6.3. Apply Min-Max normalization to the score vector of the test video B under the k-th selected feature f″k to obtain the normalized score vector; arrange its elements in descending order, draw the resulting score curve, and compute the area under the curve, thereby obtaining the weight area corresponding to the normalized score vector.
Step 6.4. Assign k + 1 to k. If k > K holds, the normalized score vectors of the test video B under the K selected features and their corresponding weight areas have been obtained, and step 7 is executed; otherwise, go to step 6.2.
Step 7. Perform the back-end score fusion calculation and output the sign language word corresponding to the test video B:
Step 7.1. Compute, according to formula (4), the weight of the normalized score vector of the test video B under the k-th selected feature f″k, thereby obtaining the respective weights of the normalized score vectors under the K selected features.
Step 7.2. Obtain the back-end score fusion vector of the test video B according to formula (5).
Step 7.3. Obtain, according to formula (6), the sign language word index n* corresponding to the maximum value in the back-end score fusion vector.
The sign language word corresponding to the test video B is thereby the n*-th sign language word.
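Steps 6.3 to 7.3 can be sketched as follows. Formulas (4) through (6) are referenced but not reproduced in this text, so the mapping from weight areas to weights is an assumption here: a sharper score curve (smaller area) is taken to be more discriminative and receives a larger weight via normalized inverse areas.

```python
import numpy as np

def weight_area(score_vec):
    """Step 6.3: Min-Max normalize a score vector, sort it in descending
    order, and take the area under the resulting curve (trapezoidal rule).
    Assumes the vector is not constant (otherwise Min-Max divides by 0)."""
    s = np.asarray(score_vec, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())        # Min-Max normalization
    curve = np.sort(s)[::-1]                       # descending score curve
    return float(np.sum((curve[:-1] + curve[1:]) / 2.0))

def fuse_and_decide(score_vectors):
    """Steps 7.1-7.3 under an assumed weighting rule: weights proportional
    to inverse weight areas, normalized to sum to one (stand-in for
    formula (4)); then weighted sum (formula (5)) and argmax (formula (6))."""
    areas = np.array([weight_area(v) for v in score_vectors])
    w = (1.0 / areas) / np.sum(1.0 / areas)        # assumed form of formula (4)
    norm = []
    for v in score_vectors:
        v = np.asarray(v, dtype=float)
        norm.append((v - v.min()) / (v.max() - v.min()))
    fused = sum(wk * nk for wk, nk in zip(w, norm))  # fusion vector
    return int(np.argmax(fused))                     # word index n*

v1 = [0.9, 0.2, 0.1]    # discriminative feature: sharp curve, small area
v2 = [0.5, 0.45, 0.4]   # nearly flat feature: large area, small weight
print(fuse_and_decide([v1, v2]))   # 0
```

With this rule the flat feature's scores are down-weighted, so they cannot dominate the fused decision, which is the stated goal of the adaptive weighting.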
Compared with the prior art, the beneficial effects of the present invention are:
1. The present invention adopts the Gaussian mixture-hidden Markov model (GMM-HMM), which is commonly used for sequence problems and can still achieve good results when training data is scarce; by combining the adaptive hidden Markov model with front-end fusion, back-end score fusion, and a feature selection strategy, it improves the accuracy and robustness of sign language recognition.
2. The present invention proposes an adaptive hidden Markov model: the affinity propagation (AP) clustering algorithm yields the feature cluster count, under each feature, of each sign language word's video set, from which the best hidden Markov model parameters are obtained adaptively, so that a distinct adaptive hidden Markov model is trained for each sign language word under each feature, significantly improving prediction.
3. The present invention adopts both front-end fusion and back-end score fusion strategies: front-end fusion concatenates different subsets of the extracted video features, producing all possible fused features; further, the proposed back-end score fusion method provides an adaptive weight allocation scheme that reveals the importance of these features and aggregates their recognition probability scores in a weighted manner, preventing poorly performing features from dominating the fusion and corrupting the result.
4. The present invention proposes a feature selection strategy that picks suitable features for back-end score fusion: by comparing the variance performance of all features and selecting those with good variance performance for back-end score fusion, it avoids fusing poorly performing features that would harm the fusion result.
Description of Drawings
Figure 1 is a schematic diagram of the method of the present invention.
Detailed Description of the Embodiments
In this embodiment, as shown in Figure 1, a multi-feature fusion sign language recognition method based on an adaptive hidden Markov model adopts the Gaussian mixture-hidden Markov model (GMM-HMM). First, multiple kinds of features are extracted from the sign language video database and front-end fusion is performed, that is, the feature pool set is constructed. Then, an adaptive hidden Markov model is built for each sign language video under each feature in the feature pool set, and a feature selection strategy is proposed to obtain suitable back-end score fusion features. After these features are selected, the score vector under each of them is computed and assigned a different weight, and back-end score fusion is performed, yielding the optimal fusion result. Specifically, as shown in Figure 1, the method includes the following steps:
Step 1. Obtain a sign language video database and divide its sign language videos into a training data set and a test data set. The training data set contains videos for N sign language words, each word corresponding to multiple videos. Denote the N sign language words as C = {c1, …, cn, …, cN}, where cn is the n-th sign language word, 1 ≤ n ≤ N.
Take the multiple videos corresponding to each sign language word in the training data set as that word's sign language video set, thereby obtaining the video sets Set1, …, Setn, …, SetN of the N sign language words, where Setn is the video set of the n-th sign language word cn.
In this embodiment, the sign language video database contains videos for 370 sign language words, each word corresponding to 25 videos; the videos were performed by 5 signers, each of whom repeated every word 5 times. A sign language word may be a single word or a phrase.
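The 25 videos per word come from 5 signers with 5 repetitions each. The source does not state how the train/test split is made; a signer-independent split (holding out all videos of one signer) is one common protocol in sign language recognition and can be sketched as a hypothetical example:

```python
def signer_split(num_signers=5, reps=5, held_out_signer=4):
    """Index the videos of one word as (signer, repetition) pairs and hold
    out every video of `held_out_signer`. This split protocol is an
    assumption for illustration, not the patent's stated procedure."""
    all_videos = [(s, r) for s in range(num_signers) for r in range(reps)]
    train = [v for v in all_videos if v[0] != held_out_signer]
    held_out = [v for v in all_videos if v[0] == held_out_signer]
    return train, held_out

train, held_out = signer_split()
print(len(train), len(held_out))   # 20 5
```

A signer-independent split tests robustness to unseen signers, which matters given the signer variability described in the background section.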
Step 2. Construct the feature type set F:
Extract M kinds of features from the sign language videos in the training data set, obtaining the feature type set F = {f1, f2, …, fM}, where fM denotes the M-th feature and M is the total number of feature types.
Step 3. Construct the feature pool set F′:
Step 3.1. Define a variable i and initialize i = 1.
Step 3.2. Define the i-th fusion feature set as Fi and initialize Fi = F.
Step 3.3. Let i = 2.
Step 3.4. Take any i distinct features from the feature type set F and concatenate them in order into one fused feature, thereby obtaining the i-th fusion feature set Fi, composed of all such fused features.
Step 3.5. Assign i + 1 to i and check whether i ≤ M holds; if so, go to step 3.4; otherwise, the M fusion feature sets F1, …, Fi, …, FM have been obtained, and step 3.6 is executed.
Step 3.6. Gather all the features in the M fusion feature sets F1, …, Fi, …, FM into the feature pool set F′, denoted F′ = {f′1, …, f′m′, …, f′M′}, where f′m′ is the m′-th feature pool feature and M′ is the total number of feature pool features.
In this embodiment, histogram of oriented gradients (HOG) features are extracted from all sign language videos in the database, and principal component analysis (PCA) is applied to reduce the dimensionality of all these features, yielding the HOG features.
SP features, i.e., skeleton joint coordinates, are extracted from all sign language videos in the database, and random Gaussian perturbation is applied to all SP features to add a moderate amount of noise and avoid overfitting, yielding the SP features.
The SP features and HOG features of all sign language videos in the database are concatenated, yielding the SP-HOG front-end fused feature.
Step 4. Using the Gaussian mixture-hidden Markov model (GMM-HMM), construct the set of adaptive hidden Markov models of the N sign language words under the M′ feature pool features:
Step 4.1. Initialize n = 1.
Step 4.2. Initialize m′ = 1.
Step 4.3. Use the affinity propagation (AP) clustering algorithm to cluster the video set Setn of the n-th sign language word cn, obtaining the feature cluster count of Setn under the m′-th feature pool feature f′m′.
Step 4.4. Define the adaptive hidden Markov model of the n-th sign language word cn under the m′-th feature pool feature f′m′, and compute the model's number of hidden states according to formula (1).
In formula (1), G is the number of Gaussian functions in the Gaussian mixture model; in this embodiment, G is set to 3.
Step 4.5. Given the number of hidden states and the number of Gaussian functions G, learn on the video set Setn of the n-th sign language word cn with the Baum-Welch algorithm, obtaining the adaptive hidden Markov model of cn under the m′-th feature pool feature f′m′. The Baum-Welch algorithm is a classical algorithm for the parameter estimation problem of hidden Markov models.
Step 4.6. Assign m′ + 1 to m′ and check whether m′ ≤ M′ holds; if so, go to step 4.3; otherwise, go to step 4.7.
Step 4.7. Assign n + 1 to n and check whether n ≤ N holds; if so, go to step 4.2; otherwise, the set of adaptive hidden Markov models of the N sign language words under the M′ feature pool features has been obtained, and step 5 is executed.
Step 5. Construct the selected feature set F″ for back-end score fusion:
Step 5.1. Initialize m′ = 1.
Step 5.2. Take any training video A from the training data set and compute, according to formula (2), its score vector under the m′-th feature pool feature f′m′.
In formula (2), each entry is the sign language recognition probability score of the training video A on the adaptive hidden Markov model of the n-th sign language word cn under the m′-th feature pool feature f′m′, computed on the adaptive hidden Markov model by the Viterbi algorithm; the Viterbi algorithm is a dynamic programming algorithm widely used to solve the prediction problem of hidden Markov models.
Step 5.3. Repeat step 5.2 until the score vectors of all sign language videos in the training data set under the m′-th feature pool feature f′m′ have been obtained; then compute the sum of the average variances of these score vectors, recorded as the training variance Varm′ of the m′-th feature pool feature f′m′.
In this embodiment, the sum of the average variances of the score vectors of all training videos under the m′-th feature pool feature f′m′ serves as the feature selection criterion. The average variance of a score vector reflects how far its entries deviate from their mean: a small average variance means that, under the m′-th feature pool feature f′m′, the probabilities relating the training video A to the different sign language words are hard to tell apart, whereas a large average variance indicates that the score vector has good discriminative power.
Step 5.4. Assign m′ + 1 to m′ and check whether m′ ≤ M′ holds; if so, go to step 5.2; otherwise, the training variances Var1, …, Varm′, …, VarM′ (1 ≤ m′ ≤ M′) of the M′ feature pool features have been obtained, and step 5.5 is executed.
Step 5.5. Sort the training variances Var1, …, Varm′, …, VarM′ in descending order to obtain the sorted training variances.
Set a parameter K with 1 ≤ K < M′, and select the feature pool features corresponding to the first K sorted training variances, forming the selected feature set F″ = {f″1, …, f″k, …, f″K}, where f″k is the k-th selected feature, 1 ≤ k ≤ K.
Step 6. Take any test video B from the test data set and compute its score vectors under the selected feature set F″ = {f″1, …, f″k, …, f″K}:
Step 6.1. Initialize k = 1.
Step 6.2. Compute, according to formula (3), the score vector of the test video B under the k-th selected feature f″k.
In formula (3), each entry is the sign language recognition probability score of the test video B on the adaptive hidden Markov model of the n-th sign language word cn under the k-th selected feature f″k.
Step 6.3. Apply Min-Max normalization to the score vector of the test video B under the k-th selected feature f″k to obtain the normalized score vector; arrange its elements in descending order, draw the resulting score curve, and compute the area under the curve, thereby obtaining the weight area corresponding to the normalized score vector.
Step 6.4. Assign k + 1 to k. If k > K holds, the normalized score vectors of the test video B under the K selected features and their corresponding weight areas have been obtained, and step 7 is executed; otherwise, go to step 6.2.
Step 7. Perform the back-end score fusion calculation and output the sign language word corresponding to the test video B:
Step 7.1. Compute, according to formula (4), the weight of the normalized score vector of the test video B under the k-th selected feature f″k, thereby obtaining the respective weights of the normalized score vectors under the K selected features.
Step 7.2. Obtain the back-end score fusion vector of the test video B according to formula (5).
Step 7.3. Obtain, according to formula (6), the sign language word index n* corresponding to the maximum value in the back-end score fusion vector.
The sign language word corresponding to the test video B is thereby the n*-th sign language word.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811131806.9A CN109409231B (en) | 2018-09-27 | 2018-09-27 | Multi-feature fusion sign language recognition method based on adaptive hidden Markov |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811131806.9A CN109409231B (en) | 2018-09-27 | 2018-09-27 | Multi-feature fusion sign language recognition method based on adaptive hidden Markov |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409231A CN109409231A (en) | 2019-03-01 |
CN109409231B true CN109409231B (en) | 2020-07-10 |
Family
ID=65466362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811131806.9A Active CN109409231B (en) | 2018-09-27 | 2018-09-27 | Multi-feature fusion sign language recognition method based on adaptive hidden Markov |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409231B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259804B (en) * | 2020-01-16 | 2023-03-14 | 合肥工业大学 | Multi-modal fusion sign language recognition system and method based on graph convolution |
CN111259860B (en) * | 2020-02-17 | 2022-03-15 | 合肥工业大学 | Multi-order feature dynamic fusion sign language translation method based on data self-driven |
CN113642422B (en) * | 2021-07-27 | 2024-05-24 | 东北电力大学 | Continuous Chinese sign language recognition method |
CN116471460B (en) * | 2023-05-08 | 2025-06-13 | 东南大学 | A fast recognition method for resolution-adaptive encrypted videos based on video fingerprint |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893942A (en) * | 2016-03-25 | 2016-08-24 | 中国科学技术大学 | eSC and HOG-based adaptive HMM sign language identifying method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9817881B2 (en) * | 2013-10-16 | 2017-11-14 | Cypress Semiconductor Corporation | Hidden markov model processing engine |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893942A (en) * | 2016-03-25 | 2016-08-24 | 中国科学技术大学 | eSC and HOG-based adaptive HMM sign language identifying method |
Non-Patent Citations (3)
Title |
---|
Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition; Dan Guo et al.; ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM); Jan. 31, 2018; Vol. 14, No. 1; pp. 8:2-8:15 *
SIGN LANGUAGE RECOGNITION BASED ON ADAPTIVE HMMS WITH DATA AUGMENTATION; Dan Guo et al.; 2016 IEEE International Conference on Image Processing (ICIP); Aug. 19, 2016; pp. 2876-2880 *
Continuous HMM Sign Language Recognition Based on Kinect 3D Joints; Shen Juan et al.; 《计算机与信息工程》 (Computer and Information Engineering); May 2017; Vol. 40, No. 5; pp. 638-642 *
Also Published As
Publication number | Publication date |
---|---|
CN109409231A (en) | 2019-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Taherkhani et al. | AdaBoost-CNN: An adaptive boosting algorithm for convolutional neural networks to classify multi-class imbalanced datasets using transfer learning | |
CN108182427B (en) | A face recognition method based on deep learning model and transfer learning | |
CN109409231B (en) | Multi-feature fusion sign language recognition method based on adaptive hidden Markov | |
US7362892B2 (en) | Self-optimizing classifier | |
CN103605972B (en) | Non-restricted environment face verification method based on block depth neural network | |
CN110164452A (en) | A kind of method of Application on Voiceprint Recognition, the method for model training and server | |
CN107729999A (en) | Consider the deep neural network compression method of matrix correlation | |
CN117611932B (en) | Image classification method and system based on double pseudo tag refinement and sample re-weighting | |
CN106022273A (en) | Handwritten form identification system of BP neural network based on dynamic sample selection strategy | |
CN101447020A (en) | Pornographic image recognizing method based on intuitionistic fuzzy | |
WO2021190046A1 (en) | Training method for gesture recognition model, gesture recognition method, and apparatus | |
CN116226629B (en) | Multi-model feature selection method and system based on feature contribution | |
CN102103691A (en) | Identification method for analyzing face based on principal component | |
CN109493916A (en) | A kind of Gene-gene interactions recognition methods based on sparsity factorial analysis | |
CN113420833A (en) | Visual question-answering method and device based on question semantic mapping | |
CN111079837A (en) | Method for detecting, identifying and classifying two-dimensional gray level images | |
CN107193993A (en) | The medical data sorting technique and device selected based on local learning characteristic weight | |
CN109101984B (en) | Image identification method and device based on convolutional neural network | |
CN109255339A (en) | Classification method based on adaptive depth forest body gait energy diagram | |
Azam et al. | Speaker verification using adapted bounded Gaussian mixture model | |
CN109948662B (en) | A deep clustering method of face images based on K-means and MMD | |
CN111694954A (en) | Image classification method and device and electronic equipment | |
CN111860601B (en) | Method and device for predicting type of large fungi | |
Halkias et al. | Sparse penalty in deep belief networks: using the mixed norm constraint | |
CN108734116B (en) | Face recognition method based on variable speed learning deep self-coding network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||