CN103474061A - Automatic distinguishing method based on integration of classifier for Chinese dialects

Info

Publication number
CN103474061A
Authority
CN
China
Prior art keywords
classification
dialect
vector
chinese
gmm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013104161737A
Other languages
Chinese (zh)
Inventor
朱贺
高红民
王慧斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University (HHU)
Priority to CN2013104161737A
Publication of CN103474061A

Abstract

The invention discloses an automatic Chinese dialect identification method based on classifier fusion, comprising four steps: Chinese dialect speech feature extraction, dialect model matching and scoring, classification vector extraction, and back-end classification. A two-stage feature extraction scheme is adopted in which a Gaussian mixture model (GMM) serves as a high-level feature extractor. The speech features are scored by dialect GMMs that encode prior knowledge of dialect speech, and the scores are normalized and differenced to form classification vectors with large inter-class differences and strong intra-class cohesion. These vectors are fed into a back-end support vector machine (SVM) classifier. By fusing the strength of the GMM in fitting data distributions with the strength of the SVM in modeling decision boundaries, the method identifies the dialect region to which a Chinese dialect utterance belongs. The invention can be applied stably and reliably to identification tasks such as Chinese telephone dialect speech, with high accuracy.

Description

Automatic Chinese Dialect Identification Method Based on Classifier Fusion

Technical Field

The present invention relates to a speech identification method based on multi-classifier fusion, and in particular to a Chinese dialect identification method, belonging to the field of speech signal processing.

Background Art

Automatic Chinese dialect identification is a speech processing technique in which a computer analyzes an input speech segment and determines the dialect region of the speaker. In a multi-ethnic, multi-dialect country such as China, research on automatic Chinese dialect identification lays a foundation for barrier-free communication among ethnic groups, and with the rapid development of science and technology it carries great application value and broad application prospects. As a branch of speech recognition research, early Chinese dialect identification systems usually adopted a single-classifier, single-feature design strategy and neglected the use of information fusion in system design, so that the system depended entirely on one classifier and one feature, which limited the improvement of system performance.

Multi-source information fusion is currently a hot topic in information processing research: it not only describes objective phenomena more comprehensively and in greater detail, but also enables the mining of deeper information. In speech processing, information fusion is mainly realized in two ways: multi-feature fusion and multi-classifier fusion. The former follows a multi-feature, single-classifier design strategy in which the weighted sum of the scores of different features lets one system use several features at once, yielding decisions with higher accuracy. The latter follows a multi-classifier design strategy in which complementary classifiers are fused into one system, and the differences in their classification strategies are exploited to perform multiple classifications and fuse the results. Existing research on classifier fusion mostly targets text-dependent speech recognition; fusion mechanisms suited to text-independent speech recognition are rare.

Summary of the Invention

Purpose of the invention: in view of the problems in the prior art, the present invention proposes a new classifier fusion mechanism built on a two-stage classifier framework, namely an automatic Chinese dialect identification method based on classifier fusion. The invention extracts the difference information between the speech features of Chinese dialects more effectively, is better suited to text-independent recognition systems such as dialect and language identification, and significantly improves classification ability and robustness.

In classifier fusion, the performance of the fused system depends mainly on two points: the choice of classifiers and the design of the fusion mechanism. When choosing classifiers, the individual classifiers are usually required to be complementary in their classification strategies, so that fusion yields decisions with higher confidence. For this reason, the generative Gaussian mixture model (GMM) classifier and the discriminative support vector machine (SVM) classifier are chosen as fusion objects. As a generative classifier, the GMM fits data well and describes the distribution of the overall data, but because its parameters must be learned from sufficient data, it demands a large training set and a long training period. The SVM, by contrast, does not fit the data distribution well but describes the state of the decision boundary clearly. GMM and SVM are therefore complementary in principle, and fusing them exploits the strengths of both classifiers. The fusion mechanism can be designed as back-end score fusion or as multi-stage fusion. In score fusion, the SVM decision is given a confidence score that is combined with the GMM score by a weighted sum to make the class decision; in multi-stage fusion, the GMM acts as a generator of classification vectors that contain global information and are fed into the SVM for classification. In dialect identification, the data distribution is too complex and the data volume too large for the SVM to classify and score the raw speech features directly, and choosing the weights in score fusion is also difficult; a multi-stage classifier fusion system is therefore better suited to Chinese dialect identification. Traditional two-stage GMM-SVM fusion usually adopts the Fisher kernel as the fusion mechanism, so that the extracted features contain both the acoustic information of the dialect speech and the global information of the dialect, forming a high-level classification vector. This approach has several limitations, however. First, the mapping space of the Fisher kernel suffers from the curse of dimensionality and can hardly handle text-independent speech recognition with large data volumes. Second, for the same phonetic unit there is some correlation between the scores of different dialect models, as shown in Table 1, and this correlation weakens the class representativeness of the classification vector. Finally, for dialect identification the classification features are expected to reflect the inter-class differences between dialects, that is, the differences between the scores that different dialect models assign to the same speech segment.

Table 1. Scores assigned to a phonetic unit by different dialect models

[Table 1 appears only as an image in the original document (Figure BDA0000381213120000021); its values are not reproduced here.]

Technical solution: an automatic Chinese dialect identification method based on classifier fusion, in which the generative Gaussian mixture model (GMM) classifier and the discriminative support vector machine (SVM) classifier are chosen as fusion objects. The GMM is a generative probabilistic model whose probability density is computed as:

P(x | W_n) = Σ_{i=1..k} w_ni · 1 / ((2π)^(N/2) |Σ_ni|^(1/2)) · exp(−(1/2) (x − μ_ni)^T Σ_ni^(−1) (x − μ_ni))    (1)

where x is the acoustic feature vector of a phonetic unit, w_ni, μ_ni and Σ_ni are the weight, mean and covariance matrix of each Gaussian component of the dialect GMM, and k is the number of mixture components. The input Chinese dialect signal first undergoes speech feature extraction. To extract the new classification features, the dialect GMMs are first trained on a known set of training samples; the speech data are then fed into the trained GMMs of the various dialects, each phonetic unit is given a likelihood score, and the scores form the score vector [P(x_i|μ_1,Σ_1) P(x_i|μ_2,Σ_2) … P(x_i|μ_N,Σ_N)], realizing a mapping from the original speech feature space to the score space. Finally, the vector is normalized and differenced. The computation steps are as follows:

1. Normalize the scores of the speech:

SV_i = (1/C_i) · [P(x_i|μ_1,Σ_1) P(x_i|μ_2,Σ_2) … P(x_i|μ_N,Σ_N)]    (2)

where C_i is a normalization factor, taken here as C_i = max_n P(x_i|μ_n,Σ_n), n = 1…N.

2. Compute the score differences:

φ′(x_i) = [(SV_i1 − SV_i2)(SV_i1 − SV_i3) … (SV_i1 − SV_iN), (SV_i2 − SV_i3)(SV_i2 − SV_i4) … (SV_i2 − SV_iN), …, (SV_i(N−1) − SV_iN)]    (3)

The SVM classifier is then trained on the training classification vectors.
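
As an illustration of eq. (1) (a minimal sketch, not taken from the patent), the following NumPy code evaluates a full-covariance GMM density for a single acoustic feature vector; the toy component weights, means, covariances and the test frame are hypothetical placeholders.

```python
import numpy as np

def gmm_likelihood(x, weights, means, covs):
    """P(x | W_n) of eq. (1): weighted sum of full-covariance Gaussian
    densities, evaluated for one acoustic feature vector x."""
    N = x.shape[0]                                   # feature dimension
    total = 0.0
    for w, mu, sigma in zip(weights, means, covs):
        diff = x - mu
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** N * np.linalg.det(sigma))
        total += w * norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)
    return total

# Hypothetical toy GMM: k = 2 components in a 3-dimensional feature space.
rng = np.random.default_rng(0)
weights = np.array([0.6, 0.4])
means = rng.normal(size=(2, 3))
covs = np.stack([np.eye(3), 2.0 * np.eye(3)])
print(gmm_likelihood(rng.normal(size=3), weights, means, covs))
```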

By adopting the above technical solution, the present invention has the following beneficial effects: it resolves the problems of the Fisher kernel in the design of classification vectors while expressing inter-class difference information, and it is better suited to speech identification tasks such as dialect and language identification.

Brief Description of the Drawings

Fig. 1 is a flowchart of the method according to an embodiment of the present invention;

Fig. 2 shows the distributions of the primary feature vectors and of the classification vectors in an embodiment of the present invention.

Detailed Description of the Embodiments

The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.

As shown in Fig. 1, the automatic Chinese dialect identification method based on classifier fusion chooses the generative Gaussian mixture model (GMM) classifier and the discriminative support vector machine (SVM) classifier as fusion objects. The GMM is a generative probabilistic model that can roughly describe the global information of the data space; its probability density is computed as:

P(x | W_n) = Σ_{i=1..k} w_ni · 1 / ((2π)^(N/2) |Σ_ni|^(1/2)) · exp(−(1/2) (x − μ_ni)^T Σ_ni^(−1) (x − μ_ni))    (1)

where x is the acoustic feature vector of a phonetic unit, w_ni, μ_ni and Σ_ni are the weight, mean and covariance matrix of each Gaussian component of the dialect GMM, and k is the number of mixture components. To extract the new classification features, the dialect GMMs are first trained on a known set of training samples. The speech data are then fed into the trained GMMs of the various dialects, each phonetic unit is given a likelihood score, and the scores form the score vector [P(x_i|μ_1,Σ_1) P(x_i|μ_2,Σ_2) … P(x_i|μ_N,Σ_N)], realizing a mapping from the original speech feature space to the score space. Finally, the vector is normalized and differenced. The computation steps are as follows:

1. Normalize the scores of the speech:

SV_i = (1/C_i) · [P(x_i|μ_1,Σ_1) P(x_i|μ_2,Σ_2) … P(x_i|μ_N,Σ_N)]    (2)

where C_i is a normalization factor, taken here as C_i = max_n P(x_i|μ_n,Σ_n), n = 1…N.

2. Compute the score differences:

φ′(xi)=[(SVi1-SVi2)(SVi1-SVi3)…(SVi1-SViN),(SVi2-SVi3)(SVi2-SVi4)φ′(xi ) =[(SV i1 -SV i2 )(SV i1 -SV i3 )...(SV i1 -SV iN ),(SV i2 -SV i3 )(SV i2 -SV i4 )

…(SVi2-SViN),…,(SViN-1-SViN)]    (3)…(SV i2 -SV iN ),…,(SV iN-1 -SV iN )] (3)

In this fusion process, the normalization reduces the influence on the recognition rate of the inter-model score correlation mentioned above, while differencing the scores that the different dialect GMMs assign to a phonetic unit extracts the inter-class difference information of the dialects, so that φ′(x_i) contains not only acoustic information and global information but also inter-dialect difference information. The distributions of the classification vectors of the Wu, Yue (Cantonese) and Min dialects (Fig. 2) show that, compared with the original features, the classification vectors of speech from the same dialect under the new fusion mechanism are more strongly clustered and better separated between classes, making them better suited to dialect identification.
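
The classification-vector extraction of eqs. (2) and (3) can be sketched as follows (an illustrative sketch; the example likelihood values are hypothetical):

```python
import numpy as np
from itertools import combinations

def classification_vector(scores):
    """Map the N dialect-GMM likelihoods of one speech unit to phi'(x_i):
    normalize by the maximum score (eq. 2), then take all ordered pairwise
    differences (eq. 3)."""
    sv = np.asarray(scores, dtype=float)
    sv = sv / sv.max()                               # eq. (2): C_i = max_n P(x_i | mu_n, Sigma_n)
    return np.array([sv[a] - sv[b]                   # eq. (3): (SV_i1-SV_i2), ..., (SV_i(N-1)-SV_iN)
                     for a, b in combinations(range(sv.size), 2)])

# Example: hypothetical likelihoods of one utterance under 4 dialect GMMs.
phi = classification_vector([2.1e-5, 1.3e-5, 0.4e-5, 1.9e-5])
print(phi)          # N*(N-1)/2 = 6 difference components
```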

Owing to the strong classification ability and good generalization of the SVM, it is usually chosen as the back-end classifier in multi-stage classifier fusion. Chinese dialect identification is essentially a multi-class classification problem, currently addressed mainly with decision-tree algorithms or with one-versus-one and one-versus-rest classification strategies. Because the distribution of multi-class sample data is complex, however, extensive experiments have shown that identification systems based on these strategies do not handle multi-class classification well. The present invention adopts the ECOC algorithm, which encodes each class to be distinguished as a binary codeword that serves as the class label. During encoding, the algorithm requires the codewords of the rows and columns of the code matrix to remain independent and separable. Accordingly, the ECOC algorithm requires that, for 3 ≤ k ≤ 7, the maximum codebook length be 2^(k−1) − 1, where k is the number of classes. The coding rule is: the first row is the unit (all-ones) vector; the second row alternates blocks of 2^(k−2) zeros and 2^(k−2) ones; and, by analogy, the i-th row alternates blocks of 2^(k−i) zeros and 2^(k−i) ones. For a four-class problem, a 7-dimensional codebook is therefore needed for the code design, as shown in Table 2, in which each row vector is the ECOC codeword of one class. Classifiers f_1, f_2, …, f_n, with n ≤ 2^(k−1) − 1, are designed according to the class labels given by the column vectors of the code matrix. During testing, the algorithm first classifies the input speech with the rules f_1, f_2, …, f_n, then encodes the unknown speech according to the classification results to obtain its codeword, and finally matches this codeword against the known class codewords to make the decision. ECOC matching uses a nearest-neighbour rule based on the Hamming distance and therefore has a degree of error tolerance, which is especially important in multi-class classification. The ECOC algorithm is used here to identify multiple dialect classes, as shown in Table 2; a generation and decoding sketch is given after the table.

Table 2. Class coding

[Table 2 appears only as an image in the original document (Figure BDA0000381213120000051); its values are not reproduced here.]
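
A sketch of ECOC codebook generation and Hamming-distance decoding, following the coding rule described above (an illustration under the stated rule, not the patented implementation); for k = 4 it yields one 7-bit codeword per class, in line with the 7-dimensional codebook of Table 2.

```python
import numpy as np

def ecoc_codebook(k):
    """Codebook of length 2**(k-1) - 1 per the rule above: row 1 is all ones,
    row i alternates blocks of 2**(k-i) zeros and 2**(k-i) ones (3 <= k <= 7)."""
    length = 2 ** (k - 1) - 1
    book = np.ones((k, length), dtype=int)
    for i in range(2, k + 1):
        block = 2 ** (k - i)
        pattern = ([0] * block + [1] * block) * length
        book[i - 1] = pattern[:length]
    return book

def ecoc_decode(bits, codebook):
    """Pick the class whose codeword is nearest in Hamming distance."""
    return int(np.argmin((codebook != np.asarray(bits)).sum(axis=1)))

book = ecoc_codebook(4)                 # 4 dialect classes -> 7 binary classifiers f_1..f_7
print(book)
print(ecoc_decode([0, 0, 1, 1, 0, 1, 1], book))   # a one-bit error still decodes to class index 2
```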

During training, a 128-dimensional GMM model is trained for each dialect on the training speech data, and a likelihood score is output for each training utterance (15 s). Normalization and differencing then yield the training classification vectors, on which the SVM classifier is trained.
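
A hedged sketch of this training step, using scikit-learn's GaussianMixture and SVC as stand-ins for the dialect GMMs and the back-end SVM; the diagonal covariance type, the iteration limit and the RBF kernel are assumptions, and a one-vs-one SVC stands in for the ECOC scheme of the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_dialect_gmms(frames_per_dialect, n_components=128):
    """Train one GMM per dialect on its pooled acoustic feature frames
    (frames_per_dialect: list of (num_frames, feat_dim) arrays)."""
    return [GaussianMixture(n_components=n_components, covariance_type='diag',
                            max_iter=200, random_state=0).fit(frames)
            for frames in frames_per_dialect]

def utterance_scores(gmms, frames):
    """Average per-frame log-likelihood of one utterance under each dialect GMM."""
    return np.array([gmm.score(frames) for gmm in gmms])

def train_backend_svm(class_vectors, dialect_labels):
    """Back-end SVM on the classification vectors (one-vs-one SVC as a stand-in
    for the ECOC scheme)."""
    return SVC(kernel='rbf').fit(class_vectors, dialect_labels)
```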

During testing, the input Chinese dialect speech data are fed into the GMM models for scoring according to the above procedure, the classification vector is extracted, and classification is performed.

This completes one identification of a Chinese dialect.
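
A matching sketch of the test flow, reusing the hypothetical helpers from the earlier sketches (utterance_scores, classification_vector, and a trained back-end SVM); the log-likelihoods are shifted and exponentiated here so that the maximum-based normalization of eq. (2) operates on positive scores.

```python
import numpy as np

def identify_dialect(frames, gmms, svm):
    """Score one utterance with every dialect GMM, build the classification
    vector, and let the back-end SVM decide the dialect class."""
    loglik = utterance_scores(gmms, frames)        # one log-likelihood per dialect GMM
    scores = np.exp(loglik - loglik.max())         # relative likelihoods, maximum = 1
    vec = classification_vector(scores)            # eq. (2) normalization + eq. (3) differences
    return svm.predict(vec.reshape(1, -1))[0]
```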

Claims (4)

1. A Chinese dialect identification method based on multi-classifier fusion, characterized in that: a GMM and an SVM are chosen as fusion objects; an input Chinese dialect signal undergoes speech feature extraction; in the extraction of the new classification features, the dialect GMMs are first trained on a known set of training samples; the speech data are then fed into the trained GMMs of the various dialects, each phonetic unit is given a likelihood score, and the scores form the score vector [P(x_i|μ_1,Σ_1) P(x_i|μ_2,Σ_2) … P(x_i|μ_N,Σ_N)], realizing a mapping from the original speech feature space to the score space; this score vector is then normalized and differenced; and subsequently an SVM classifier is trained on the training classification vectors.
2. The Chinese dialect identification method based on multi-classifier fusion according to claim 1, characterized in that the GMM is a generative probabilistic model whose probability density is computed as:
P(x | W_n) = Σ_{i=1..k} w_ni · 1 / ((2π)^(N/2) |Σ_ni|^(1/2)) · exp(−(1/2) (x − μ_ni)^T Σ_ni^(−1) (x − μ_ni))    (1)
where x is the acoustic feature vector of a phonetic unit, w_ni, μ_ni and Σ_ni are the weight, mean and covariance matrix of each Gaussian component of the dialect GMM, and k is the number of mixture components.
3. The Chinese dialect identification method based on multi-classifier fusion according to claim 1, characterized in that the score vector is normalized and differenced as follows:
1) Normalize the scores of the speech:
SV_i = (1/C_i) · [P(x_i|μ_1,Σ_1) P(x_i|μ_2,Σ_2) … P(x_i|μ_N,Σ_N)]    (2)
where C_i is a normalization factor, taken as C_i = max_n P(x_i|μ_n,Σ_n), n = 1…N;
2) Compute the score differences:
φ′(x_i) = [(SV_i1 − SV_i2)(SV_i1 − SV_i3) … (SV_i1 − SV_iN), (SV_i2 − SV_i3)(SV_i2 − SV_i4) … (SV_i2 − SV_iN), …, (SV_i(N−1) − SV_iN)]    (3).
4. The Chinese dialect identification method based on multi-classifier fusion according to claim 1, characterized in that: when training the SVM classifier on the training classification vectors, the ECOC algorithm is adopted to encode each class to be distinguished as a binary codeword that serves as the class label; during encoding, the codewords of the rows and columns of the code matrix are required to remain independent and separable; when 3 ≤ k ≤ 7, the maximum length of the codebook is 2^(k−1) − 1, where k is the number of classes; the coding rule is that the first row is the unit (all-ones) vector, the second row alternates blocks of 2^(k−2) zeros and 2^(k−2) ones, and, by analogy, the i-th row alternates blocks of 2^(k−i) zeros and 2^(k−i) ones; assuming a four-class problem as the classification object, a 7-dimensional codebook is needed for the code design, in which each row vector is the ECOC codeword of one class; classifiers f_1, f_2, …, f_n, n ≤ 2^(k−1) − 1, are designed according to the class labels given by the column vectors of the code matrix; during testing, the input speech is first classified according to the rules f_1, f_2, …, f_n, the unknown speech is then encoded according to the classification results to obtain its codeword, and finally this codeword is matched against the known class codewords.
CN2013104161737A 2013-09-12 2013-09-12 Automatic distinguishing method based on integration of classifier for Chinese dialects Pending CN103474061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013104161737A CN103474061A (en) 2013-09-12 2013-09-12 Automatic distinguishing method based on integration of classifier for Chinese dialects

Publications (1)

Publication Number Publication Date
CN103474061A true CN103474061A (en) 2013-12-25

Family

ID=49798882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013104161737A Pending CN103474061A (en) 2013-09-12 2013-09-12 Automatic distinguishing method based on integration of classifier for Chinese dialects

Country Status (1)

Country Link
CN (1) CN103474061A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787075A (en) * 2005-12-13 2006-06-14 浙江大学 Method for distinguishing speek speek person by supporting vector machine model basedon inserted GMM core
US20110196678A1 (en) * 2007-08-22 2011-08-11 Nec Corporation Speech recognition apparatus and speech recognition method
CN101894548A (en) * 2010-06-23 2010-11-24 清华大学 Modeling method and modeling device for language identification
CN103077709A (en) * 2012-12-28 2013-05-01 中国科学院声学研究所 Method and device for identifying languages based on common identification subspace mapping

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐颖 (Xu Ying): "Research on Acoustic Modeling Methods for Language Identification" (语种识别声学建模方法研究), Master's thesis, University of Science and Technology of China *
顾明亮, 夏玉果, 张长水 (Gu Mingliang, Xia Yuguo, Zhang Changshui): "Chinese Dialect Identification Based on Support Vector Machines" (基于支撑矢量机的汉语方言辨识), Computer Engineering and Applications (计算机工程与应用) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN104036774B (en) * 2014-06-20 2018-03-06 国家计算机网络与信息安全管理中心 Tibetan dialect recognition methods and system
CN105654102A (en) * 2014-11-10 2016-06-08 富士通株式会社 Data processing device and data processing method
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN105810191B (en) * 2016-03-08 2019-11-29 江苏信息职业技术学院 Merge the Chinese dialects identification method of prosodic information
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN108231063A (en) * 2016-12-13 2018-06-29 中国移动通信有限公司研究院 A kind of recognition methods of phonetic control command and device
CN107452379B (en) * 2017-08-17 2021-01-05 广州腾猴科技有限公司 Dialect language identification method and virtual reality teaching method and system
CN107452379A (en) * 2017-08-17 2017-12-08 广州腾猴科技有限公司 The identification technology and virtual reality teaching method and system of a kind of dialect language
CN108461091A (en) * 2018-03-14 2018-08-28 南京邮电大学 Intelligent crying detection method towards domestic environment
CN108877784A (en) * 2018-09-05 2018-11-23 河海大学 A kind of robust speech recognition methods based on accents recognition
CN109461457A (en) * 2018-12-24 2019-03-12 安徽师范大学 A kind of audio recognition method based on SVM-GMM model
CN111179916A (en) * 2019-12-31 2020-05-19 广州市百果园信息技术有限公司 Re-scoring model training method, voice recognition method and related device
CN111179916B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Training method for re-scoring model, voice recognition method and related device
CN113192491A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Acoustic model generation method and device, computer equipment and storage medium
CN113192491B (en) * 2021-04-28 2024-05-03 平安科技(深圳)有限公司 Acoustic model generation method, acoustic model generation device, computer equipment and storage medium
CN113673643A (en) * 2021-08-19 2021-11-19 江苏农牧人电子商务股份有限公司 Method and system for supervising agricultural product supply
CN114360500A (en) * 2021-09-14 2022-04-15 腾讯科技(深圳)有限公司 Speech recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131225