CN103117060B

CN103117060B - Modeling method and modeling system of acoustic model for speech recognition

Info

Publication number: CN103117060B
Application number: CN201310020010.7A
Authority: CN
Inventors: 颜永红; 肖业鸣; 潘接林
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2013-01-18
Filing date: 2013-01-18
Publication date: 2015-10-28
Anticipated expiration: 2033-01-18
Also published as: CN103117060A

Abstract

The invention relates to a modeling method and a modeling system for an acoustic model for speech recognition. The method includes: training an initial model whose modeling units are triphone states clustered by a phoneme decision tree, the model also providing the state transition probabilities; performing forced alignment on the training data based on the initial model to obtain its frame-level state information; pre-training a deep neural network to obtain the initial weights of each hidden layer; and training the initialized network with the error back-propagation algorithm, based on the obtained frame-level state information, to update the weights. The invention uses context-dependent triphone states as modeling units and models them with a deep neural network. The weights of each hidden layer are initialized with a restricted Boltzmann machine and can subsequently be updated by the error back-propagation algorithm, which effectively alleviates the risk of the network falling into a local extremum during training and further improves the modeling accuracy of the acoustic model.

Description

Modeling method and modeling system of acoustic model for speech recognition

Technical Field

The invention relates to the field of speech recognition, and in particular to a modeling method and a modeling system for an acoustic model for speech recognition.

Background

The current mainstream framework for speech recognition is based on statistical pattern recognition. A typical speech recognition system, shown in Fig. 1, includes a speech acquisition and front-end processing module, a feature extraction module, an acoustic model module, a language model module, and a decoder module. The basic recognition flow is as follows: the speech acquisition device captures human speech, which undergoes front-end processing and then feature extraction. The extracted feature sequence, such as MFCC or PLP, is passed through the acoustic model to obtain its observation probabilities, which are combined with the language model probabilities and sent to the decoder to obtain the most likely text sequence. Acoustic modeling is based on the hidden Markov framework and uses a Gaussian mixture model to model the probability distribution of the speech features. The Gaussian mixture model makes some inappropriate assumptions about speech features and their distribution, for example that adjacent speech features are linearly independent and that the observation probability follows a Gaussian mixture distribution. In addition, the objective function used to train the parameters of the Gaussian mixture model maximizes the likelihood of the observed features, whereas decoding uses the maximum a posteriori criterion, so the probabilistic model is inconsistent between training and decoding. The traditional acoustic model therefore has low modeling accuracy, which leads to poor speech recognition performance.

Summary of the Invention

In view of the above problems, embodiments of the present invention provide a modeling method and a modeling system for an acoustic model for speech recognition.

In a first aspect, an embodiment of the present invention provides a modeling method for an acoustic model for speech recognition. The method comprises: training a hidden Markov-Gaussian mixture (HMM-GMM) model with training data, where the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the expectation-maximization (EM) algorithm; performing forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain frame-level state information of the speech features; pre-training the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and training the deep neural network with the error back-propagation algorithm, based on the triphone states of the speech features of the training data, to update the weights of each hidden layer.

Preferably, performing forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain the frame-level state information of the speech features specifically comprises: based on the HMM-GMM model, associating the speech features of the training data with their most probable triphone states to obtain the frame-level state information of the speech features.

Preferably, pre-training the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network specifically comprises: training restricted Boltzmann machines layer by layer on the training data until convergence, and initializing the weights of each hidden layer of the deep network with the obtained parameters.

In a second aspect, an embodiment of the present invention provides a modeling system for an acoustic model for speech recognition. The system comprises: a first module, configured to train a hidden Markov-Gaussian mixture (HMM-GMM) model with training data, where the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the expectation-maximization (EM) algorithm; a second module, configured to perform forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain frame-level state information of the speech features; a third module, configured to pre-train the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and a fourth module, configured to train the deep neural network with the error back-propagation algorithm, based on the triphone states of the speech features of the training data, to update the weights of each hidden layer.

Preferably, the second module performing forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain the frame-level state information of the speech features specifically comprises: the second module, based on the HMM-GMM model, associating the speech features of the training data with their most probable triphone states to obtain the frame-level state information of the speech features.

Preferably, the third module pre-training the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network specifically comprises: the third module training restricted Boltzmann machines layer by layer on the training data until convergence, and initializing the weights of each hidden layer of the deep network with the obtained parameters.

The embodiment of the present invention uses triphone states as modeling units and models them with a deep neural network. The weights of each hidden layer of the network are initialized with a restricted Boltzmann machine and can subsequently be updated by the error back-propagation algorithm, which effectively alleviates the risk of the network falling into a local extremum during training and further improves the modeling accuracy of the acoustic model.

Brief Description of the Drawings

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic diagram of an existing speech recognition system;

Fig. 2 is a block diagram of a speech recognition system based on a context-dependent deep neural network according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a modeling method for an acoustic model for speech recognition according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a modeling system for an acoustic model for speech recognition according to an embodiment of the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention are described in further detail below with reference to the drawings and embodiments.

Because the Gaussian mixture model requires inappropriate assumptions about speech features and their probability distribution, the embodiment of the present invention uses a context-dependent deep neural network instead of the Gaussian mixture model for acoustic modeling. The deep neural network contains multiple hidden layers, and its modeling units are the context-dependent triphone states obtained by phoneme decision tree clustering. The basic block diagram of the whole system is shown in Fig. 2.

The deep neural network is trained with the minimum cross-entropy criterion as the objective function. Because the network has multiple hidden layers, its error function has many local extrema, so training easily falls into a local extremum and converges prematurely. To address this problem, the neural computing field has proposed initializing the weight parameters through pre-training of the network and then training the network parameters with the conventional error back-propagation algorithm. The pre-training algorithm uses a restricted Boltzmann machine, a bidirectional graphical model that consists of one visible layer and one hidden layer, in which units within the same layer are not connected to each other while units in different layers are densely connected. The model defines the joint distribution of the visible-layer and hidden-layer variables through an energy function, which in the standard restricted Boltzmann machine formulation is:

$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{v',h'} e^{-E(v',h')}}$$

where v is the visible-layer variable, h is the hidden-layer variable, E(v,h) is the energy function, and p(v,h) is their joint probability. Training maximizes the likelihood p(v) of the observed features, and the weight update formulas are as follows:

$$\Delta w_{ij} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}$$

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}$$

where $w_{ij}$ is the connection weight between visible unit i and hidden unit j, t is the iteration index, and $\langle \cdot \rangle$ denotes the mean of the quantity inside the brackets, taken over the training data and over the model distribution respectively.

Restricted Boltzmann machines are trained layer by layer and their parameters are used to initialize the deep neural network, so that the initial weights land at a relatively good starting point in weight space. This alleviates, to a certain extent, the risk of the network falling into a local extremum during training. In addition, the triphone states clustered by the phoneme decision tree are used as the teacher signal of the neural network. Because these states carry the contextual relationships between phonemes, the acoustic model becomes finer and more accurate.
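The weight update rule and the layer-by-layer pre-training described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the patent: it assumes binary (Bernoulli) visible and hidden units, approximates the model expectation with one step of contrastive divergence (CD-1), adds a learning rate `lr` that the update formula above does not show, and runs a fixed number of epochs instead of testing for convergence. The names `cd1_update` and `pretrain_stack` are hypothetical. Real-valued speech features would normally call for a Gaussian visible layer in the first machine, which is omitted here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """p(h_j = 1 | v) for binary hidden units; W is n_visible x n_hidden."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """p(v_i = 1 | h) for binary visible units."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def cd1_update(batch, W, b, c, lr, rng):
    """One CD-1 step over a batch; updates W, b and c in place."""
    n_vis, n_hid = len(b), len(c)
    dW = [[0.0] * n_hid for _ in range(n_vis)]
    db = [0.0] * n_vis
    dc = [0.0] * n_hid
    for v0 in batch:
        ph0 = hidden_probs(v0, W, c)                          # data phase
        h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]  # sample hidden
        v1 = visible_probs(h0, W, b)                          # one Gibbs step
        ph1 = hidden_probs(v1, W, c)                          # model phase
        for i in range(n_vis):
            db[i] += v0[i] - v1[i]
            for j in range(n_hid):
                # <v_i h_j>_data - <v_i h_j>_model, model term approximated
                dW[i][j] += v0[i] * ph0[j] - v1[i] * ph1[j]
        for j in range(n_hid):
            dc[j] += ph0[j] - ph1[j]
    scale = lr / len(batch)
    for i in range(n_vis):
        b[i] += scale * db[i]
        for j in range(n_hid):
            W[i][j] += scale * dW[i][j]
    for j in range(n_hid):
        c[j] += scale * dc[j]


def pretrain_stack(data, hidden_sizes, epochs, lr, rng):
    """Greedy layer-wise pre-training; returns one (W, c) pair per hidden layer."""
    params = []
    inputs = data
    n_vis = len(data[0])
    for n_hid in hidden_sizes:
        W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
        b = [0.0] * n_vis
        c = [0.0] * n_hid
        for _ in range(epochs):
            cd1_update(inputs, W, b, c, lr, rng)
        params.append((W, c))
        # The hidden activations of this machine feed the next one.
        inputs = [hidden_probs(v, W, c) for v in inputs]
        n_vis = n_hid
    return params
```

Each returned `(W, c)` pair would serve as the initial weight matrix and bias of one hidden layer of the deep network; the output layer is not pre-trained in this sketch and would need a separate initialization.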

Fig. 3 is a schematic diagram of a modeling method for an acoustic model for speech recognition according to an embodiment of the present invention. The method includes: Step 1, establishing an initial model. Specifically, a hidden Markov-Gaussian mixture (HMM-GMM) model is trained with training data. The modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the expectation-maximization (EM) algorithm;

Step 2, obtaining frame-level state information for the speech features of the training data. Specifically, based on the HMM-GMM model, forced alignment is performed on the triphone states of the speech features of the training data to obtain the frame-level state information of the speech features;

Step 3, initializing the weights of each hidden layer of the deep neural network. Specifically, the deep neural network that serves as the acoustic model is pre-trained to obtain parameters for initializing the weights of each hidden layer of the deep network;

Step 4, updating the weights of each hidden layer of the deep neural network. Specifically, based on the triphone states of the speech features of the training data, the deep neural network is trained with the error back-propagation algorithm to update the weights of each hidden layer.
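The forced alignment of step 2 can be illustrated with a small Viterbi search. The sketch below rests on assumptions that the text does not specify: the transcript of an utterance has already been expanded into a left-to-right sequence of clustered triphone states, each state may only repeat or advance to the next one, a single pair of self-loop and forward log-probabilities is shared by all states, and the per-frame log-likelihoods come from the HMM-GMM model. The function name `forced_align` and its parameters are hypothetical.

```python
import math


def forced_align(loglik, state_ids, log_self, log_next):
    """Return the most likely state ID for every frame.

    loglik[t][k] is the log-likelihood of frame t under the k-th state of the
    utterance's state sequence; state_ids[k] is that state's clustered ID.
    """
    T, K = len(loglik), len(state_ids)
    if T < K:
        raise ValueError("fewer frames than states: no valid alignment")
    neg_inf = float("-inf")
    score = [[neg_inf] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    score[0][0] = loglik[0][0]          # the path must start in the first state
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1][k] + log_self
            move = score[t - 1][k - 1] + log_next if k > 0 else neg_inf
            if move > stay:
                score[t][k] = move + loglik[t][k]
                back[t][k] = k - 1
            else:
                score[t][k] = stay + loglik[t][k]
                back[t][k] = k
    # The path must end in the last state; trace the best predecessors back.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [state_ids[k] for k in path]


# Four frames, two states: the first two frames fit state 7, the last two state 12.
frames = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
labels = forced_align(frames, [7, 12], math.log(0.5), math.log(0.5))
```

The returned list gives one state label per frame, which is the frame-level state information that step 4 uses as its training target.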

It should be noted that the hidden Markov-Gaussian mixture HMM-GMM model may also be written as the HMM/GMM model.

The pre-training in step 3 can be regarded as unsupervised training. The training in step 4 can be regarded as supervised training.
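The supervised back-propagation training of step 4 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes sigmoid hidden layers, a softmax output layer with one unit per clustered triphone state, the cross-entropy objective mentioned earlier, and plain stochastic gradient descent on a single frame with a hypothetical learning rate `lr`. The frame label would come from the forced alignment of step 2, and the hidden-layer weights would start from the pre-trained values of step 3.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(z):
    m = max(z)                              # subtract the max for stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def affine(x, W, b):
    """W is n_in x n_out; returns the pre-activations of one layer."""
    return [b[j] + sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(b))]


def forward(x, layers):
    """layers is a list of (W, b); returns the activations of every layer."""
    acts = [x]
    for W, b in layers[:-1]:
        acts.append([sigmoid(z) for z in affine(acts[-1], W, b)])
    W, b = layers[-1]
    acts.append(softmax(affine(acts[-1], W, b)))
    return acts


def backprop_step(x, label, layers, lr):
    """One SGD step on one frame; returns the cross-entropy before the update."""
    acts = forward(x, layers)
    probs = acts[-1]
    loss = -math.log(probs[label])
    # Gradient of cross-entropy with respect to the softmax pre-activations.
    delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    for l in range(len(layers) - 1, -1, -1):
        W, b = layers[l]
        a_in = acts[l]
        prev = None
        if l > 0:
            # Propagate the error through W before W is modified.
            prev = [sum(W[i][j] * delta[j] for j in range(len(delta)))
                    * a_in[i] * (1.0 - a_in[i]) for i in range(len(a_in))]
        for i in range(len(a_in)):
            for j in range(len(delta)):
                W[i][j] -= lr * a_in[i] * delta[j]
        for j in range(len(delta)):
            b[j] -= lr * delta[j]
        if l > 0:
            delta = prev
    return loss
```

In practice the step would be applied over mini-batches of aligned frames for several passes through the training data; the single-frame form is used here only to keep the gradient computation easy to follow.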

In addition, the pre-training in step 3 and step 2 can be performed at the same time.

When the HMM-GMM model is used as the acoustic model for speech recognition, the posterior probabilities that the deep neural network generates for the speech features are converted into likelihoods by Bayes' formula and sent to the decoder for decoding. The text sequence obtained after decoding is taken as the recognized speech content. The quality of the recognition can be evaluated from the difference between the recognized content and what was actually said in the original speech. This in turn allows the performance of the deep neural network serving as the acoustic model in the speech recognition system to be evaluated. If necessary, the network can be retrained, and the state transition probabilities in the HMM-GMM model can even be redesigned.
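The conversion by Bayes' formula can be written as p(x|s) = p(s|x) p(x) / p(s). Because p(x) is the same for every state at a given frame, a decoder only needs the scaled likelihood p(s|x) / p(s). The sketch below assumes, as is common practice but not stated in the text, that the state priors p(s) are estimated from the relative frequency of each state in the forced alignment. The function names and the flooring constants are hypothetical.

```python
import math
from collections import Counter


def state_log_priors(alignment, num_states, floor=1e-8):
    """Estimate log p(s) from the frame-level state labels of the training data."""
    counts = Counter(alignment)
    total = len(alignment)
    return [math.log(max(counts.get(s, 0) / total, floor))
            for s in range(num_states)]


def scaled_log_likelihoods(posteriors, log_priors, floor=1e-30):
    """Convert one frame's posteriors p(s|x) into log(p(s|x) / p(s))."""
    return [math.log(max(p, floor)) - lp
            for p, lp in zip(posteriors, log_priors)]


# State 0 covers three of four aligned frames, state 1 covers one.
log_priors = state_log_priors([0, 0, 0, 1], num_states=2)
scores = scaled_log_likelihoods([0.5, 0.5], log_priors)
```

With equal posteriors, the rarer state receives the higher scaled likelihood, which is the intended effect of dividing by the prior.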

Fig. 4 is a schematic diagram of a modeling system for an acoustic model for speech recognition according to an embodiment of the present invention. The modeling system includes: a first module, configured to train a hidden Markov-Gaussian mixture (HMM-GMM) model with training data, where the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the expectation-maximization (EM) algorithm; a second module, configured to perform forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain frame-level state information of the speech features; a third module, configured to pre-train the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and a fourth module, configured to train the deep neural network with the error back-propagation algorithm, based on the triphone states of the speech features of the training data, to update the weights of each hidden layer.

The embodiment of the present invention uses a deep neural network instead of a Gaussian mixture model for acoustic modeling. The modeling uses triphone states, which are context-dependent. Unlike the Gaussian mixture model, which requires specific assumptions about speech features and their distribution, the deep neural network directly outputs the posterior probabilities for the speech features. The triphone states take full account of the contextual dependence of language, which makes the modeling units finer, and the multiple hidden layers more closely resemble the principle of the human speech perception system, which facilitates the extraction of high-order feature information. The embodiment of the present invention initializes the weights of each hidden layer of the network with a restricted Boltzmann machine, and these weights can subsequently be updated by the error back-propagation algorithm. This effectively alleviates the risk of the network falling into a local extremum during training and further improves the modeling accuracy of the acoustic model.

Those skilled in the art should further appreciate that the example modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each particular application, but such implementations should not be regarded as going beyond the scope of the present application.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

It should be pointed out that the above are only preferred embodiments of the present invention and are not intended to limit its scope of implementation. A person with the relevant professional knowledge can implement the present invention from the above embodiments, so any change, modification, or improvement made within the spirit and principles of the present invention is covered by the patent scope of the present invention. That is, the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be modified or replaced with equivalents without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A modeling method for an acoustic model for speech recognition, characterized in that the method comprises:

training a hidden Markov-Gaussian mixture (HMM-GMM) model with training data, wherein the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, the HMM-GMM model is trained with the expectation-maximization (EM) algorithm, and the state transition probabilities of the triphone states are obtained at the same time;

performing forced alignment on the speech features of the training data based on the HMM-GMM model to obtain frame-level triphone state information of the speech features;

pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network;

training the deep neural network with the error back-propagation algorithm, based on the frame-level state information of the speech features of the training data, to update the weights of each hidden layer.

2. The modeling method according to claim 1, characterized in that performing forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain the frame-level state information of the speech features specifically comprises: based on the HMM-GMM model, associating the speech features of the training data with their most probable triphone states to obtain the frame-level state information of the speech features.

3. The modeling method according to claim 1, characterized in that pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network specifically comprises: training restricted Boltzmann machines layer by layer on the training data until convergence, and initializing the weights of each hidden layer of the deep network with the obtained parameters.

4. A modeling system for an acoustic model for speech recognition, characterized in that the modeling system comprises:

a first module, configured to train a hidden Markov-Gaussian mixture (HMM-GMM) model with training data, wherein the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the expectation-maximization (EM) algorithm;

a second module, configured to perform forced alignment on the speech features of the training data based on the HMM-GMM model to obtain frame-level triphone state information of the speech features;

a third module, configured to pre-train the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network;

a fourth module, configured to train the deep neural network with the error back-propagation algorithm, based on the frame-level state information of the speech features of the training data, to update the weights of each hidden layer.

5. The modeling system according to claim 4, characterized in that the second module performing forced alignment of the triphone states of the speech features of the training data based on the HMM-GMM model to obtain the frame-level state information of the speech features specifically comprises: the second module, based on the HMM-GMM model, associating the speech features of the training data with their most probable triphone states to obtain the frame-level state information of the speech features.

6. The modeling system according to claim 4, characterized in that the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network specifically comprises: the third module training restricted Boltzmann machines layer by layer on the training data until convergence, and initializing the weights of each hidden layer of the deep network with the obtained parameters.