CN103117060B - Modeling method and modeling system of an acoustic model for speech recognition - Google Patents
Modeling method and modeling system of an acoustic model for speech recognition
- Publication number
- CN103117060B (application number CN201310020010.7A)
- Authority
- CN
- China
- Prior art keywords
- training data
- phonetic feature
- modeling
- hmm
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention relates to a modeling method of an acoustic model for speech recognition and to a speech recognition system. The method includes: training an initial model whose modeling units are triphone states clustered by a phoneme decision tree, the model also providing state transition probabilities; performing forced alignment of the training-data speech features against the triphone states based on the initial model, to obtain frame-level state information; pre-training a deep neural network to obtain initial weights for each hidden layer; and training the initialized network with the error back-propagation algorithm, based on the obtained frame-level state information, to update the weights. The invention uses context-dependent triphone states as modeling units and models them with a deep neural network. The weights of each hidden layer are initialized with the restricted Boltzmann machine algorithm and can subsequently be updated by the error back-propagation algorithm, which effectively reduces the risk of the network falling into a local extremum during training and further improves the modeling accuracy of the acoustic model.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to a modeling method and a modeling system of an acoustic model for speech recognition.
Background
The current mainstream framework for speech recognition is based on statistical pattern recognition. A typical speech recognition system, shown in Figure 1, includes a speech acquisition and front-end processing module, a feature extraction module, an acoustic model module, a language model module, and a decoder module. The basic flow is as follows: a speech acquisition device collects the speaker's voice; after front-end processing, features are extracted; the extracted feature sequence, such as MFCC or PLP, is passed through the acoustic model to obtain its observation probabilities, which are combined with the language model probabilities and sent to the decoder to obtain the most likely text sequence. Acoustic modeling is based on the hidden Markov framework and uses a Gaussian mixture model to model the probability distribution of the speech features. The Gaussian mixture model makes some inappropriate assumptions about the speech features and their distribution, for example that adjacent speech features are linearly independent and that the observation probabilities follow a Gaussian mixture distribution. In addition, the objective function used to train the Gaussian mixture model parameters maximizes the likelihood of the observed features, whereas decoding uses the maximum a posteriori criterion, so the probabilistic model is inconsistent between training and decoding. The traditional acoustic model therefore has limited modeling accuracy, which leads to poor speech recognition performance.
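The inconsistency between training and decoding noted above can be written out with the usual decoding rule. The notation here is an assumption for illustration and does not appear in the patent: X is the feature sequence, W a candidate text sequence, p(X | W) the acoustic model score, P(W) the language model probability, and \theta the Gaussian mixture parameters.

```latex
% Decoding: maximum a posteriori criterion
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} p(X \mid W)\, P(W)

% GMM training: maximum likelihood on the reference transcript W_{\mathrm{ref}}
\hat{\theta} = \arg\max_{\theta} p_{\theta}(X \mid W_{\mathrm{ref}})
```

Training optimizes the likelihood of the observations alone, while decoding ranks hypotheses by posterior probability, which is the mismatch the text describes.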
Summary of the Invention
In view of the above problems, embodiments of the present invention provide a modeling method and a modeling system of an acoustic model for speech recognition.
In a first aspect, an embodiment of the present invention provides a modeling method of an acoustic model for speech recognition. The method includes: training a hidden Markov model-Gaussian mixture model (HMM-GMM) with training data, where the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the expectation-maximization (EM) algorithm; based on the HMM-GMM model, performing forced alignment of the training-data speech features against the triphone states to obtain frame-level state information for the speech features; pre-training the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each of its hidden layers; and training the deep neural network with the error back-propagation algorithm, based on the triphone states of the training-data speech features, to update the weights of each hidden layer.
Preferably, performing forced alignment based on the HMM-GMM model to obtain the frame-level state information specifically means: based on the HMM-GMM model, matching each speech feature of the training data to its most probable triphone state to obtain the frame-level state information of the speech features.
Preferably, pre-training the deep neural network that serves as the acoustic model specifically means: training restricted Boltzmann machines layer by layer on the training data until convergence, and initializing the weights of each hidden layer of the deep network with the resulting parameters.
In a second aspect, an embodiment of the present invention provides a modeling system of an acoustic model for speech recognition. The system includes: a first module for training an HMM-GMM model with training data, where the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the EM algorithm; a second module for performing, based on the HMM-GMM model, forced alignment of the training-data speech features against the triphone states to obtain frame-level state information for the speech features; a third module for pre-training the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each of its hidden layers; and a fourth module for training the deep neural network with the error back-propagation algorithm, based on the triphone states of the training-data speech features, to update the weights of each hidden layer.
Preferably, the forced alignment performed by the second module specifically means: the second module, based on the HMM-GMM model, matches each speech feature of the training data to its most probable triphone state to obtain the frame-level state information of the speech features.
Preferably, the pre-training performed by the third module specifically means: the third module trains restricted Boltzmann machines layer by layer on the training data until convergence and initializes the weights of each hidden layer of the deep network with the resulting parameters.
Embodiments of the present invention use triphone states and deep neural network modeling. The weights of each hidden layer are initialized with the restricted Boltzmann machine algorithm and can subsequently be updated by the error back-propagation algorithm, which effectively reduces the risk of the network falling into a local extremum during training and further improves the modeling accuracy of the acoustic model.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Figure 1 is a schematic diagram of an existing speech recognition system;
Figure 2 is a block diagram of a speech recognition system based on a context-dependent deep neural network according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the modeling method of an acoustic model for speech recognition according to an embodiment of the present invention;
Figure 4 is a schematic diagram of the modeling system of an acoustic model for speech recognition according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention are described in further detail below with reference to the drawings and embodiments.
Because the Gaussian mixture model requires inappropriate assumptions about the speech features and their probability distribution, embodiments of the present invention replace the Gaussian mixture model with a context-dependent deep neural network for acoustic modeling. The deep neural network contains multiple hidden layers, and its modeling units are the context-dependent triphone states clustered by a phoneme decision tree. The basic block diagram of the whole system is shown in Figure 2.
The deep neural network is trained with the minimum cross-entropy criterion as the objective function. Because the network has multiple hidden layers, its error function has many local extrema, so training easily falls into a local extremum and converges prematurely. To address this, the neural computing field has proposed initializing the weight parameters through neural network pre-training and then training the network parameters with the conventional error back-propagation algorithm. The pre-training algorithm uses the restricted Boltzmann machine, a bidirectional graphical model with one visible layer and one hidden layer, in which units within the same layer are not connected to each other and units in different layers are densely connected. The model defines the joint distribution of the visible-layer and hidden-layer variables through an energy function:
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{v',h'} e^{-E(v',h')}}$$
where v denotes the visible-layer variables, h the hidden-layer variables, E(v,h) the energy function, and p(v,h) their joint probability. Training maximizes the likelihood p(v) of the observed features, which gives the following weight update:
$$\Delta w_{ij} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$
$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}$$
where $w_{ij}$ is the connection weight, t is the iteration index, and $\langle \cdot \rangle$ denotes the mean of the bracketed quantity.
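The weight update above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes binary (Bernoulli) visible and hidden units, approximates the model expectation with a one-step reconstruction (contrastive divergence, CD-1), adds a learning rate `lr` that the formula in the text does not show, and leaves the biases fixed. All function and variable names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b_h):
    # p(h_j = 1 | v) for a Bernoulli-Bernoulli RBM (assumed unit type)
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]


def visible_probs(h, W, b_v):
    # p(v_i = 1 | h)
    return [sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_v))]


def sample(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(batch, W, b_v, b_h, lr, rng):
    """One step of w_ij += lr * (<v_i h_j>_data - <v_i h_j>_recon), in place.

    <v_i h_j>_recon is the CD-1 stand-in for <v_i h_j>_model (an assumption).
    """
    n_v, n_h = len(b_v), len(b_h)
    pos = [[0.0] * n_h for _ in range(n_v)]
    neg = [[0.0] * n_h for _ in range(n_v)]
    for v0 in batch:
        ph0 = hidden_probs(v0, W, b_h)
        h0 = sample(ph0, rng)
        v1 = visible_probs(h0, W, b_v)   # reconstruction, kept as probabilities
        ph1 = hidden_probs(v1, W, b_h)
        for i in range(n_v):
            for j in range(n_h):
                pos[i][j] += v0[i] * ph0[j]
                neg[i][j] += v1[i] * ph1[j]
    scale = lr / len(batch)              # average over the batch
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += scale * (pos[i][j] - neg[i][j])
    return W


# Toy usage: 3 visible units, 2 hidden units, weights starting at zero.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(3)]
batch = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
cd1_update(batch, W, [0.0] * 3, [0.0] * 2, lr=0.1, rng=rng)
```

Real speech features such as MFCC are continuous, so a practical first layer would likely use Gaussian visible units rather than the binary units assumed here.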
Training restricted Boltzmann machines layer by layer and using their parameters to initialize the deep neural network places the initial weights at a relatively good starting point in weight space, which to some extent reduces the risk of the network falling into a local extremum during training. In addition, the triphone states clustered by the phoneme decision tree are used as the teacher signal (the training target) of the neural network. Because these states carry the contextual relationships between phonemes, the acoustic model becomes finer and more accurate.
Figure 3 is a schematic diagram of the modeling method of an acoustic model for speech recognition according to an embodiment of the present invention. The method includes: Step 1, building the initial model. Specifically, an HMM-GMM model is trained with training data; the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the EM algorithm;
Step 2, obtaining frame-level state information for the speech features of the training data. Specifically, based on the HMM-GMM model, forced alignment of the training-data speech features against the triphone states is performed to obtain the frame-level state information of the speech features;
Preferably, performing forced alignment based on the HMM-GMM model to obtain the frame-level state information specifically means: based on the HMM-GMM model, matching each speech feature of the training data to its most probable triphone state to obtain the frame-level state information of the speech features.
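Step 2 can be sketched as a Viterbi search over the state sequence implied by the transcript. The sketch below is illustrative only and makes several assumptions the patent does not state: the states form a strict left-to-right chain with self-loops and single forward transitions, the alignment must start in the first state and end in the last, and the per-frame log-likelihoods `log_obs` have already been computed by the HMM-GMM model. The function name and arguments are hypothetical.

```python
import math


def forced_align(log_obs, log_self, log_next):
    """Return, for each frame, the index of its most probable state.

    log_obs[t][k]: log-likelihood of frame t under the k-th state of the
                   transcript's state sequence (assumed given by the GMMs).
    log_self[k]:   log probability of staying in state k.
    log_next[k]:   log probability of moving from state k to state k + 1.
    """
    T, K = len(log_obs), len(log_obs[0])
    if T < K:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    score = [[neg_inf] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    score[0][0] = log_obs[0][0]                      # start in the first state
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1][k] + log_self[k]
            move = score[t - 1][k - 1] + log_next[k - 1] if k > 0 else neg_inf
            if stay >= move:
                score[t][k], back[t][k] = stay, k
            else:
                score[t][k], back[t][k] = move, k - 1
            score[t][k] += log_obs[t][k]
    path = [K - 1]                                   # end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy usage: 4 frames, 2 states; early frames fit state 0, late frames state 1.
half = math.log(0.5)
log_obs = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
alignment = forced_align(log_obs, [half, half], [half])
```

The returned list plays the role of the frame-level state information: one state label per frame, which later serves as the training target for the network.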
Step 3, initializing the weights of each hidden layer of the deep neural network. Specifically, the deep neural network that serves as the acoustic model is pre-trained to obtain parameters for initializing the weights of each of its hidden layers;
Step 4, updating the weights of each hidden layer of the deep neural network. Specifically, the deep neural network is trained with the error back-propagation algorithm, based on the triphone states of the training-data speech features, and the weights of each hidden layer are updated.
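Step 4 can be illustrated with a single stochastic gradient step under the cross-entropy criterion. This is a hedged sketch, not the patented procedure: it assumes one sigmoid hidden layer, a softmax output over triphone states, plain per-frame gradient descent with a learning rate `lr`, and small hand-picked initial weights standing in for the pre-trained ones. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def forward(x, W1, b1, W2, b2):
    h = [sigmoid(b1[j] + sum(x[i] * W1[i][j] for i in range(len(x))))
         for j in range(len(b1))]
    y = softmax([b2[k] + sum(h[j] * W2[j][k] for j in range(len(h)))
                 for k in range(len(b2))])
    return h, y


def backprop_step(x, state_id, W1, b1, W2, b2, lr):
    """One in-place gradient step; returns the cross-entropy before the update.

    state_id is the frame's aligned triphone state (the teacher signal).
    """
    h, y = forward(x, W1, b1, W2, b2)
    loss = -math.log(y[state_id])
    # Output error for softmax + cross-entropy: y - one_hot(state_id)
    d_out = [y[k] - (1.0 if k == state_id else 0.0) for k in range(len(y))]
    # Hidden error, computed before W2 is modified
    d_hid = [h[j] * (1.0 - h[j]) * sum(W2[j][k] * d_out[k] for k in range(len(y)))
             for j in range(len(h))]
    for j in range(len(h)):
        for k in range(len(y)):
            W2[j][k] -= lr * h[j] * d_out[k]
    for k in range(len(y)):
        b2[k] -= lr * d_out[k]
    for i in range(len(x)):
        for j in range(len(h)):
            W1[i][j] -= lr * x[i] * d_hid[j]
    for j in range(len(h)):
        b1[j] -= lr * d_hid[j]
    return loss


# Toy usage: 2 inputs, 2 hidden units, 3 output states, one frame labelled 2.
W1 = [[0.1, -0.1], [0.2, 0.0]]      # stand-in for pre-trained hidden weights
b1 = [0.0, 0.0]
W2 = [[0.0] * 3 for _ in range(2)]
b2 = [0.0] * 3
losses = [backprop_step([1.0, 0.5], 2, W1, b1, W2, b2, lr=0.5) for _ in range(20)]
```

With more hidden layers the same pattern repeats: the error is propagated back one layer at a time, and only the output layer lacks a pre-trained initialization.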
Preferably, pre-training the deep neural network that serves as the acoustic model specifically means: training restricted Boltzmann machines layer by layer on the training data until convergence, and initializing the weights of each hidden layer of the deep network with the resulting parameters.
It should be noted that the hidden Markov-Gaussian mixture (HMM-GMM) model may also be written as the HMM/GMM model.
The pre-training in step 3 can be regarded as unsupervised training. The training in step 4 can be regarded as supervised training.
In addition, the pre-training in step 3 and step 2 may be performed at the same time.
When the deep neural network, trained on top of the HMM-GMM model, is used as the acoustic model for speech recognition, the posterior probabilities that the network generates for the speech features are converted into likelihoods by Bayes' rule and sent to the decoder. The text sequence obtained after decoding is taken as the recognized speech content. The recognition performance can be evaluated from the difference between the recognized content and the actual original speech. From this, the performance of the deep neural network serving as the acoustic model in the speech recognition system can be assessed; if necessary, the network can be retrained, and the state transition probabilities in the HMM-GMM model can even be redesigned.
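The conversion from posteriors to likelihoods follows from Bayes' rule, p(x | s) = p(s | x) p(x) / p(s). Since p(x) is the same for every state within a frame, dividing the posterior by the state prior gives a scaled likelihood that the decoder can use. The sketch below assumes, as the patent does not specify, that the priors are estimated as relative state frequencies in the forced alignment; the function names and the `floor` guard are hypothetical.

```python
import math
from collections import Counter


def state_priors(alignment, num_states):
    # Relative frequency of each state in the frame-level alignment (assumption)
    counts = Counter(alignment)
    total = len(alignment)
    return [counts[s] / total for s in range(num_states)]


def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    # log p(x | s) up to a per-frame constant: log p(s | x) - log p(s)
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]


# Toy usage: two states, state 0 three times as frequent as state 1.
priors = state_priors([0, 0, 0, 1], num_states=2)
scores = scaled_log_likelihoods([0.5, 0.5], priors)
```

In this toy case both states have the same posterior, yet the rarer state receives the higher scaled likelihood, which is the intended effect of removing the prior.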
Figure 4 is a schematic diagram of the modeling system of an acoustic model for speech recognition according to an embodiment of the present invention. The modeling system includes: a first module for training an HMM-GMM model with training data, where the modeling units of the HMM-GMM model are the triphone states obtained by clustering the speech features of the training data with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the triphone states through the EM algorithm; a second module for performing, based on the HMM-GMM model, forced alignment of the training-data speech features against the triphone states to obtain frame-level state information for the speech features; a third module for pre-training the deep neural network that serves as the acoustic model to obtain parameters for initializing the weights of each of its hidden layers; and a fourth module for training the deep neural network with the error back-propagation algorithm, based on the triphone states of the training-data speech features, to update the weights of each hidden layer.
Preferably, the forced alignment performed by the second module specifically means: the second module, based on the HMM-GMM model, matches each speech feature of the training data to its most probable triphone state to obtain the frame-level state information of the speech features.
Preferably, the pre-training performed by the third module specifically means: the third module trains restricted Boltzmann machines layer by layer on the training data until convergence and initializes the weights of each hidden layer of the deep network with the resulting parameters.
Embodiments of the present invention use a deep neural network instead of a Gaussian mixture model for acoustic modeling. The modeling uses context-dependent triphone states and, unlike the Gaussian mixture model, which requires specific assumptions about the speech features and their distribution, directly outputs posterior probabilities for the speech features. The triphone states take full account of the contextual dependence of speech, which makes the modeling units finer, and the multiple hidden layers more closely resemble the principles of the human speech perception system, which facilitates the extraction of high-order feature information. The weights of each hidden layer are initialized with the restricted Boltzmann machine algorithm and can subsequently be updated by the error back-propagation algorithm, which effectively reduces the risk of the network falling into a local extremum during training and further improves the modeling accuracy of the acoustic model.
Those skilled in the art will further appreciate that the example modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
It should be pointed out that the above are only preferred embodiments of the present invention and are not intended to limit its scope of implementation. A person skilled in the art with the relevant professional knowledge can implement the present invention from the above examples, so any change, modification, or improvement made within the spirit and principles of the present invention is covered by its patent scope. That is, the above embodiments serve only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or replaced with equivalents without departing from their spirit and scope.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310020010.7A CN103117060B (en) | 2013-01-18 | 2013-01-18 | Modeling method and modeling system of an acoustic model for speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310020010.7A CN103117060B (en) | 2013-01-18 | 2013-01-18 | Modeling method and modeling system of an acoustic model for speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103117060A (en) | 2013-05-22 |
CN103117060B (en) | 2015-10-28 |
Family
ID=48415418
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310020010.7A, Expired - Fee Related, CN103117060B (en) | 2013-01-18 | 2013-01-18 | Modeling method and modeling system of an acoustic model for speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103117060B (en) |
Families Citing this family (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6164639B2 (en) | 2013-05-23 | 2017-07-19 | 国立研究開発法人情報通信研究機構 | Deep neural network learning method and computer program |
CN103345656B (en) * | 2013-07-17 | 2016-01-20 | 中国科学院自动化研究所 | A kind of data identification method based on multitask deep neural network and device |
CN104347066B (en) * | 2013-08-09 | 2019-11-12 | 上海掌门科技有限公司 | Baby cry recognition method and system based on deep neural network |
CN104376842A (en) * | 2013-08-12 | 2015-02-25 | 清华大学 | Neural network language model training method and device and voice recognition method |
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN104575497B (en) * | 2013-10-28 | 2017-10-03 | 中国科学院声学研究所 | A kind of acoustic model method for building up and the tone decoding method based on the model |
US9613619B2 (en) * | 2013-10-30 | 2017-04-04 | Genesys Telecommunications Laboratories, Inc. | Predicting recognition quality of a phrase in automatic speech recognition systems |
JP5777178B2 (en) * | 2013-11-27 | 2015-09-09 | 国立研究開発法人情報通信研究機構 | Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for constructing a deep neural network, and statistical acoustic model adaptation Computer programs |
CN103680496B (en) * | 2013-12-19 | 2016-08-10 | 百度在线网络技术(北京)有限公司 | Acoustic training model method based on deep-neural-network, main frame and system |
CN103839211A (en) * | 2014-03-23 | 2014-06-04 | 合肥新涛信息科技有限公司 | Medical history transferring system based on voice recognition |
CN103839546A (en) * | 2014-03-26 | 2014-06-04 | 合肥新涛信息科技有限公司 | Voice recognition system based on Yangze river and Huai river language family |
CN104036774B (en) * | 2014-06-20 | 2018-03-06 | 国家计算机网络与信息安全管理中心 | Tibetan dialect recognition methods and system |
CN105960672B (en) * | 2014-09-09 | 2019-11-26 | 微软技术许可有限责任公司 | Variable component deep neural network for Robust speech recognition |
US9324320B1 (en) * | 2014-10-02 | 2016-04-26 | Microsoft Technology Licensing, Llc | Neural network-based speech processing |
CN106157953B (en) * | 2015-04-16 | 2020-02-07 | 科大讯飞股份有限公司 | Continuous speech recognition method and system |
US10606651B2 (en) | 2015-04-17 | 2020-03-31 | Microsoft Technology Licensing, Llc | Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit |
CN106297773B (en) * | 2015-05-29 | 2019-11-19 | 中国科学院声学研究所 | A neural network acoustic model training method |
US10452995B2 (en) | 2015-06-29 | 2019-10-22 | Microsoft Technology Licensing, Llc | Machine learning classification on hardware accelerators with stacked memory |
US10540588B2 (en) | 2015-06-29 | 2020-01-21 | Microsoft Technology Licensing, Llc | Deep neural network processing on hardware accelerators with stacked memory |
CN206097455U (en) * | 2015-08-20 | 2017-04-12 | 漳州凯邦电子有限公司 | Speech recognition controlgear |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
KR102313028B1 (en) * | 2015-10-29 | 2021-10-13 | 삼성에스디에스 주식회사 | System and method for voice recognition |
CN106940998B (en) * | 2015-12-31 | 2021-04-16 | 阿里巴巴集团控股有限公司 | Execution method and device for setting operation |
CN105654955B (en) * | 2016-03-18 | 2019-11-12 | 华为技术有限公司 | Speech recognition method and device |
CN105761720B (en) * | 2016-04-19 | 2020-01-07 | 北京地平线机器人技术研发有限公司 | Interactive system and method based on voice attribute classification |
CN106782511A (en) * | 2016-12-22 | 2017-05-31 | 太原理工大学 | Speech recognition method using a rectified linear deep autoencoder network |
CN106782504B (en) * | 2016-12-29 | 2019-01-22 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN108346423B (en) * | 2017-01-23 | 2021-08-20 | 北京搜狗科技发展有限公司 | Method and device for processing speech synthesis model |
CN106816147A (en) * | 2017-01-25 | 2017-06-09 | 上海交通大学 | Speech recognition system based on binary neural network acoustic model |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | Voice endpoint detection method and speech recognition method |
KR102339716B1 (en) * | 2017-06-30 | 2021-12-14 | 삼성에스디에스 주식회사 | Method for recognizing speech and Apparatus thereof |
CN107680582B (en) | 2017-07-28 | 2021-03-26 | 平安科技(深圳)有限公司 | Acoustic model training method, voice recognition method, device, equipment and medium |
CN109741735B (en) * | 2017-10-30 | 2023-09-01 | 阿里巴巴集团控股有限公司 | Modeling method, acoustic model acquisition method and acoustic model acquisition device |
CN108111335B (en) * | 2017-12-04 | 2019-07-23 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108109615A (en) * | 2017-12-21 | 2018-06-01 | 内蒙古工业大学 | Construction and application method of a DNN-based Mongolian acoustic model |
CN109975762B (en) * | 2017-12-28 | 2021-05-18 | 中国科学院声学研究所 | An underwater sound source localization method |
CN110070855B (en) * | 2018-01-23 | 2021-07-23 | 中国科学院声学研究所 | A speech recognition system and method based on transfer neural network acoustic model |
CN108648747B (en) * | 2018-03-21 | 2020-06-02 | 清华大学 | Language identification system |
CN109326277B (en) * | 2018-12-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109545201B (en) * | 2018-12-15 | 2023-06-06 | 中国人民解放军战略支援部队信息工程大学 | Construction Method of Acoustic Model Based on Deep Mixed Factor Analysis |
CN112259089B (en) * | 2019-07-04 | 2024-07-02 | 阿里巴巴集团控股有限公司 | Speech recognition method and device |
CN110459216B (en) * | 2019-08-14 | 2021-11-30 | 桂林电子科技大学 | Canteen card swiping device with voice recognition function and using method |
CN113450786B (en) * | 2020-03-25 | 2024-07-26 | 阿里巴巴集团控股有限公司 | Network model obtaining method, information processing method, device and electronic equipment |
CN114387958A (en) * | 2020-10-19 | 2022-04-22 | 中国移动通信有限公司研究院 | Speech recognition method, device and terminal |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5317673A (en) * | 1992-06-22 | 1994-05-31 | Sri International | Method and apparatus for context-dependent estimation of multiple probability distributions of phonetic classes with multilayer perceptrons in a speech recognition system |
CN1427368A (en) * | 2001-12-19 | 2003-07-02 | 中国科学院自动化研究所 | Speaker-independent speech recognition method for palmtop computers |
CN1588536A (en) * | 2004-09-29 | 2005-03-02 | 上海交通大学 | State structure adjustment method in speech recognition |
CN101740024A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Automatic spoken-language fluency evaluation method based on generalized fluency |
CN102411931A (en) * | 2010-09-15 | 2012-04-11 | 微软公司 | Deep belief network for large vocabulary continuous speech recognition |
CN102693723A (en) * | 2012-04-01 | 2012-09-26 | 北京安慧音通科技有限责任公司 | Method and device for recognizing speaker-independent isolated word based on subspace |
2013-01-18 | CN | CN201310020010.7A | CN103117060B | not active (Expired - Fee Related)
Non-Patent Citations (4)
Title |
---|
Hidden Markov model/Gaussian mixture models (HMM/GMM) based voice command system: A way to improve the control of remotely operated robot arm TR45;Ibrahim M. M. El-emary, Mohamed Fezari and Hamza Attoui;《Scientific Research and Essays》;20110118;Vol. 6 (No. 2);341-350 * |
IMPROVED HYBRID MODEL OF HMM/GMM FOR SPEECH RECOGNITION;Poonam Bansal, Anuj Kant, Sumit Kumar, Akash Sharda, Shitij Gupt;《International Conference "Intelligent Information and Engineering Systems" INFOS 2008, Varna, Bulgaria, June-July 2008》;20080731;69-75 * |
Decision-tree-based acoustic context modeling method in discriminative model combination;Huang Hao, Li Binghu, Wushouer Silamu;《Acta Automatica Sinica》;20120930;Vol. 38 (No. 9);full text * |
Research on a prosody-dependent Chinese speech recognition system;Ni Chongjia, Liu Wenju, Xu Bo;《Application Research of Computers》;20110831;Vol. 28 (No. 8);full text * |
Also Published As
Publication number | Publication date |
---|---|
CN103117060A (en) | 2013-05-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103117060B (en) | Modeling method and modeling system for an acoustic model for speech recognition | |
EP3046053B1 (en) | Method and apparatus for training language model | |
CN110211574B (en) | A speech recognition model building method based on bottleneck features and multi-scale multi-head attention mechanism | |
CN108875807B (en) | An image description method based on multi-attention and multi-scale | |
CN110427846B (en) | Face recognition method for small unbalanced samples by using convolutional neural network | |
CN104751228B (en) | Construction method and system for the deep neural network of speech recognition | |
CN107767861B (en) | Voice awakening method and system and intelligent terminal | |
Ghoshal et al. | Multilingual training of deep neural networks | |
Trentin et al. | Robust combination of neural networks and hidden Markov models for speech recognition | |
CN106157953B (en) | Continuous speech recognition method and system | |
CN107293291B (en) | An end-to-end speech recognition method based on adaptive learning rate | |
Lee et al. | Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams | |
CN108172218B (en) | Voice modeling method and device | |
CN108305616A (en) | Audio scene recognition method and device based on long- and short-term feature extraction | |
CN106531157B (en) | Regularized Accent Adaptive Methods in Speech Recognition | |
JP2019159654A (en) | Time-series information learning system, method, and neural network model | |
CN104036774A (en) | Method and system for recognizing Tibetan dialects | |
CN105139864A (en) | Voice recognition method and voice recognition device | |
CN112074903A (en) | System and method for tone recognition in spoken language | |
CN111477247A (en) | GAN-based speech adversarial sample generation method | |
CN108109615A (en) | Construction and application method of a DNN-based Mongolian acoustic model | |
CN113408430B (en) | Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework | |
CN108175426B (en) | A Lie Detection Method Based on Deep Recursive Conditionally Restricted Boltzmann Machines | |
CN114783426B (en) | Speech recognition method, device, electronic device and storage medium | |
CN113889099B (en) | A speech recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20151028 |
CF01 | Termination of patent right due to non-payment of annual fee | ||