CN104700828B - Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle - Google Patents
Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle
- Publication number
- CN104700828B (application CN201510122982.6A)
- Authority
- CN
- China
- Prior art keywords
- time
- gate
- input
- output
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
A method for constructing a deep long short-term memory (LSTM) recurrent neural network acoustic model based on the selective attention principle. An attention gate unit is added to the deep LSTM recurrent neural network acoustic model to represent the instantaneous functional changes of auditory-cortex neurons. The attention gate differs from the other gate units in that the other gates correspond one-to-one with the time series, whereas the attention gate reflects the short-term plasticity effect and therefore occurs only at intervals in the time series. Trained on a large amount of speech data containing cross-talk noise, the resulting neural network acoustic model achieves robust feature extraction and robust acoustic modeling under cross-talk noise; suppressing the influence of non-target streams on feature extraction improves the robustness of the acoustic model. The method is broadly applicable to machine-learning fields that involve speech recognition, such as speaker recognition, keyword spotting, and human-computer interaction.
Description
Technical Field
The invention belongs to the field of audio technology, and in particular relates to a method for constructing a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle.
Background Art
With the rapid development of information technology, speech recognition technology is ready for large-scale commercialization. Current speech recognition mainly uses continuous speech recognition based on statistical models, whose goal is to find the word sequence with the highest probability for a given speech sequence; such a system typically comprises an acoustic model, a language model, and the corresponding search and decoding method. With the rapid progress of acoustic and language models, the performance of speech recognition systems has improved greatly in ideal acoustic environments. The existing Deep Neural Network-Hidden Markov Model (DNN-HMM) is fairly mature: machine learning methods automatically extract effective features and model the context information of multiple speech frames. However, each layer of such a model has millions of parameters, and each layer's input is the previous layer's output, so GPU hardware is needed to train the DNN acoustic model and training times are long; the high nonlinearity and parameter sharing also make DNNs difficult to adapt.
A Recurrent Neural Network (RNN) is a neural network with directed cycles between units that express the network's internal dynamic temporal behavior; RNNs are widely used in handwriting recognition and language modeling. Speech signals are complex time-varying signals with complex correlations across different time scales, so compared with deep neural networks, the recurrent connections of RNNs are better suited to such complex time-series data.
As a type of recurrent neural network, the Long Short-Term Memory (LSTM) model is better suited than a plain RNN to processing and predicting long sequences with lagged events of uncertain timing. The deep LSTM-RNN acoustic model proposed at the University of Toronto, which adds memory blocks, combines the multi-level representation ability of deep neural networks with the RNN's flexible use of long-span context, reducing the phoneme recognition error rate on the TIMIT corpus to 17.1%.
However, the gradient descent method used in recurrent neural networks suffers from the vanishing gradient problem: while the network's weights are being adjusted, the gradient dissipates layer by layer as the number of layers increases, so its effect on the weight updates becomes ever smaller. The two-layer deep LSTM-RNN acoustic model proposed by Google adds a linear Recurrent Projection Layer to the earlier deep LSTM-RNN model to address vanishing gradients. Comparative experiments show that the frame accuracy and convergence speed of a plain RNN are clearly inferior to those of the LSTM-RNN and the DNN. In terms of word error rate and convergence speed, the best DNN reached a word error rate of 11.3% after several weeks of training, whereas the two-layer deep LSTM-RNN model reached 10.9% after 48 hours of training and 10.7%/10.5% after 100/200 hours.
The Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) acoustic model proposed at the University of Munich defines mutually independent forward and backward layers in each recurrent layer of the network, uses multiple hidden layers to build higher-level representations of the input acoustic features, and performs supervised learning on noise and reverberation to achieve feature projection and enhancement. On the 2013 PASCAL CHiME data set, this method reduced the word error rate from a baseline of 55% to 22% over the SNR range [-6 dB, 9 dB].
However, the complexity of real acoustic environments still seriously degrades and interferes with the performance of continuous speech recognition systems. Even with today's mainstream DNN acoustic model methods, only about a 70% recognition rate is achieved on continuous speech recognition data sets recorded under complex conditions including noise, music, spontaneous speech, and repetitions; the noise resistance and robustness of the acoustic model in continuous speech recognition systems still need improvement.
With the rapid development of acoustic and language models, speech recognition performance has improved greatly in ideal acoustic environments; the existing DNN-HMM model is fairly mature, automatically extracting effective features and modeling multi-frame context through machine learning. Most recognition systems, however, remain very sensitive to changes in the acoustic environment, and in particular cannot meet practical performance requirements under cross-talk noise (two or more people speaking simultaneously). Compared with a deep neural network acoustic model, the directed cycles between units in a recurrent neural network acoustic model effectively describe the network's internal dynamic temporal behavior and better suit speech data with complex timing. The long short-term memory network in turn handles long sequences with lagged, uncertain timing better than a plain RNN, so acoustic models built on it for speech recognition can achieve better results.
The human brain exhibits selective attention when processing speech in complex scenes. The core principle is that the brain has the ability of auditory selective attention: a top-down control mechanism in the auditory cortex suppresses non-target streams and enhances the target stream. Studies have shown that during selective attention, the Short-Term Plasticity effect of the auditory cortex increases the ability to discriminate sounds; under intense attention, the primary auditory cortex can begin enhancing a sound target within 50 milliseconds.
Summary of the Invention
To overcome the above shortcomings of the prior art, the object of the present invention is to provide a method for constructing a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle. An attention gate unit is added to the deep LSTM recurrent neural network acoustic model to represent the instantaneous functional changes of auditory-cortex neurons. The attention gate differs from the other gate units in that the other gates correspond one-to-one with the time series, whereas the attention gate reflects the short-term plasticity effect and therefore occurs only at intervals in the time series. Trained on a large amount of speech data containing cross-talk noise, the resulting neural network acoustic model achieves robust feature extraction and robust acoustic modeling under cross-talk noise; suppressing the influence of non-target streams on feature extraction improves the robustness of the acoustic model.
To achieve the above object, the technical solution adopted by the present invention is:
A continuous speech recognition method based on the selective attention principle, comprising the following steps.
Step 1: construct a deep long short-term memory recurrent neural network based on the selective attention principle.
The path from the input to the hidden layer is defined as one LSTM recurrent neural network; "deep" means that the output of each LSTM network is the input of the next, and so on, with the last network's output serving as the output of the whole system. In each LSTM network, the speech signal x_t is the input at time t, x_{t-1} the input at time t-1, and so on; over the whole duration the input is x = [x_1, ..., x_T], where t ∈ [1, T] and T is the total time length of the speech signal. The LSTM network at time t consists of an attention gate, input gate, output gate, forget gate, memory cell, tanh functions, a hidden layer, and multipliers; the LSTM network at time t-1 consists of an input gate, output gate, forget gate, memory cell, tanh functions, a hidden layer, and multipliers. The hidden-layer output over the whole duration is y = [y_1, ..., y_T].
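The layer chaining described above (each LSTM network's output sequence feeding the next) can be sketched as follows; `deep_forward` and the sequence-to-sequence callables are illustrative stand-ins, not names from the patent:

```python
def deep_forward(x_seq, layers):
    """Chain LSTM layers as described above: each layer's full output
    sequence is the next layer's input sequence, and the last layer's
    output sequence is the output of the whole system.

    `layers` is a list of callables mapping a sequence to a sequence,
    each standing in for one LSTM recurrent network.
    """
    seq = x_seq
    for layer in layers:
        seq = layer(seq)  # output of layer k becomes input of layer k+1
    return seq
```

With real LSTM layers each callable would carry its own weights; here any sequence transform demonstrates the chaining.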
Step 2: construct a deep LSTM recurrent neural network acoustic model based on the selective attention principle.
Building on Step 1, the deep LSTM network at every s-th time step has an attention gate, while the deep LSTM networks at all other time steps do not; that is, the acoustic model based on the selective attention principle consists of deep LSTM recurrent neural networks whose attention gates occur at intervals.
Recognition under complex environmental interference, especially cross-talk noise, has long been one of the difficulties of speech recognition and has hindered its large-scale application. Compared with the prior art, the present invention draws on the brain's selective attention when processing speech in complex scenes to suppress non-target streams and enhance the target stream: an attention gate unit is added to the deep LSTM recurrent neural network acoustic model to represent the instantaneous functional changes of auditory-cortex neurons. The attention gate differs from the other gate units in that the other gates correspond one-to-one with the time series, whereas the attention gate reflects the short-term plasticity effect and therefore occurs at intervals in the time series. On continuous speech recognition data sets containing cross-talk noise, this method achieves better performance than deep neural network methods.
Brief Description of the Drawings
Fig. 1 is a flowchart of the deep long short-term memory recurrent neural network based on the selective attention principle of the present invention.
Fig. 2 is a flowchart of the deep long short-term memory neural network acoustic model based on the selective attention principle of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings and examples.
The present invention uses a deep LSTM recurrent neural network acoustic model based on the selective attention principle to realize continuous speech recognition. The model and method provided by the invention are not limited to continuous speech recognition, however, and may be applied to any method or apparatus related to speech recognition.
The present invention mainly comprises the following steps.
Step 1: construct a deep long short-term memory recurrent neural network based on the selective attention principle.
As shown in Fig. 1, input 101 and input 102 are the speech-signal inputs x_t and x_{t-1} at times t and t-1 (t ∈ [1, T], where T is the total time length of the speech signal). The LSTM network at time t consists of attention gate 103, input gate 104, forget gate 105, memory cell 106, output gate 107, tanh function 108, tanh function 109, hidden layer 110, multiplier 122, and multiplier 123; the LSTM network at time t-1 consists of input gate 112, forget gate 113, memory cell 114, output gate 115, tanh function 116, tanh function 117, hidden layer 118, multiplier 120, and multiplier 121. The hidden-layer outputs at times t and t-1 are output 111 and output 119, respectively.
Input 102 serves simultaneously as the input to input gate 112, forget gate 113, output gate 115, and tanh function 116. The outputs of input gate 112 and tanh function 116 are fed to multiplier 120, whose result is the input to memory cell 114. The output of memory cell 114 is the input to tanh function 117; the outputs of tanh function 117 and output gate 115 are fed to multiplier 121, whose result is the input to hidden layer 118, and the output of hidden layer 118 is output 119.
Input 101, the output of memory cell 114, and the output of multiplier 121 together form the input to attention gate 103. The output of attention gate 103 and the output of multiplier 121 together form the input to tanh function 108. The output of attention gate 103, the output of memory cell 114, and the output of multiplier 121 together form the inputs to input gate 104, forget gate 105, and output gate 107, respectively. The outputs of forget gate 105 and memory cell 114 are fed to multiplier 124; the outputs of input gate 104 and tanh function 108 are fed to multiplier 122; the outputs of multiplier 124 and multiplier 122 are the input to memory cell 106. The output of memory cell 106 is the input to tanh function 109; the outputs of tanh function 109 and output gate 107 are fed to multiplier 123, whose output is the input to hidden layer 110, and the output of hidden layer 110 is output 111.
That is, the parameters at time t ∈ [1, T] are computed as follows:
G_atten_t = sigmoid(W_ax x_t + W_am m_{t-1} + W_ac Cell_{t-1} + b_a)
G_input_t = sigmoid(W_ia G_atten_t + W_im m_{t-1} + W_ic Cell_{t-1} + b_i)
G_forget_t = sigmoid(W_fa G_atten_t + W_fm m_{t-1} + W_fc Cell_{t-1} + b_f)
Cell_t = G_forget_t ⊙ Cell_{t-1} + G_input_t ⊙ tanh(W_ca G_atten_t + W_cm m_{t-1} + b_c)
G_output_t = sigmoid(W_oa G_atten_t + W_om m_{t-1} + W_oc Cell_{t-1} + b_o)
m_t = G_output_t ⊙ tanh(Cell_t)
y_t = softmax_k(W_ym m_t + b_y)
Here G_atten_t is the output of attention gate 103 at time t, G_input_t the output of input gate 104 at time t, G_forget_t the output of forget gate 105 at time t, Cell_t the output of memory cell 106 at time t, G_output_t the output of output gate 107 at time t, m_t the input of hidden layer 110 at time t, and y_t the output 111 at time t; x_t is input 101 at time t, m_{t-1} the input of hidden layer 118 at time t-1, and Cell_{t-1} the output of memory cell 114 at time t-1. W_ax is the weight between attention gate a at time t and input x at time t; W_am between attention gate a at time t and hidden-layer input m at time t-1; W_ac between attention gate a at time t and memory cell c at time t-1; W_ia between input gate i at time t and attention gate a at time t; W_im between input gate i at time t and hidden-layer input m at time t-1; W_ic between input gate i at time t and memory cell c at time t-1; W_fa between forget gate f at time t and attention gate a at time t; W_fm between forget gate f at time t and hidden-layer input m at time t-1; W_fc between forget gate f at time t and memory cell c at time t-1; W_ca between memory cell c at time t and attention gate a at time t; W_cm between memory cell c at time t and hidden-layer input m at time t-1; W_oa between output gate o at time t and attention gate a at time t; W_om between output gate o at time t and hidden-layer input m at time t-1; and W_oc between output gate o at time t and memory cell c at time t-1. b_a is the bias of attention gate a, b_i of input gate i, b_f of forget gate f, b_c of memory cell c, b_o of output gate o, and b_y of output y; different b denote different biases. Furthermore, softmax_k(x) = exp(x_k) / Σ_l exp(x_l), where x_k is the input of the k-th softmax function, k ∈ [1, K], l ∈ [1, K], and the sum runs over all l; ⊙ denotes element-wise multiplication.
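A minimal NumPy sketch of one timestep of the update equations above. Parameter names mirror the patent's symbols (W_ax, W_am, ..., b_y), but the shapes, the dict-based interface, and the function name are illustrative assumptions, not part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def attention_lstm_step(x_t, m_prev, cell_prev, p):
    """One attention-gated LSTM timestep; p maps symbol names to arrays."""
    # Attention gate: driven by the current input, the previous hidden
    # output m_{t-1}, and the previous cell state Cell_{t-1}.
    g_atten = sigmoid(p["W_ax"] @ x_t + p["W_am"] @ m_prev
                      + p["W_ac"] @ cell_prev + p["b_a"])
    # Input, forget, and output gates take the attention gate's output
    # in place of the raw input x_t.
    g_input = sigmoid(p["W_ia"] @ g_atten + p["W_im"] @ m_prev
                      + p["W_ic"] @ cell_prev + p["b_i"])
    g_forget = sigmoid(p["W_fa"] @ g_atten + p["W_fm"] @ m_prev
                       + p["W_fc"] @ cell_prev + p["b_f"])
    cell = (g_forget * cell_prev
            + g_input * np.tanh(p["W_ca"] @ g_atten
                                + p["W_cm"] @ m_prev + p["b_c"]))
    g_output = sigmoid(p["W_oa"] @ g_atten + p["W_om"] @ m_prev
                       + p["W_oc"] @ cell_prev + p["b_o"])
    m_t = g_output * np.tanh(cell)
    y_t = softmax(p["W_ym"] @ m_t + p["b_y"])
    return y_t, m_t, cell
```

The ⊙ of the equations becomes NumPy's element-wise `*`, and each weight applies via matrix-vector multiplication `@`.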
Step 2: construct a deep LSTM recurrent neural network acoustic model based on the selective attention principle.
Building on Step 1, the deep LSTM network at every s-th time step (s = 5) has an attention gate, while the deep LSTM networks at other time steps do not; that is, the acoustic model based on the selective attention principle consists of deep LSTM recurrent neural networks whose attention gates occur at intervals. Fig. 2 shows the resulting acoustic model: the deep LSTM network at time t has attention gate 201, the network at time t-s has attention gate 202, and so on.
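The interval placement can be sketched as a per-timestep dispatch between two cell variants. The function and step-function names are assumptions for illustration; the patent only specifies that an attention gate is present every s = 5 steps:

```python
def run_layer(xs, step_with_attention, step_standard, init_m, init_cell, s=5):
    """Run one layer over a sequence, dispatching per timestep.

    The cell at t = 1, 1+s, 1+2s, ... carries an attention gate; all
    other timesteps use a standard LSTM cell without one. Both step
    functions are assumed to have the signature
    (x_t, m_prev, cell_prev) -> (y_t, m_t, cell_t).
    """
    m, cell = init_m, init_cell
    ys = []
    for t, x_t in enumerate(xs, start=1):  # 1-indexed time, as in the patent
        step = step_with_attention if (t - 1) % s == 0 else step_standard
        y_t, m, cell = step(x_t, m, cell)
        ys.append(y_t)
    return ys
```

Stub step functions that merely label which variant ran are enough to check the placement pattern.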
Claims (2)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510122982.6A CN104700828B (en) | 2015-03-19 | 2015-03-19 | The construction method of depth shot and long term memory Recognition with Recurrent Neural Network acoustic model based on selective attention principle |
PCT/CN2015/092381 WO2016145850A1 (en) | 2015-03-19 | 2015-10-21 | Construction method for deep long short-term memory recurrent neural network acoustic model based on selective attention principle |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510122982.6A CN104700828B (en) | 2015-03-19 | 2015-03-19 | The construction method of depth shot and long term memory Recognition with Recurrent Neural Network acoustic model based on selective attention principle |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104700828A CN104700828A (en) | 2015-06-10 |
CN104700828B true CN104700828B (en) | 2018-01-12 |
Family
ID=53347887
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510122982.6A Expired - Fee Related CN104700828B (en) | 2015-03-19 | 2015-03-19 | The construction method of depth shot and long term memory Recognition with Recurrent Neural Network acoustic model based on selective attention principle |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104700828B (en) |
WO (1) | WO2016145850A1 (en) |
Families Citing this family (59)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104700828B (en) * | 2015-03-19 | 2018-01-12 | 清华大学 | The construction method of depth shot and long term memory Recognition with Recurrent Neural Network acoustic model based on selective attention principle |
CN105185374B (en) * | 2015-09-11 | 2017-03-29 | 百度在线网络技术(北京)有限公司 | Prosody hierarchy mask method and device |
KR102313028B1 (en) * | 2015-10-29 | 2021-10-13 | 삼성에스디에스 주식회사 | System and method for voice recognition |
CN105354277B (en) * | 2015-10-30 | 2020-11-06 | 中国船舶重工集团公司第七0九研究所 | Recommendation method and system based on recurrent neural network |
KR102494139B1 (en) * | 2015-11-06 | 2023-01-31 | 삼성전자주식회사 | Apparatus and method for training neural network, apparatus and method for speech recognition |
US10043512B2 (en) * | 2015-11-12 | 2018-08-07 | Google Llc | Generating target sequences from input sequences using partial conditioning |
CN105513591B (en) * | 2015-12-21 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | The method and apparatus for carrying out speech recognition with LSTM Recognition with Recurrent Neural Network model |
EP3374932B1 (en) * | 2016-02-03 | 2022-03-16 | Google LLC | Compressed recurrent neural network models |
EP3398118B1 (en) * | 2016-02-04 | 2023-07-12 | Deepmind Technologies Limited | Associative long short-term memory neural network layers |
US9799327B1 (en) * | 2016-02-26 | 2017-10-24 | Google Inc. | Speech recognition with attention-based recurrent neural networks |
US10373612B2 (en) * | 2016-03-21 | 2019-08-06 | Amazon Technologies, Inc. | Anchored speech detection and speech recognition |
JP6480644B1 (en) * | 2016-03-23 | 2019-03-13 | グーグル エルエルシー | Adaptive audio enhancement for multi-channel speech recognition |
CN107293291B (en) * | 2016-03-30 | 2021-03-16 | 中国科学院声学研究所 | An end-to-end speech recognition method based on adaptive learning rate |
CN105956469B (en) * | 2016-04-27 | 2019-04-26 | 百度在线网络技术(北京)有限公司 | File security recognition methods and device |
CN106096729B (en) * | 2016-06-06 | 2018-11-20 | 天津科技大学 | A kind of depth-size strategy learning method towards complex task in extensive environment |
US11222253B2 (en) | 2016-11-03 | 2022-01-11 | Salesforce.Com, Inc. | Deep neural network model for processing data through multiple linguistic task hierarchies |
CN108062505B (en) | 2016-11-09 | 2022-03-18 | 微软技术许可有限责任公司 | Method and apparatus for neural network based motion detection |
CN106650789B (en) * | 2016-11-16 | 2023-04-07 | 同济大学 | Image description generation method based on depth LSTM network |
KR102692670B1 (en) * | 2017-01-04 | 2024-08-06 | 삼성전자주식회사 | Voice recognizing method and voice recognizing appratus |
US10241684B2 (en) * | 2017-01-12 | 2019-03-26 | Samsung Electronics Co., Ltd | System and method for higher order long short-term memory (LSTM) network |
US10769522B2 (en) | 2017-02-17 | 2020-09-08 | Wipro Limited | Method and system for determining classification of text |
CN107293288B (en) * | 2017-06-09 | 2020-04-21 | 清华大学 | An Acoustic Model Modeling Method of Residual Long Short-Term Memory Recurrent Neural Network |
CN107492121B (en) * | 2017-07-03 | 2020-12-29 | 广州新节奏智能科技股份有限公司 | Two-dimensional human body bone point positioning method of monocular depth video |
CN107484017B (en) * | 2017-07-25 | 2020-05-26 | 天津大学 | A Supervised Video Summary Generation Method Based on Attention Model |
CN109460812B (en) * | 2017-09-06 | 2021-09-14 | 富士通株式会社 | Intermediate information analysis device, optimization device, and feature visualization device for neural network |
CN107563122B (en) * | 2017-09-20 | 2020-05-19 | 长沙学院 | Crime prediction method based on an interleaved time-series locally connected recurrent neural network |
CN107993636B (en) * | 2017-11-01 | 2021-12-31 | 天津大学 | Recurrent neural network-based music score modeling and generation method |
CN109243494B (en) * | 2018-10-30 | 2022-10-11 | 南京工程学院 | Child emotion recognition method based on a multi-attention long short-term memory network |
CN109243493B (en) * | 2018-10-30 | 2022-09-16 | 南京工程学院 | Infant cry emotion recognition method based on an improved long short-term memory network |
CN109614485B (en) * | 2018-11-19 | 2023-03-14 | 中山大学 | Sentence matching method and device of hierarchical Attention based on grammar structure |
CN109543165B (en) * | 2018-11-21 | 2022-09-23 | 中国人民解放军战略支援部队信息工程大学 | Text generation method and device based on circular convolution attention model |
CN109523995B (en) * | 2018-12-26 | 2019-07-09 | 出门问问信息科技有限公司 | Speech recognition method, speech recognition apparatus, readable storage medium and electronic device |
CN109866713A (en) * | 2019-03-21 | 2019-06-11 | 斑马网络技术有限公司 | Safety detection method and device, vehicle |
CN110135634B (en) * | 2019-04-29 | 2022-01-25 | 广东电网有限责任公司电网规划研究中心 | Medium- and long-term power load prediction device |
CN110085249B (en) * | 2019-05-09 | 2021-03-16 | 南京工程学院 | Single-channel speech enhancement method of recurrent neural network based on attention gating |
CN110473554B (en) * | 2019-08-08 | 2022-01-25 | Oppo广东移动通信有限公司 | Audio verification method and device, storage medium and electronic equipment |
CN110473529B (en) * | 2019-09-09 | 2021-11-05 | 北京中科智极科技有限公司 | Streaming speech transcription system based on a self-attention mechanism |
CN111079906B (en) * | 2019-12-30 | 2023-05-05 | 燕山大学 | Method and system for predicting specific surface area of cement products based on long-short-term memory network |
CN111314345B (en) * | 2020-02-19 | 2022-09-16 | 安徽大学 | Method and device for protecting sequence data privacy, computer equipment and storage medium |
CN111311009B (en) * | 2020-02-24 | 2023-05-26 | 广东工业大学 | Pedestrian trajectory prediction method based on long short-term memory |
CN111429938B (en) * | 2020-03-06 | 2022-09-13 | 江苏大学 | Single-channel voice separation method and device and electronic equipment |
CN111695607B (en) * | 2020-05-25 | 2025-01-21 | 北京信息科技大学 | Electronic equipment fault prediction method based on LSTM enhanced model |
CN111709754B (en) * | 2020-06-12 | 2023-08-25 | 中国建设银行股份有限公司 | User behavior feature extraction method, device, equipment and system |
CN111814849B (en) * | 2020-06-22 | 2024-02-06 | 浙江大学 | A fault early warning method for key components of wind turbines based on DA-RNN |
CN111985610B (en) * | 2020-07-15 | 2024-05-07 | 中国石油大学(北京) | Oil pumping well pump efficiency prediction system and method based on time sequence data |
CN111930602B (en) * | 2020-08-13 | 2023-09-22 | 中国工商银行股份有限公司 | Performance index prediction method and device |
CN112001482B (en) * | 2020-08-14 | 2024-05-24 | 佳都科技集团股份有限公司 | Vibration prediction and model training method, device, computer equipment and storage medium |
CN112214852B (en) * | 2020-10-09 | 2022-10-14 | 电子科技大学 | Turbomachinery performance degradation prediction method accounting for degradation rate |
CN112382265B (en) * | 2020-10-21 | 2024-05-28 | 西安交通大学 | Active noise reduction method, storage medium and system based on deep cyclic neural network |
CN112434784A (en) * | 2020-10-22 | 2021-03-02 | 暨南大学 | Deep student performance prediction method based on multilayer LSTM |
CN112906291B (en) * | 2021-01-25 | 2023-05-19 | 武汉纺织大学 | A neural network-based modeling method and device |
CN112784472B (en) * | 2021-01-27 | 2023-03-24 | 电子科技大学 | Simulation method using a recurrent neural network to simulate the quantum conditional master equation in quantum transport processes |
CN113792772B (en) * | 2021-09-01 | 2023-11-03 | 中国船舶重工集团公司第七一六研究所 | Hot and cold data identification method for tiered hybrid data storage |
CN114511067A (en) * | 2022-02-02 | 2022-05-17 | 上海图灵智算量子科技有限公司 | Quantum-based method and system for implementing long short-term memory |
CN115034129B (en) * | 2022-05-17 | 2024-08-20 | 齐鲁工业大学 | NOx emission concentration soft measurement method for thermal power plant denitration device |
US11995658B2 (en) * | 2022-05-25 | 2024-05-28 | Dell Products L.P. | Machine learning-based detection of potentially malicious behavior on an e-commerce platform |
CN115563475A (en) * | 2022-10-25 | 2023-01-03 | 南京工业大学 | Pressure soft sensor of excavator hydraulic system |
CN117849628B (en) * | 2024-03-08 | 2024-05-10 | 河南科技学院 | Lithium ion battery health state estimation method based on time sequence transformation memory network |
CN118824493B (en) * | 2024-09-18 | 2024-12-06 | 苏州阿基米德网络科技有限公司 | Medical equipment scheduling prediction method based on dynamic long-short time series attention |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102983819A (en) * | 2012-11-08 | 2013-03-20 | 南京航空航天大学 | Power amplifier simulation method and simulation device |
CN103049792A (en) * | 2011-11-26 | 2013-04-17 | 微软公司 | Discriminative pretraining of Deep Neural Network |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN104217226A (en) * | 2014-09-09 | 2014-12-17 | 天津大学 | Dialogue act identification method based on deep neural networks and conditional random fields |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7647284B2 (en) * | 2007-01-12 | 2010-01-12 | Toyota Motor Engineering & Manufacturing North America, Inc. | Fixed-weight recurrent neural network controller with fixed long-term and adaptive short-term memory |
CN104700828B (en) * | 2015-03-19 | 2018-01-12 | 清华大学 | Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle |
- 2015-03-19 CN CN201510122982.6A patent/CN104700828B/en not_active Expired - Fee Related
- 2015-10-21 WO PCT/CN2015/092381 patent/WO2016145850A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
"Towards end-to-end speech recognition with recurrent neural networks"; Alex Graves et al.; Proceedings of the 31st International Conference on Machine Learning; 2014-12-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN104700828A (en) | 2015-06-10 |
WO2016145850A1 (en) | 2016-09-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104700828B (en) | Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle | |
CN104538028B (en) | A continuous speech recognition method based on deep long short-term memory recurrent neural networks | |
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
US11715486B2 (en) | Convolutional, long short-term memory, fully connected deep neural networks | |
Sun et al. | Speech emotion recognition based on DNN-decision tree SVM model | |
Gelly et al. | Optimization of RNN-based speech activity detection | |
Nakkiran et al. | Compressing deep neural networks using a rank-constrained topology. | |
Oord et al. | Parallel wavenet: Fast high-fidelity speech synthesis | |
US10262260B2 (en) | Method and system for joint training of hybrid neural networks for acoustic modeling in automatic speech recognition | |
CN107293288B (en) | An Acoustic Model Modeling Method of Residual Long Short-Term Memory Recurrent Neural Network | |
EP3459077B1 (en) | Permutation invariant training for talker-independent multi-talker speech separation | |
CN111243579B (en) | Time domain single-channel multi-speaker voice recognition method and system | |
CN105139864B (en) | Speech recognition method and device | |
Sainath et al. | Convolutional, long short-term memory, fully connected deep neural networks | |
Guiming et al. | Speech recognition based on convolutional neural networks | |
Li et al. | A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition | |
JP7143091B2 (en) | Method and apparatus for training acoustic models | |
KR20160069329A (en) | Method and apparatus for training language model, method and apparatus for recognizing speech | |
US11205419B2 (en) | Low energy deep-learning networks for generating auditory features for audio processing pipelines | |
Han et al. | Self-supervised learning with cluster-aware-dino for high-performance robust speaker verification | |
CN110853656A (en) | Audio Tampering Recognition Algorithm Based on Improved Neural Network | |
Kang et al. | Gated recurrent units based hybrid acoustic models for robust speech recognition | |
Li et al. | Improving long short-term memory networks using maxout units for large vocabulary speech recognition | |
Bijwadia et al. | Unified end-to-end speech recognition and endpointing for fast and efficient speech systems | |
Cornell et al. | Implicit acoustic echo cancellation for keyword spotting and device-directed speech detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20180112 |