CN111477220B

CN111477220B - Neural network voice recognition method and system for home spoken language environment

Info

Publication number: CN111477220B
Application number: CN202010295068.2A
Authority: CN
Inventors: 张晖; 程铭; 赵海涛; 孙雁飞; 倪艺洋; 朱洪波
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-04-15
Filing date: 2020-04-15
Publication date: 2023-04-25
Anticipated expiration: 2040-04-15
Also published as: JP7166683B2; CN111477220A; WO2021208455A1; JP2022540968A

Abstract

The invention discloses a neural network speech recognition method and system oriented to a home oral environment. The method includes: model construction: adding a long-term and short-term memory network to a deep neural network to construct a combined neural network DNN-LSTM model; The data set is preprocessed to obtain the feature vector set, and the feature vector set is used as the input of the DNN-LSTM model for iterative training to train to the optimal acoustic model; an input speech signal of an unknown language is trained to the DNN-LSTM model. The LSTM model obtains the Chinese output probability vector set and the English output probability vector set respectively; performs language matching according to the Chinese output probability vector set and the English output probability vector set, and outputs the judgment result; the present invention can quickly and accurately identify The content of the speaker can be widely used in actual home scenes.

Description

A Neural Network Speech Recognition Method and System Oriented to Home Spoken Language Environment

技术领域technical field

本发明属于智能识别技术领域，具体涉及一种面向家居口语环境的神经网络语音识别方法及系统。The invention belongs to the technical field of intelligent recognition, and in particular relates to a neural network speech recognition method and system for home spoken language environment.

背景技术Background technique

语音识别研究的重点对象是语音，将语音信号转换成可由计算机所识别的信息，从而识别说话人的语音命令及文字内容。语音识别的方法基本可以分为三种：基于语言学和声学、模型匹配和神经网络三种方法。第一种方法虽然出现较早，但由于其模型复杂的局限性，还没到达较为实用的阶段；第二种方法中应用较多的是隐马尔可夫模型，可用于标注问题的概率模型，并呈现出该模型随机生成观测序列，使语音识别技术得到很大的提升。第三种方法使用浅层神经网络学习训练容易造成梯度不稳定，并且人工提取样本特征费时费力，识别效果不是很好。在传统的语音识别系统中，GMM-HMM的声学建模方法在实际中是应用最广泛的，但是在家居环境下处理一些复杂的语音信号问题时，传统模型的应用场景就显得比较单一。The key object of speech recognition research is speech, which converts speech signals into information that can be recognized by computers, so as to recognize the speaker's speech commands and text content. Speech recognition methods can basically be divided into three types: three methods based on linguistics and acoustics, model matching and neural networks. Although the first method appeared earlier, due to the limitations of its complex model, it has not yet reached a more practical stage; the second method is mostly used in the hidden Markov model, which can be used to label the probability model of the problem, And it shows that the model randomly generates observation sequences, which greatly improves the speech recognition technology. The third method uses shallow neural network learning and training to easily cause gradient instability, and manual extraction of sample features is time-consuming and laborious, and the recognition effect is not very good. In the traditional speech recognition system, the acoustic modeling method of GMM-HMM is the most widely used in practice, but when dealing with some complex speech signal problems in the home environment, the application scenario of the traditional model is relatively simple.

发明内容Contents of the invention

发明目的：为了克服现有技术的不足，本发明提供一种面向家居口语环境的神经网络语音识别方法，该方法可以解决语音识别率低以及识别效率差的问题，本发明还提供一种面向家居口语环境的神经网络语音识别系统。Purpose of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a neural network speech recognition method oriented to the home spoken language environment, which can solve the problems of low speech recognition rate and poor recognition efficiency. The present invention also provides a home-oriented A Neural Network Speech Recognition System for Spoken Language Environments.

技术方案：一方面，本发明所述的面向家居口语环境的神经网络语音识别方法，该方法包括：Technical solution: On the one hand, the neural network speech recognition method for home spoken language environment according to the present invention, the method includes:

模型构建：在深度神经网络中加入长短期记忆网络，构建组合神经网络DNN-LSTM模型；Model construction: add a long-term short-term memory network to the deep neural network to build a combined neural network DNN-LSTM model;

模型训练：Model training:

中文语音数据训练：对采集的中文语音数据集预处理，得到中文特征向量集，并将所述中文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练至最优中文声学模型；Chinese voice data training: preprocessing the collected Chinese voice data set to obtain a Chinese feature vector set, and using the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to train to an optimal Chinese acoustic model;

英文语音数据训练：对采集的英文语音数据集预处理，得到英文特征向量集，并将所述英文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练至最优英文声学模型；English speech data training: preprocessing the collected English speech data set to obtain the English feature vector set, and using the English feature vector set as the input of the DNN-LSTM model for iterative training to train to the optimal English acoustic model;

模型测试：Model test:

将一个未知语种的输入语音信号voice0，分别经过所述中文声学模型和所述英文声学模型，分别得到中文输出概率向量集和英文输出概率向量集；An input speech signal voice0 of an unknown language is passed through the Chinese acoustic model and the English acoustic model respectively to obtain a Chinese output probability vector set and an English output probability vector set;

根据所述中文输出概率向量集和英文输出概率向量集进行语种匹配，并输出判断结果。Language matching is performed according to the Chinese output probability vector set and the English output probability vector set, and a judgment result is output.

进一步地，包括：Further, include:

所述组合神经网络DNN-LSTM模型包括输入层、长短期记忆网络、第二隐藏层、第三隐藏层、第四隐藏层和输出层，所述长短期记忆网络作为第一隐藏层。The combined neural network DNN-LSTM model includes an input layer, a long short-term memory network, a second hidden layer, a third hidden layer, a fourth hidden layer and an output layer, and the long short-term memory network is used as the first hidden layer.

进一步地，包括：Further, include:

所述第一隐藏层的节点数为512个，其激活函数选择sigmoid函数和tanh函数，第二隐藏层、第三隐藏层和第四隐藏层的节点数均为1024个，激活函数选择sigmoid函数。The number of nodes of the first hidden layer is 512, and its activation function selects sigmoid function and tanh function, and the number of nodes of the second hidden layer, the third hidden layer and the fourth hidden layer is 1024, and the activation function selects sigmoid function .

进一步地，包括：Further, include:

所述中文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练步骤包括：The Chinese feature vector set is carried out iterative training as the input of the DNN-LSTM model, and the training steps include:

(1)初始化模型结构中的权值矩阵W和偏置向量b为一个随机值；(1) Initialize the weight matrix W and bias vector b in the model structure as a random value;

(2)开始从第1次到最大次数的迭代；在每次迭代中，都是从第一条语音数据训练样本开始遍历至最后一个训练样本；(2) Start iterations from the 1st to the maximum number of times; in each iteration, start traversing from the first voice data training sample to the last training sample;

(3)在每一个训练样本的训练过程中，将对应的特征向量输入到输入层；从第一隐藏层开始遍历到输出层，并采用前向传播算法表示出正在遍历的对应层，然后根据损失函数表示输出层；前向传播算法完成后，开始从第四隐藏层遍历至第一隐藏层，采用反向传播算法表示对应的第一隐藏层；(3) During the training process of each training sample, input the corresponding feature vector to the input layer; traverse from the first hidden layer to the output layer, and use the forward propagation algorithm to indicate the corresponding layer being traversed, and then according to The loss function represents the output layer; after the forward propagation algorithm is completed, it starts to traverse from the fourth hidden layer to the first hidden layer, and uses the back propagation algorithm to represent the corresponding first hidden layer;

(4)反向传播算法完成后，从第一隐藏层到输出层的顺序开始遍历，并更新对应层的权值矩阵和偏置向量Wⁿ、bⁿ，至此一次迭代过程中对于一个样本的训练就结束了；此时若样本没有遍历完，则继续遍历样本；若样本已经遍历完，则进行下一次的迭代；(4) After the backpropagation algorithm is completed, it traverses from the first hidden layer to the output layer, and updates the weight matrix and bias vector W ⁿ , b ⁿ of the corresponding layer. The training is over; at this time, if the sample has not been traversed, continue to traverse the sample; if the sample has been traversed, proceed to the next iteration;

(5)当全部W、b的改变值都不超过迭代阈值，则停止迭代循环；(5) When the change values of all W and b do not exceed the iteration threshold, stop the iterative cycle;

(6)保存各层的最优权值矩阵W和偏置向量b。(6) Save the optimal weight matrix W and bias vector b of each layer.

进一步地，包括：Further, include:

根据所述中文输出概率向量集和英文输出概率向量集进行语种匹配，包括：Perform language matching according to the Chinese output probability vector set and the English output probability vector set, including:

利用信息熵公式分别计算中文输出概率向量集P和英文输出概率向量集P'对应的信息熵，分别对应记为H和H'；其中，P＝{p₁，p₂,...,p_q}，P'＝{p₁'，p'₂,...,p_t'}，q为中文声学模型输出分类的总数，t为英文声学模型输出分类的总数；Use the information entropy formula to calculate the information entropy corresponding to the Chinese output probability vector set P and the English output probability vector set P', which are respectively recorded as H and H'; where, P={p ₁ , p ₂ ,...,p _q }, P'={p ₁ ', p' ₂ ,..., p _t '}, q is the total number of Chinese acoustic model output categories, and t is the total number of English acoustic model output categories;

若从中文声学模型输出的概率向量集中存在p_i明显大于其他的概率值，从英文声学模型输出的概率向量集中的各个概率值相差不大，其中，1≤i≤q；If there are probability values in the probability vector set output from the Chinese acoustic model that p _i is significantly greater than other probability values, the probability values in the probability vector set output from the English acoustic model are not much different, where 1≤i≤q;

且若输入到中文声学模型对应的信息熵H比输入到英文声学模型对应的信息熵H'小，则对应的未知语种的输入语音信号voice0为中文，并将中文声学模型的输出概率作为最后输出结果；And if the information entropy H corresponding to the Chinese acoustic model is smaller than the information entropy H' corresponding to the English acoustic model, the corresponding unknown language input voice signal voice0 is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;

若从英文声学模型输出的概率向量集中存在p'_j明显大于其他的概率值，从中文声学模型输出的概率向量集中的各个概率值相差不大，If there is a probability value in the probability vector set output from the English acoustic model that p' _j is significantly larger than other probability values, the probability values in the probability vector set output from the Chinese acoustic model are not much different.

且若输入到英文声学模型对应的信息熵H'比输入到中文声学模型对应的信息熵H小，则对应的未知语种的输入语音信号voice0为英文，并将英文声学模型的输出概率作为最后输出结果。And if the information entropy H' corresponding to the input to the English acoustic model is smaller than the information entropy H corresponding to the input to the Chinese acoustic model, the corresponding unknown language input voice signal voice0 is English, and the output probability of the English acoustic model is taken as the final output result.

另一方面，本发明还提供一种上述面向家居口语环境的神经网络语音识别方法实现的系统，该系统包括：On the other hand, the present invention also provides a system realized by the above-mentioned neural network speech recognition method for home oral environment, the system includes:

模型构建模块，用于在深度神经网络中加入长短期记忆网络，构建组合神经网络DNN-LSTM模型；A model building block for adding a long short-term memory network to a deep neural network to construct a combined neural network DNN-LSTM model;

模型训练模块，其又包括中文模型训练单元和英文模型训练单元，所述中文模型训练单元，用于对采集的中文语音数据集预处理，得到中文特征向量集，并将所述中文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练至最优中文声学模型；Model training module, it comprises Chinese model training unit and English model training unit again, described Chinese model training unit is used for the Chinese speech data set preprocessing of collection, obtains Chinese feature vector set, and described Chinese feature vector set Carry out iterative training as the input of described DNN-LSTM model, train to optimal Chinese acoustic model;

英文模型训练单元，用于对采集的英文语音数据集预处理，得到英文特征向量集，并将所述英文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练至最优英文声学模型；The English model training unit is used to preprocess the collected English speech data set to obtain the English feature vector set, and use the English feature vector set as the input of the DNN-LSTM model for iterative training to achieve the optimal English acoustics Model;

模型测试模块，其又包括语音输入单元和语音类型判断单元，所述语音输入单元，用于将一个未知语种的输入语音信号voice0，分别经过所述中文声学模型和所述英文声学模型，分别得到中文输出概率向量集和英文输出概率向量集；所述语音类型判断单元，用于根据所述中文输出概率向量集和英文输出概率向量集进行语种匹配，并输出判断结果。Model testing module, it comprises speech input unit and speech type judging unit again, and described speech input unit is used for the input speech signal voice0 of an unknown language, passes through described Chinese acoustic model and described English acoustic model respectively, obtains respectively The Chinese output probability vector set and the English output probability vector set; the speech type judgment unit is used to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result.

有益效果：本发明与现有技术相比，其显著优点是：本发明结合LSTM利用记忆单元可以记录很长的历史信息的特征以及DNN可以有效的提取数据中的高层次信息的特征的特点，提出在DNN隐藏层的第一层加入LSTM的想法，构建了DNN和LSTM相结合的组合神经网络进行声学建模，并对中文数据集和英文数据集进行训练和测试，得到中文声学模型和英文声学模型，并通过引用熵的概念来比较输入语音信号在中文声学模型和英文声学模型的输出结果，将熵值较小的结果作为声学模型的输出结果，从而达到简单语种识别的目的，并且提高了整体的语音识别率，进而能够快速准确的识别家居场景下说话人的内容，可以广泛应用于实际家居场景。Beneficial effects: Compared with the prior art, the present invention has the remarkable advantages that: the present invention combines the feature that LSTM utilizes the memory unit to record very long historical information and the feature that DNN can effectively extract high-level information in the data, The idea of adding LSTM to the first layer of the hidden layer of DNN was proposed, and a combined neural network combining DNN and LSTM was constructed for acoustic modeling, and the Chinese and English data sets were trained and tested, and the Chinese acoustic model and English Acoustic model, and compare the output results of the input speech signal in the Chinese acoustic model and the English acoustic model by referring to the concept of entropy, and use the result with a smaller entropy value as the output result of the acoustic model, so as to achieve the purpose of simple language recognition and improve The overall speech recognition rate is improved, and then it can quickly and accurately identify the content of the speaker in the home scene, which can be widely used in the actual home scene.

附图说明Description of drawings

图1为本发明所述的面向家居口语环境的组合神经网络语音识别算法总体结构框图；Fig. 1 is the overall structural block diagram of the combined neural network speech recognition algorithm facing home spoken language environment of the present invention;

图2为本发明所述的DNN-LSTM模型结构图；Fig. 2 is the DNN-LSTM model structural diagram of the present invention;

图3为LSTM整体结构图。Figure 3 is the overall structure diagram of LSTM.

具体实施方式Detailed ways

为了更加详细的描述本发明提出的面向家居口语环境的组合神经网络语音识别算法，结合附图，举例说明如下。In order to describe in more detail the combined neural network speech recognition algorithm oriented to the home spoken language environment proposed by the present invention, an example is illustrated as follows in conjunction with the accompanying drawings.

如图1为面向家居口语环境的组合神经网络语音识别算法总体结构框图，首先结合DNN和LSTM的特点，构建DNN-LSTM模型；然后，采用DNN-LSTM模型对中文数据集和英文数据集进行训练，保存中文声学模型和英文声学模型；最后，通过语种匹配输出结果，从而到达语种识别和语音识别的目的。As shown in Figure 1, the overall structural block diagram of the combined neural network speech recognition algorithm for home spoken language environment, firstly combine the characteristics of DNN and LSTM to construct the DNN-LSTM model; then, use the DNN-LSTM model to train the Chinese data set and the English data set , save the Chinese acoustic model and the English acoustic model; finally, output the results through language matching, so as to achieve the purpose of language recognition and speech recognition.

DNN为深度神经网络，Deep Neural Networks，LSTM为长短期记忆网络，LongShort-Term Memory，如图3是LSTM内部三门逻辑计算结构图，LSTM的核心要素是细胞状态，表示细胞状态随时间的信息传递。它沿着整个链直线贯通，只有一些微小的线性相互作用，信息很容易在没有大幅度变化的情况下流动。在传递过程中，通过当前时刻输入、上一时刻隐藏层状态、上一时刻细胞状态以及门结构来增加或删除细胞状态中的信息。在语音识别中，LSTM模型的记忆单元主要用于存储与处理语音特征，它实现了三门计算，即遗忘门、输入门和输出门，通过三个门来保护和控制当前时刻的神经元状态c_t，具体如下：DNN is a deep neural network, Deep Neural Networks, LSTM is a long-term short-term memory network, LongShort-Term Memory, as shown in Figure 3 is a three-door logic calculation structure diagram inside LSTM, the core element of LSTM is the cell state, which represents the information of the cell state over time transfer. It runs in a straight line along the entire chain, with only a few minor linear interactions, and information flows easily without drastic changes. During the transfer process, the information in the cell state is added or deleted through the input at the current moment, the hidden layer state at the previous moment, the cell state at the previous moment, and the gate structure. In speech recognition, the memory unit of the LSTM model is mainly used to store and process speech features. It implements three calculations, namely, the forgetting gate, the input gate and the output gate, and protects and controls the neuron state at the current moment through three gates. c _t , the details are as follows:

(1)输入门：该门的作用是确定输入x_t中有多少信息保留在c_t中，实现公式为：(1) Input gate: The function of this gate is to determine how much information in the input x _t is retained in c _t , and the realization formula is:

i_t＝σ(W_i·[h_t-1，x_t]+b_i) (3)i _t =σ(W _i ·[h _t-1 , x _t ]+b _i ) (3)

其中，i_t为t时刻输入门的输入，通过输入门，将输入门对应的状态

保留下来。W_i、W_c表示权值矩阵，b_i、b_c表示偏置项，x_t-1,x_t,x_t+1分别表示上一个时刻、当前时刻和下一个时刻的输入；h_t-1,h_t,h_t+1分别表示上一个时刻、当前时刻和下一个时刻的神经元状态，σ表示sigmoid函数。Among them, it is the input of the input gate at time _t , through the input gate, the corresponding state of the input gate

Keep it. W _i , W _c represent the weight matrix, b _i , b _c represent the bias items, x _t-1 , x _t , x _t+1 represent the input of the previous moment, the current moment and the next moment respectively; h _{t- 1} , h _t , h _t+1 represent the neuron state at the previous moment, the current moment and the next moment respectively, and σ represents the sigmoid function.

(2)遗忘门：该门的作用是确定t时刻输入中的c_t-1有多少成分保留在c_t中。实现公式为：(2) Forget gate: The function of this gate is to determine how many components of c _t-1 in the input at time t remain in c _t . The realization formula is:

f_t＝σ(W_f·[h_t-1，x_t]+b_f) (5)f _t = σ(W _f ·[h _t-1 , x _t ]+b _f ) (5)

其中，W_f表示权值矩阵，b_f表示偏置项。Among them, W _f represents the weight matrix, and b _f represents the bias term.

(3)输出门：该门的作用是利用控制单元c_t有多少输出到LSTM的当前输出值h_t。首先经过输入门和遗忘门之后的状态，即c_t实现公式为：(3) Output gate: the function of this gate is to use the current output value h _t of the control unit c _t to LSTM. First, the state after passing through the input gate and the forget gate, that is, the realization formula of c _t is:

其中，前半部分为信息经过遗忘门后保留在c_t中的成分，后半部分为信息经过输入门后保留在c_t中的成分。然后，为了确定c_t有多少成分保留在h_t中，输出的实现公式为：Among them, the first half is the component retained in c _t after the information passes through the forget gate, and the second half is the component retained in c _t after the information passes through the input gate. Then, to determine how much of c _t remains in h _t , the output is implemented as:

o_t＝σ(W_o[h_t-1，x_t]+b_o) (7)o _t =σ(W _o [h _t-1 , x _t ]+b _o ) (7)

其中，o_t为t时刻输出层的状态，W_o表示权重矩阵，b_o表示偏置项。最后，经过输出门，隐藏层的最终输出结果为：Among them, o _t is the state of the output layer at time t, W _o represents the weight matrix, and b _o represents the bias item. Finally, after the output gate, the final output of the hidden layer is:

h_t＝o_t*tanh(c_t) (8)h _t ＝o _t *tanh(c _t ) (8)

具体的：面向家居口语环境的组合神经网络语音识别方法包括：Specifically: the combined neural network speech recognition method for home oral environment includes:

首先，构建模型：在深度神经网络中加入长短期记忆网络，构建组合神经网络DNN-LSTM模型；First, build the model: add a long-term short-term memory network to the deep neural network to build a combined neural network DNN-LSTM model;

如图2为DNN-LSTM模型，结构如下：第0层为输入层，第1层到第4层为隐藏层，第5层为输出层，其激活函数为softmax函数。在隐藏层中，第一层为LSTM网络结构，节点数为512个，其激活函数选择sigmoid函数和tanh函数，为防止网络内对数据的过分学习，在神经单元内部加入Dropout策略；后3层均为DNN网络结构，各层的节点数为1024个，激活函数选择sigmoid函数；即组合神经网络DNN-LSTM模型包括输入层、长短期记忆网络、第二隐藏层、第三隐藏层、第四隐藏层和输出层，所述长短期记忆网络作为第一隐藏层。As shown in Figure 2, the DNN-LSTM model has the following structure: layer 0 is the input layer, layers 1 to 4 are hidden layers, layer 5 is the output layer, and its activation function is the softmax function. In the hidden layer, the first layer is an LSTM network structure with 512 nodes. Its activation function selects sigmoid function and tanh function. In order to prevent excessive learning of data in the network, a Dropout strategy is added inside the neural unit; the last 3 layers Both are DNN network structures, the number of nodes in each layer is 1024, and the activation function selects the sigmoid function; that is, the combined neural network DNN-LSTM model includes an input layer, a long-term short-term memory network, a second hidden layer, a third hidden layer, and a fourth hidden layer. A hidden layer and an output layer, the long short-term memory network is used as the first hidden layer.

该模型有6层，每一层神经元的输入向量为zⁿ，输出向量为yⁿ，则有：The model has 6 layers, the input vector of neurons in each layer is z ⁿ , and the output vector is y ⁿ , then:

zⁿ＝Wⁿz^n-1+bⁿ,n＝1,2,3,4,5 (1)z ⁿ = W ⁿ z ^n-1 + b ⁿ , n = 1,2,3,4,5 (1)

式中，Wⁿ为第n-1层到第n层的权值矩阵，bⁿ为第n层的偏置。根据输入向量可得输出为：In the formula, W ⁿ is the weight matrix from the n-1th layer to the nth layer, b ⁿ is the bias of the nth layer. According to the input vector, the output can be obtained as:

yⁿ＝f_n(zⁿ) (2)y ⁿ =f _n (z ⁿ ) (2)

式中，f_n为第n层的激活函数。In the formula, f _n is the activation function of the nth layer.

其次，进行模型训练：Second, perform model training:

中文语音数据训练：对采集的中文语音数据集预处理，得到中文特征向量集vector0，并将中文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练至最优中文声学模型；Chinese voice data training: preprocessing the collected Chinese voice data set to obtain the Chinese feature vector set vector0, and using the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to train to the optimal Chinese acoustic model;

其中，预处理操作包括：采样、预加重、加窗分帧、端点检测，并将特征向量vector0作为DNN-LSTM模型输入进行迭代训练，训练至最优声学模型China_model。Among them, the preprocessing operations include: sampling, pre-emphasis, windowing and framing, endpoint detection, and the feature vector vector0 is used as the input of the DNN-LSTM model for iterative training to the optimal acoustic model China_model.

训练的步骤如下：The training steps are as follows:

(1)将网络结构中的权值矩阵W和偏置向量b初始化为一个随机值。(1) Initialize the weight matrix W and bias vector b in the network structure to a random value.

(2)开始从第1次到最大次数的迭代；本实施例将最大次数设置为50，在每次迭代中，都是从第一个训练样本开始遍历至最后一个训练样本，其中用i来表示正在遍历的训练样本；(2) Start iterations from the first to the maximum number of times; in this embodiment, the maximum number of times is set to 50, and in each iteration, it is traversed from the first training sample to the last training sample, where i is used to Indicates the training samples being traversed;

(3)在每一个样本的训练过程中，将输入向量作为DNN的第一层输入，用a¹表示；然后从隐藏层的第一层到输出层开始遍历，用n来表示正在遍历那一层，每层都做前向传播算法计算a^i，n＝f(z^i，n)＝f(Wⁿa^i，n-1+bⁿ)，表示正在遍历的该层遍历的第i个样本对应的输入层。(3) During the training process of each sample, the input vector is used as the first layer input of DNN, denoted by a ¹ ; and then traversed from the first layer of the hidden layer to the output layer, and n is denoted by the traversing layer layer, each layer performs forward propagation algorithm to calculate a ^{i, n} = f(z ^{i, n} ) = f(W ⁿ a ^{i, n-1} + b ⁿ ), which means the i-th traversal of the layer being traversed The input layer to which the sample corresponds.

根据损失函数计算输出层δ^i，L，L即为输出层；前向传播算法完成后，开始从隐藏层的最后一层遍历至隐藏层的第一层，进行反向传播算法计算δ^i，n＝(Wⁿ⁺¹)^Tδ^i，n+1⊙算

；即正在遍历的该层正在遍历的第i个的训练样本对应的输出层，T为转置f′为求导，⊙表示同或运算。Calculate the output layer δ ^{i according to the loss function, L} is the output layer; after the forward propagation algorithm is completed, start traversing from the last layer of the hidden layer to the first layer of the hidden layer, and perform the back propagation algorithm to calculate δ ^{i, n} ＝(W ⁿ⁺¹ ) ^T δ ^{i, n+1} ⊙

; That is, the output layer corresponding to the i-th training sample that the layer is traversing is traversing, T is the transpose f' is the derivative, ⊙ represents the same OR operation.

(4)反向传播算法完成后，从隐藏层的第一层到输出层开始遍历，更新正在遍历的第n层的Wⁿ、bⁿ，则：(4) After the backpropagation algorithm is completed, start traversing from the first layer of the hidden layer to the output layer, and update W ⁿ and b ⁿ of the nth layer being traversed, then:

这样，一次迭代过程中对于某一个样本的训练就结束了；此时若样本没有遍历完，则继续遍历样本；若样本已经遍历完，则进行下一次的迭代，其中，m为训练样本总数，α是迭代步长；In this way, the training for a certain sample in one iteration process is over; at this time, if the sample has not been traversed, continue to traverse the sample; if the sample has been traversed, the next iteration will be performed, where m is the total number of training samples, α is the iteration step size;

英文语音数据训练：对采集的英文语音数据集预处理，得到英文特征向量集vector1，并将所述英文特征向量集作为所述DNN-LSTM模型的输入进行迭代训练，训练至最优英文声学模型；其中，预处理操作包括：采样、预加重、加窗分帧、端点检测，并将特征向量vector1作为DNN-LSTM模型输入进行迭代训练，训练至最优声学模型English_model，具体遍历步骤与中文语音数据训练的步骤相同，在此就不在赘述。English speech data training: Preprocess the collected English speech data set to obtain the English feature vector set vector1, and use the English feature vector set as the input of the DNN-LSTM model for iterative training to train to the optimal English acoustic model ; Among them, the preprocessing operation includes: sampling, pre-emphasis, windowing and framing, endpoint detection, and the feature vector vector1 is used as the input of the DNN-LSTM model for iterative training, training to the optimal acoustic model English_model, the specific traversal steps and Chinese speech The steps of data training are the same and will not be repeated here.

最后，进行模型测试，具体步骤为：Finally, perform model testing, the specific steps are:

利用信息熵公式分别计算中文输出概率向量集P和英文输出概率向量集P'对应的信息熵，分别对应记为H和H'；其中，P＝{p₁，p₂,...,p_q}，P'＝{p₁'，p'₂,...,p_t'}，q为中文声学模型输出分类的总数，t为英文声学模型输出分类的总数；Use the information entropy formula to calculate the information entropy corresponding to the Chinese output probability vector set P and the English output probability vector set P', which are respectively recorded as H and H'; where, P={p ₁ , p ₂ ,...,p _q }, P'={p ₁ ', p' ₂ ,..., p _t '}, q is the total number of Chinese acoustic model output classifications, and t is the total number of English acoustic model output classifications;

信息熵公式为：The information entropy formula is:

若从中文声学模型输出的概率向量集中存在p_i明显大于其他的概率值，从英文声学模型输出的概率向量集中的各个概率值相差不明显，其中，1≤i≤q；If there is a probability value in the probability vector set output from the Chinese acoustic model that p _i is significantly greater than other probability values, the difference between each probability value in the probability vector set output from the English acoustic model is not obvious, where 1≤i≤q;

本发明实施例中，概率值是否相差明显和softmax输出层的输出分类相关，输出分类越多，对应的范围越小，该范围设置为β，即概率值之间相差值大于等于β即为相差明显，若各个概率值相差的范围小于β即为相差不明显。在实验过程中，输出分类为5类时，该范围β大约为0.2，输出分类越多，该范围越小。In the embodiment of the present invention, whether the difference between the probability values is significantly related to the output classification of the softmax output layer, the more output classifications, the smaller the corresponding range, the range is set to β, that is, the difference between the probability values is greater than or equal to β is the difference Obviously, if the difference range of each probability value is less than β, the difference is not obvious. During the experiment, when the output classification is 5 categories, the range β is about 0.2, and the more output classifications, the smaller the range.

且若输入到中文声学模型对应的信息熵H比输入到英文声学模型对应的信息熵H'小，则对应的未知语种的输入语音信号voice0为中文，并将中文声学模型的输出概率作为最后输出结果；即：根据信息熵的性质，熵越大，系统的信息量越大，不确定越高，当p₁＝p₂＝，.....，＝p_q时取最大值，中文输出概率的信息熵要比英文输出概率的信息熵小得多，也即中文语音信号在中文声学模型中的匹配度更高，故将中文声学模型的输出概率作为最后输出结果。And if the information entropy H corresponding to the Chinese acoustic model is smaller than the information entropy H' corresponding to the English acoustic model, the corresponding unknown language input voice signal voice0 is Chinese, and the output probability of the Chinese acoustic model is used as the final output The result; that is: according to the nature of information entropy, the greater the entropy, the greater the amount of information in the system, and the higher the uncertainty, when p ₁ =p ₂ =,...,=p _q , the maximum value is taken, and the Chinese output The information entropy of the probability is much smaller than the information entropy of the English output probability, that is, the matching degree of the Chinese speech signal in the Chinese acoustic model is higher, so the output probability of the Chinese acoustic model is taken as the final output result.

若从英文声学模型输出的概率向量集中存在p'_j明显大于其他的概率值，从中文声学模型输出的概率向量集中的各个概率值相差不大，其中，1≤j≤t；If there is a probability value in the probability vector set output from the English acoustic model that p' _j is significantly greater than other probability values, the probability values in the probability vector set output from the Chinese acoustic model are not much different, where 1≤j≤t;

另一方面，本发明还提供一种面向家居口语环境的神经网络语音识别系统，该系统包括：On the other hand, the present invention also provides a kind of neural network speech recognition system facing home spoken language environment, and this system comprises:

对于系统/装置实施例而言，由于其基本相似于方法实施例，所以描述的比较简单，相关之处参见方法实施例的部分说明即可。As for the system/apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for related parts, please refer to part of the description of the method embodiments.

需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者一个操作与另一个实体或者另一个操作区分开来，而不一定要求或者暗示这些实体或者操作之间存在任何这种实际的关系或者顺序。It should be noted that in this article, relational terms such as first and second are only used to distinguish one entity or one operation from another entity or another operation, and do not necessarily require or imply that these entities Or any such actual relationship or order between operations.

本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全应用实施例、或结合应用和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely application embodiment, or an embodiment combining application and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and a combination of procedures and/or blocks in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions may be provided to a general purpose computer, special purpose computer, embedded processor, or processor of other programmable data processing equipment to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a An apparatus for realizing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instructions The device realizes the function specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, thereby The instructions provide steps for implementing the functions specified in the flow chart or blocks of the flowchart and/or the block or blocks of the block diagrams.

尽管已描述了本发明的优选实施例，但本领域内的技术人员一旦得知了基本创造性概念，则可对这些实施例作出另外的变更和修改。所以，所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。While preferred embodiments of the invention have been described, additional changes and modifications to these embodiments can be made by those skilled in the art once the basic inventive concept is appreciated. Therefore, it is intended that the appended claims be construed to cover the preferred embodiment as well as all changes and modifications which fall within the scope of the invention.

显然，本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样，倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内，则本发明也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention also intends to include these modifications and variations.

Claims

1. A neural network speech recognition method facing home oral environment, characterized in that the method comprises:

Model construction: add a long-term short-term memory network to the deep neural network to build a combined neural network DNN-LSTM model;

Model training:

Chinese voice data training: preprocessing the collected Chinese voice data set to obtain a Chinese feature vector set, and using the Chinese feature vector set as the input of the DNN-LSTM model for iterative training to train to an optimal Chinese acoustic model;

English speech data training: preprocessing the collected English speech data set to obtain the English feature vector set, and using the English feature vector set as the input of the DNN-LSTM model for iterative training to train to the optimal English acoustic model;

Model test:

An input speech signal voice0 of an unknown language is passed through the Chinese acoustic model and the English acoustic model respectively to obtain a Chinese output probability vector set and an English output probability vector set;

Perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output a judgment result;

Perform language matching according to the Chinese output probability vector set and the English output probability vector set, including:

Use the information entropy formula to calculate the information entropy corresponding to the Chinese output probability vector set P and the English output probability vector set P', which are respectively recorded as H and H'; where, P={p ₁ ,p ₂ ,...,p _q }, P'={p′ ₁ ,p′ ₂ ,...,p′ _t }, q is the total number of output categories of the Chinese acoustic model, and t is the total number of output categories of the English acoustic model;

If there are probability values in the probability vector set output from the Chinese acoustic model that p _i is significantly greater than other probability values, the probability values in the probability vector set output from the English acoustic model are not much different, where 1≤i≤q;

And if the information entropy H corresponding to the Chinese acoustic model is smaller than the information entropy H' corresponding to the English acoustic model, the corresponding unknown language input voice signal voice0 is Chinese, and the output probability of the Chinese acoustic model is used as the final output result;

If there is a probability value in the probability vector set output from the English acoustic model that p′ _j is significantly greater than other probability values, the probability values in the probability vector set output from the Chinese acoustic model are not much different, where 1≤j≤t;

And if the information entropy H' corresponding to the input to the English acoustic model is smaller than the information entropy H corresponding to the input to the Chinese acoustic model, the corresponding unknown language input voice signal voice0 is English, and the output probability of the English acoustic model is taken as the final output result.

2. the neural network speech recognition method facing home spoken language environment according to claim 1, is characterized in that, described combined neural network DNN-LSTM model comprises input layer, long short-term memory network, the second hidden layer, the third hidden layer layer, the fourth hidden layer and the output layer, and the long short-term memory network is used as the first hidden layer.

3. the neural network speech recognition method facing home spoken language environment according to claim 2, is characterized in that, the node number of the first hidden layer is 512, and its activation function selects sigmoid function and tanh function, and the second hidden layer The number of nodes in the first layer, the third hidden layer and the fourth hidden layer is 1024, and the activation function selects the sigmoid function.

4. the neural network speech recognition method facing home spoken language environment according to claim 3, is characterized in that, described Chinese feature vector set carries out iterative training as the input of described DNN-LSTM model, and training step comprises:

(1) Initialize the weight matrix W and bias vector b in the model structure as a random value;

(2) Start iterations from the 1st to the maximum number of times; in each iteration, start traversing from the first voice data training sample to the last training sample;

(3) During the training process of each training sample, input the corresponding feature vector to the input layer; traverse from the first hidden layer to the output layer, and use the forward propagation algorithm to indicate the corresponding layer being traversed, and then according to The loss function represents the output layer; after the forward propagation algorithm is completed, it starts to traverse from the fourth hidden layer to the first hidden layer, and uses the back propagation algorithm to represent the corresponding first hidden layer;

(4) After the backpropagation algorithm is completed, start traversing from the first hidden layer to the output layer, and update the weight matrix and bias vector W ⁿ , b ⁿ of the corresponding layer, n is the layer being traversed, n=1, 2, 3, 4, 5, so far the training for one sample in one iteration process is over; at this time, if the sample has not been traversed, continue to traverse the sample; if the sample has been traversed, proceed to the next training iteration;

(5) When the change values of all W and b do not exceed the iteration threshold, stop the iterative cycle;

(6) Save the optimal weight matrix W and bias vector b of each layer.

5. A system realized by the neural network speech recognition method facing the home spoken language environment according to any one of claims 1-4, characterized in that the system comprises:

A model building block for adding a long short-term memory network to a deep neural network to construct a combined neural network DNN-LSTM model;

Model training module, it comprises Chinese model training unit and English model training unit again, described Chinese model training unit is used for the Chinese speech data set preprocessing of collection, obtains Chinese feature vector set, and described Chinese feature vector set Carry out iterative training as the input of described DNN-LSTM model, train to optimal Chinese acoustic model;

The English model training unit is used to preprocess the collected English voice data set to obtain the English feature vector set, and use the English feature vector set as the input of the DNN-LSTM model for iterative training to achieve the best English acoustics Model;

Model test module, it comprises speech input unit and speech type judging unit again, and described speech input unit is used for the input speech signal voice0 of an unknown language, passes through described Chinese acoustic model and described English acoustic model respectively, obtains respectively The Chinese output probability vector set and the English output probability vector set; the speech type judging unit is used to perform language matching according to the Chinese output probability vector set and the English output probability vector set, and output the judgment result;