
CN104064196B - A method of improving speech recognition accuracy based on speech front-end noise elimination


Info

Publication number
CN104064196B
Authority
CN
China
Prior art keywords
speech
noise
channel
env
frame
Prior art date
Legal status
Expired - Fee Related
Application number
CN201410281240.3A
Other languages
Chinese (zh)
Other versions
CN104064196A (en)
Inventor
刘明
王明江
Current Assignee
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen
Priority to CN201410281240.3A
Publication of CN104064196A
Application granted
Publication of CN104064196B
Legal status: Expired - Fee Related
Anticipated expiration


Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention provides, for large-scale isolated-word speech recognition, a method that eliminates noise through speech front-end processing and thereby improves recognition accuracy. The method solves the problem of low recognition accuracy caused by speech endpoint detection errors during MFCC extraction when the speech contains noise. Computational auditory scene analysis (CASA) is used at the front end of speech recognition; compared with traditional denoising methods such as noise reduction and speech enhancement, it can effectively separate noise from noisy speech by simulating the auditory nervous system of the human ear. In recognition experiments on 10,240 noisy utterances, accuracy rose from 83% to 95.5% compared with recognition without front-end noise processing.

Description

A method of improving speech recognition accuracy based on speech front-end noise elimination

Technical Field

The invention relates to the field of isolated-word speech recognition, and in particular to a method for improving the accuracy of large-scale isolated-word speech recognition.

Background Art

The most widely studied and applied feature in speech recognition is the Mel-frequency cepstral coefficient (MFCC); low-frequency MFCC parameters offer high spectral resolution and are well suited to speech recognition. In current practice, Mel-scale cepstral parameters have largely replaced the cepstral parameters derived from linear predictive coding, because they take into account the characteristics of human speech production and hearing and therefore show better robustness in speech recognition.

However, recognition with MFCC parameters degrades in the presence of strong background noise. Because noise exists everywhere in nature, any human speech is mixed with noise, even in a seemingly quiet environment. In the time domain, background noise is superimposed on the speech waveform; consequently, during speech endpoint detection, waveform segments with strong noise and weak speech are inevitably treated as useful speech frames, and the MFCC features extracted from them are unreliable or even unusable.

The human auditory system can distinguish and track a speech signal of interest in a noisy environment, "hearing" the desired content even when multiple sounds are present simultaneously. Auditory scene analysis (ASA) is a theory built on this physiological phenomenon of hearing. Computational auditory scene analysis (CASA) simulates the neural auditory system of the human ear, so its processing of speech signals is closer to human auditory perception of mixed sounds. It can therefore be used to separate noise from the speech signal and obtain a relatively clean speech signal; in effect, a front-end processing stage is added to the speech recognition pipeline, improving the accuracy of noisy speech recognition. The key to speech enhancement with CASA is selecting suitable features to separate the target speech from the background noise; usable features include spectral energy, fundamental frequency (pitch), and cross-channel correlation.

Summary of the Invention

To solve the problems in the prior art, the present invention proposes a method for improving the accuracy of large-scale isolated-word speech recognition through speech front-end noise elimination, solving the problem of low recognition accuracy caused by speech endpoint detection errors during MFCC extraction in the presence of noise.

The present invention is realized through the following technical solution:

A method for improving speech recognition accuracy based on speech front-end noise elimination, characterized in that the method uses computational auditory scene analysis (CASA) to eliminate noise at the speech recognition front end, and comprises the following steps:

A. The noisy speech, sampled at 16 kHz, is first passed through a 32-channel Gammatone filterbank with center frequencies from 50 Hz to 8 kHz; a rectangular window with a time resolution of 20 ms is then applied to the filtered signal, at a frame rate of 100 Hz;
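
As an illustration only (not the patent's implementation), the following Python sketch realizes step A with a 4th-order time-domain Gammatone filterbank and rectangular framing; the ERB-scale channel spacing, the 64 ms impulse-response length, and the per-channel gain normalization are assumptions the patent does not specify.

```python
import numpy as np

FS = 16000                     # sampling rate (Hz)
FRAME_LEN = FS * 20 // 1000    # 20 ms window -> 320 samples
HOP = FS // 100                # 100 Hz frame rate -> 160-sample hop

def gammatone_filterbank(x, n_ch=32, f_lo=50.0, f_hi=8000.0, fs=FS):
    """Pass x through n_ch 4th-order Gammatone filters with ERB-spaced centers."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)           # ERB bandwidth
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)    # ERB-rate scale
    inv_erb_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    cfs = inv_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_ch))
    t = np.arange(int(0.064 * fs)) / fs                        # 64 ms impulse response
    out = np.empty((n_ch, len(x)))
    for i, fc in enumerate(cfs):
        b = 1.019 * erb(fc)
        h = t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        out[i] = np.convolve(x, h / np.abs(h).max())[:len(x)]
    return out, cfs

def frame_channel(ch, frame_len=FRAME_LEN, hop=HOP):
    """Cut one filtered channel into overlapping rectangular-window frames."""
    n_frames = 1 + (len(ch) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return ch[idx]    # shape (n_frames, frame_len)
```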

B. Compute the noise envelope and the speech envelope of the auditory spectrum for the i-th frequency channel and the j-th frame:

$$\mathrm{env}_L(i,j)=\left|\sum_{n=0}^{N-1}x_L^{i,j}(n)\right|,\qquad \mathrm{env}_R(i,j)=\left|\sum_{n=0}^{N-1}x_R^{i,j}(n)\right|,$$

where i and j denote the i-th frequency channel and the j-th frame, respectively; N is the number of samples in a frame; x is the time-domain amplitude of the signal; and the subscripts L and R denote the two different channels;
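
A minimal sketch of step B under the same caveat: for every (channel, frame) pair the envelope is the absolute value of the frame's sample sum, computed here for both input channels at once.

```python
import numpy as np

def envelopes(frames_l, frames_r):
    """Step B: env(i, j) = |sum_n x^{i,j}(n)|.
    frames_l, frames_r: arrays of shape (n_ch, n_frames, N) for the two channels."""
    env_l = np.abs(frames_l.sum(axis=-1))   # shape (n_ch, n_frames)
    env_r = np.abs(frames_r.sum(axis=-1))
    return env_l, env_r
```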

C. Compute the cross-correlation function between the noise channel and the speech channel:

$$CC_{i,j}(\tau)=\frac{\frac{1}{N}\sum_{n=0}^{N-1}\left|x_S^{i,j}(n)\,x_N^{i,j}(n-\tau)\right|}{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}\left|x_S^{i,j}(n)\right|^{2}}\,\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}\left|x_N^{i,j}(n-\tau)\right|^{2}}},$$

where τ is the characteristic time lag between speech and noise; τ ranges from -16 to 16, corresponding to a relative time range of -1 ms to 1 ms at a 16 kHz sampling rate;
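
A sketch of the normalized cross-correlation of step C; the circular shift used here to realize the lag τ is an assumption of this sketch, since the patent does not say how frame boundaries are handled.

```python
import numpy as np

def cross_correlation(x_s, x_n, max_lag=16):
    """Step C: normalized cross-correlation CC_{i,j}(tau) between one
    speech-channel frame x_s and one noise-channel frame x_n,
    for tau in [-max_lag, max_lag] (i.e., ±1 ms at 16 kHz)."""
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    norm_s = np.sqrt(np.mean(np.abs(x_s) ** 2))
    for k, tau in enumerate(lags):
        x_n_tau = np.roll(x_n, tau)              # circular shift: an assumption
        norm_n = np.sqrt(np.mean(np.abs(x_n_tau) ** 2))
        cc[k] = np.mean(np.abs(x_s * x_n_tau)) / (norm_s * norm_n + 1e-12)
    return lags, cc
```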

D. Compute the ITD (interaural time difference) and ILD (interaural level difference) of the noise channel and the speech channel from the cross-correlation function:

$$\mathrm{ITD}(i,j)=\arg\max_{\tau}CC_{i,j}(\tau),\qquad \mathrm{ILD}(i,j)=20\log_{10}\!\left[\frac{\mathrm{env}_L(i,j)}{\mathrm{env}_R(i,j)}\right];$$
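
Step D reads the ITD off the cross-correlation peak and the ILD off the envelope ratio; in the sketch below, the small eps guarding against division by zero is an addition of this sketch, not part of the patent.

```python
import numpy as np

def itd_ild(lags, cc, env_l, env_r, eps=1e-12):
    """Step D: ITD(i,j) = argmax_tau CC_{i,j}(tau);
    ILD(i,j) = 20*log10(env_L / env_R), in dB."""
    itd = lags[np.argmax(cc)]
    ild = 20.0 * np.log10((env_l + eps) / (env_r + eps))
    return itd, ild
```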

E. Sum the cross-correlation functions over all frames and all frequency channels and take the extremum of the sum, which is the characteristic time lag τ between speech and noise:

$$\tau=\arg\max_{\tau}\sum_{i,j}CC_{i,j}(\tau);$$

then determine which channel carries the speech signal: when τ is negative, the first (L) channel signal is pure speech; otherwise, the second (R) channel signal is pure speech;
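
A sketch of step E, assuming the per-unit cross-correlations have been stacked into a single array:

```python
import numpy as np

def characteristic_delay(cc_all, lags):
    """Step E: sum CC_{i,j}(tau) over all channels i and frames j, then take
    the argmax. cc_all has shape (n_ch, n_frames, n_lags)."""
    tau_max = lags[np.argmax(cc_all.sum(axis=(0, 1)))]
    speech_on_first = tau_max < 0   # negative tau -> first (L) channel is speech
    return tau_max, speech_on_first
```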

F. Use a simple 3-state, one-way state-transition HMM to compute the mask m(i,j) for the signal in the i-th frequency channel and the j-th frame; the mask information is used to estimate the speech envelope, where

$$m(i,j)=\frac{1}{1+\exp\{[\mathrm{ITD}(i,j)-0.5][\mathrm{ILD}(i,j)-0.5]\}};$$

combining this mask with the envelopes from step B gives the envelope spectrum of the noise-separated speech:

$$\mathrm{env}_M(i,j)=\begin{cases}\mathrm{env}_L(i,j)\cdot m(i,j), & \tau_{\max}<0\\ \mathrm{env}_R(i,j)\cdot m(i,j), & \tau_{\max}\ge 0;\end{cases}$$
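
A sketch of step F's mask and the masked envelope; note it implements only the closed-form logistic expression given above, while the patent's 3-state HMM smoothing of the mask is deliberately left out.

```python
import numpy as np

def tf_mask(itd, ild):
    """Step F: m(i,j) = 1 / (1 + exp{[ITD(i,j)-0.5][ILD(i,j)-0.5]}).
    itd and ild are arrays of shape (n_ch, n_frames)."""
    return 1.0 / (1.0 + np.exp((itd - 0.5) * (ild - 0.5)))

def masked_envelope(env_l, env_r, mask, tau_max):
    """Apply the mask to whichever channel step E identified as speech."""
    return (env_l if tau_max < 0 else env_r) * mask
```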

G. By taking the logarithmic energy, extract a 12-dimensional spectral coefficient vector for each speech frame; the resulting coefficient vector can be used directly as the feature parameter for speech recognition:

$$c(j,k)=\sum_{i=1}^{I}\ln\!\left[\mathrm{env}_M(i,j)\right]\cos\!\left[\frac{k\pi}{I}(i-0.5)\right],$$

where I is the number of Gammatone filters, taking the value 32, and j and k denote the j-th frame and the k-th spectral coefficient within it, respectively.
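
Finally, a sketch of step G, which is in effect a DCT of the log masked envelopes across the 32 channels; the eps guard against log(0) is an addition of this sketch.

```python
import numpy as np

def spectral_coefficients(env_m, n_coef=12, eps=1e-12):
    """Step G: c(j,k) = sum_{i=1}^{I} ln(env_M(i,j)) * cos(k*pi/I * (i-0.5)),
    giving a 12-dimensional feature vector per frame.
    env_m has shape (I, n_frames) with I = 32 Gammatone channels."""
    I = env_m.shape[0]
    i = np.arange(1, I + 1)                  # channel index i = 1..I
    log_env = np.log(env_m + eps)            # (I, n_frames)
    basis = np.cos(np.outer(np.arange(1, n_coef + 1), np.pi / I * (i - 0.5)))
    return (basis @ log_env).T               # (n_frames, n_coef)
```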

The beneficial effects of the present invention are as follows: the invention provides, for large-scale isolated-word speech recognition, a speech front-end processing method that eliminates noise and thereby improves recognition accuracy. It solves the problem of low recognition accuracy caused by speech endpoint detection errors during MFCC extraction in the presence of noise. Experimental results show that, at the cost of a moderate increase in computation, the algorithm effectively improves the accuracy of large-scale isolated-word speech recognition in noisy environments.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the speech front-end noise elimination process of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawing and specific embodiments.

The working principle of the present invention is as follows: the input noisy speech signal can be modeled as two channels carrying pure speech and pure noise, respectively. CASA therefore mimics the role of the human ears, using the interaural time difference (ITD) and interaural level difference (ILD) of the signals arriving at the two channels to determine the sound source, that is, to focus attention on the pure speech signal. CASA uses the ITD and ILD to estimate the mask of each time-frequency unit (T-F unit) in the time-frequency domain; the T-F mask indicates which T-F regions are noise and which are speech. Finally, the T-F regions containing speech information are resynthesized to recover the "clean" speech.

As shown in Fig. 1, the method of the present invention for improving speech recognition accuracy based on speech front-end noise elimination uses computational auditory scene analysis (CASA) to eliminate noise at the speech recognition front end, and comprises the following steps:

Steps A to G of the method are carried out exactly as set out in the Summary of the Invention above, with the same formulas and parameter values.

The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the present invention should not be considered limited to these descriptions. For a person of ordinary skill in the art to which the present invention belongs, several simple deductions or substitutions may be made without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims (1)

1. A method for improving speech recognition accuracy based on speech front-end noise elimination, characterized in that the method uses computational auditory scene analysis (CASA) to eliminate noise at the speech recognition front end, and comprises the following steps:

A. The noisy speech, sampled at 16 kHz, is first passed through a 32-channel Gammatone filterbank with center frequencies from 50 Hz to 8 kHz; a rectangular window with a time resolution of 20 ms is applied to the filtered signal, at a frame rate of 100 Hz;

B. Compute the noise envelope and the speech envelope of the auditory spectrum for the i-th frequency channel and the j-th frame:

$$\mathrm{env}_L(i,j)=\left|\sum_{n=0}^{N-1}x_L^{i,j}(n)\right|,\qquad \mathrm{env}_R(i,j)=\left|\sum_{n=0}^{N-1}x_R^{i,j}(n)\right|,$$

where i and j denote the i-th frequency channel and the j-th frame, respectively; N is the number of samples in a frame; x is the time-domain amplitude of the signal; and the subscripts L and R denote the two different channels;

C. Compute the cross-correlation function between the noise channel and the speech channel:

$$CC_{i,j}(\tau)=\frac{\frac{1}{N}\sum_{n=0}^{N-1}\left|x_S^{i,j}(n)\,x_N^{i,j}(n-\tau)\right|}{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}\left|x_S^{i,j}(n)\right|^{2}}\,\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}\left|x_N^{i,j}(n-\tau)\right|^{2}}},$$

where τ is the characteristic time lag between speech and noise, ranging from -16 to 16, corresponding to a relative time range of -1 ms to 1 ms at a 16 kHz sampling rate;

D. Compute the ITD and ILD of the noise channel and the speech channel from the cross-correlation function:

$$\mathrm{ITD}(i,j)=\arg\max_{\tau}CC_{i,j}(\tau),\qquad \mathrm{ILD}(i,j)=20\log_{10}\!\left[\frac{\mathrm{env}_L(i,j)}{\mathrm{env}_R(i,j)}\right];$$

E. Sum the cross-correlation functions over all frames and all frequency channels and take the extremum of the sum, which is the characteristic time lag τ between speech and noise:

$$\tau=\arg\max_{\tau}\sum_{i,j}CC_{i,j}(\tau);$$

determine which channel carries the speech signal: when τ is negative, the L-channel signal is pure speech; otherwise, the R-channel signal is pure speech;

F. Use a simple 3-state, one-way state-transition HMM to compute the mask m(i,j) of the signal in the i-th frequency channel and the j-th frame; the mask information is used to estimate the speech envelope, where

$$m(i,j)=\frac{1}{1+\exp\{[\mathrm{ITD}(i,j)-0.5][\mathrm{ILD}(i,j)-0.5]\}};$$

combining this mask with the envelopes from step B gives the envelope spectrum of the noise-separated speech:

$$\mathrm{env}_M(i,j)=\begin{cases}\mathrm{env}_L(i,j)\cdot m(i,j), & \tau_{\max}<0\\ \mathrm{env}_R(i,j)\cdot m(i,j), & \tau_{\max}\ge 0;\end{cases}$$

G. By taking the logarithmic energy, extract a 12-dimensional spectral coefficient vector for each speech frame; the resulting spectral coefficient vector is used directly as the feature parameter for speech recognition:

$$c(j,k)=\sum_{i=1}^{I}\ln\!\left[\mathrm{env}_M(i,j)\right]\cos\!\left[\frac{k\pi}{I}(i-0.5)\right],$$

where I is the number of Gammatone filters, taking the value 32, and j and k denote the j-th frame and the k-th spectral coefficient within it, respectively.
CN201410281240.3A, filed 2014-06-20: A method of improving speech recognition accuracy based on speech front-end noise elimination; granted as CN104064196B; Expired - Fee Related.

Priority Applications (1)

Application number CN201410281240.3A (granted as CN104064196B); priority date 2014-06-20; filing date 2014-06-20; title: A method of improving speech recognition accuracy based on speech front-end noise elimination


Publications (2)

Publication Number Publication Date
CN104064196A CN104064196A (en) 2014-09-24
CN104064196B (en) 2017-08-01

Family

ID=51551874


Country Status (1)

Country Link
CN (1) CN104064196B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305614A (en) * 2017-01-11 2018-07-20 中兴通讯股份有限公司 A kind of method of speech processing and device
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model
CN109288649B (en) * 2018-10-19 2020-07-31 奥弗锐(福建)电子科技有限公司 Intelligent voice control massage chair
CN110070863A (en) * 2019-03-11 2019-07-30 华为技术有限公司 A kind of sound control method and device
CN111796790B (en) * 2019-04-09 2023-09-08 深圳市冠旭电子股份有限公司 Sound effect adjusting method and device, readable storage medium and terminal equipment
CN110191387A (en) * 2019-05-31 2019-08-30 深圳市荣盛智能装备有限公司 Automatic starting control method, device, electronic equipment and the storage medium of earphone
CN111523389A (en) * 2020-03-25 2020-08-11 中国平安人寿保险股份有限公司 Intelligent emotion recognition method and device, electronic equipment and storage medium
CN115273880A (en) * 2022-07-21 2022-11-01 百果园技术(新加坡)有限公司 Voice noise reduction method, model training method, device, equipment, medium and product

Citations (2)

Publication number Priority date Publication date Assignee Title
CN102157156A (en) * 2011-03-21 2011-08-17 清华大学 Single-channel voice enhancement method and system
CN103456312A (en) * 2013-08-29 2013-12-18 太原理工大学 Single channel voice blind separation method based on computational auditory scene analysis

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
TWI412023B (en) * 2010-12-14 2013-10-11 Univ Nat Chiao Tung A microphone array structure and method for noise reduction and enhancing speech


Non-Patent Citations (4)

Title
An Auditory Scene Analysis Approach to Monaural Speech Segregation; G. N. Hu et al.; Signals and Communication Technology; 2006-12-31; pp. 485-515 *
Research on monaural speech separation based on computational auditory scene analysis; Zhao Liheng; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2013-01-15, No. 01; pp. 1-82 *
An improved speech enhancement algorithm based on computational auditory scene analysis; Wang Yu et al.; Journal of East China University of Science and Technology (Natural Science Edition); 2012-10, Vol. 38, No. 5; pp. 617-621 *
A blind speech separation method based on computational auditory scene analysis; Wang Weihua et al.; Journal of Harbin Engineering University; 2008-04, Vol. 29, No. 4; pp. 395-399 *

Also Published As

Publication number Publication date
CN104064196A (en) 2014-09-24

Similar Documents

Publication Publication Date Title
CN104064196B (en) A method of improving speech recognition accuracy based on speech front-end noise elimination
Das et al. Fundamentals, present and future perspectives of speech enhancement
CN103236260B (en) Speech recognition system
CN107886967B (en) Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network
CN102054480B (en) A Monophonic Aliasing Speech Separation Method Based on Fractional Fourier Transform
Han et al. Deep neural network based spectral feature mapping for robust speech recognition.
CN102969000B (en) Multi-channel speech enhancement method
CN103854662A (en) Self-adaptation voice detection method based on multi-domain joint estimation
CN104464728A (en) Speech enhancement method based on Gaussian mixture model (GMM) noise estimation
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN106653004B (en) Speaker identification feature extraction method for sensing speech spectrum regularization cochlear filter coefficient
CN103971697B (en) Sound enhancement method based on non-local mean filtering
Chang et al. Spectro-temporal features for noise-robust speech recognition using power-law nonlinearity and power-bias subtraction
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
CN107895582A (en) A speaker-adaptive speech emotion recognition method for multi-source information domain
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
CN104064197B (en) Method for improving speech recognition robustness on basis of dynamic information among speech frames
Roy et al. On supervised LPC estimation training targets for augmented Kalman filter-based speech enhancement
Tomchuk Spectral masking in MFCC calculation for noisy speech
Chougule et al. Channel robust MFCCs for continuous speech speaker recognition
Hepsiba et al. Computational intelligence for speech enhancement using deep neural network
Bawa et al. Spectral-warping based noise-robust enhanced children ASR system
Thomsen et al. Speech enhancement and noise-robust automatic speech recognition
Soni et al. Comparing front-end enhancement techniques and multiconditioned training for robust automatic speech recognition
Guo et al. Segmented Time-Frequency Masking Algorithm for Speech Separation Based on Deep Neural Networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2017-08-01
Termination date: 2021-06-20
Termination date: 20210620