CN102982801B

CN102982801B - Phonetic feature extracting method for robust voice recognition

Info

Publication number: CN102982801B
Application number: CN201210449436.XA
Authority: CN
Inventors: 徐波; 范利春; 柯登峰; 孟猛
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2012-11-12
Filing date: 2012-11-12
Publication date: 2014-12-10
Anticipated expiration: 2032-11-12
Also published as: CN102982801A

Abstract

The invention discloses a speech feature extraction method for robust speech recognition. The method comprises: obtaining a power spectrum; processing the power spectrum by using a filter bank; obtaining a medium-time power spectrum by means of frame averaging; The power spectrum is processed by asymmetric filtering, and the power spectrum is masked at the same time to obtain the pure speech power spectrum; the ratio of the pure speech and noisy speech power spectrum is subjected to channel average processing for smoothing; the smoothed pure speech and the banded speech The power spectrum ratio of noisy speech is multiplied by the power spectrum output by the filter bank to obtain the short-term power spectrum of pure speech; energy normalization is performed on the short-term power spectrum to eliminate multiplicative noise; the power spectrum is increased by equal loudness ; Exponential operation of the power spectrum; inverse Fourier transform of the power spectrum; calculation of the cepstral coefficient of the signal; mean normalization of the cepstral coefficient. The feature of the speech signal extracted by the present invention is fast and can be processed online; the acoustic model trained by using the feature extracted by the present invention has a good anti-noise effect; the present invention has very important application significance.

Description

A Speech Feature Extraction Method for Robust Speech Recognition

技术领域technical field

本发明涉及语音识别领域，尤其涉及一种在语音识别中，能够对平稳和非平稳噪声有明显抑制作用的语音特征提取算法方法。The invention relates to the field of speech recognition, in particular to a speech feature extraction algorithm method capable of significantly suppressing stationary and non-stationary noises in speech recognition.

背景技术Background technique

语音识别系统在复杂环境下识别性能急剧降低是语音识别中最重要的问题之一。例如在马路上手机语音查询地理位置，用户所处的声学环境非常复杂且变化迅速，这对语音识别系统的性能产生了极大的影响。原有的语音识别系统在没有噪声的环境下能取得很好的处理和识别效果，但在现实应用中识别系统的性能会由于时变的不可预测的环境噪音和信道的影响，说话人的差异，谈话内容的变化等因素影响严重下降。所以如何提高语音识别系统在训练和测试环境的不匹配条件下的鲁棒性成为语音识别技术的关键。It is one of the most important problems in speech recognition that the recognition performance of the speech recognition system degrades sharply in complex environments. For example, when querying geographical location by voice on the mobile phone on the road, the acoustic environment in which the user is located is very complex and changes rapidly, which has a great impact on the performance of the voice recognition system. The original speech recognition system can achieve good processing and recognition results in a noise-free environment, but in practical applications, the performance of the recognition system will be affected by time-varying unpredictable environmental noise and channels, and speaker differences , The impact of factors such as changes in the content of the conversation has declined severely. So how to improve the robustness of the speech recognition system under the condition of mismatch between training and testing environment becomes the key of speech recognition technology.

近年来，在语音识别技术环境鲁棒性这一研究领域人们提出了很多改进技术和算法，并取得了一定的效果。根据语音识别的流程，鲁棒语音识别可以分为四类：时频域的抗噪；特征域的噪声补偿；模型域的噪声自适应和解码域的自适应。最早的技术是时频域的抗噪，例如谱减和维纳滤波，还有经典的欧洲电信标准协会的两阶段维纳滤波。特征层面噪声抑制通常是在提取特征的过程中对噪声进行补偿。由于PLP和MFCC特征一直占据鳌头，所以特征层面的噪声抑制大多是在这两种特征上进行的，例如向量泰勒级数等。第三个阶段是在模型方面对噪声进行自适应，包括多状态的语音模型、共享变量参数的HMM等。第四个层面是在解码层面的噪声自适应，包括不确定性解码和用子带重估来代替不确定性解码等。In recent years, people have proposed many improved technologies and algorithms in the research field of environmental robustness of speech recognition technology, and achieved certain results. According to the process of speech recognition, robust speech recognition can be divided into four categories: anti-noise in the time-frequency domain; noise compensation in the feature domain; noise adaptation in the model domain and adaptation in the decoding domain. The earliest technology is anti-noise in the time-frequency domain, such as spectral subtraction and Wiener filtering, and the classic two-stage Wiener filtering of the European Telecommunications Standards Institute. Feature-level noise suppression usually compensates for noise during feature extraction. Since the PLP and MFCC features have always been at the top, the noise suppression at the feature level is mostly performed on these two features, such as vector Taylor series. The third stage is to adapt the noise in terms of models, including multi-state speech models, HMMs with shared variable parameters, etc. The fourth level is noise adaptation at the decoding level, including uncertainty decoding and sub-band reestimation instead of uncertainty decoding.

所有的这些方法从根本上来说都是寻求在某种准则下训练环境和测试环境之间不匹配的一种最佳补偿。在一系列假设前提条件，如加性噪音的高斯分布、噪音与语音信号的独立性、不同噪音之间的独立性、信道的渐变特性等等，这些方法对于语音识别技术的鲁棒性都作出了有益的探索和贡献，尤其在平稳噪声环境下有较好的噪声抑制效果。但这与在真实噪音环境下语音识别系统的应用要求还有很大的差距，因此对于更加复杂的环境，比如突发噪声等环境却无能为力。All of these methods are fundamentally seeking an optimal compensation for the mismatch between the training environment and the test environment under some criterion. Under a series of assumptions, such as Gaussian distribution of additive noise, independence of noise and speech signal, independence between different noises, gradient characteristics of channels, etc., these methods are made for the robustness of speech recognition technology. It has made useful exploration and contribution, especially in the stable noise environment, it has a better noise suppression effect. But there is still a big gap between this and the application requirements of the speech recognition system in the real noise environment, so it is powerless for more complex environments, such as sudden noise and other environments.

发明内容Contents of the invention

(一)要解决的技术问题(1) Technical problems to be solved

为了解决上述的在复杂环境下的语音识别率低，而普通的特征提取方法对非平稳噪声的抑制能力不够强的缺点，本发明提出一种能够提高其识别率的特征提取方法，目的在于提高带有突发噪声和音乐噪声等加性噪声影响的语音的识别率，并且使纯净环境下的语音识别率不下降。In order to solve the above-mentioned shortcoming that the speech recognition rate in a complex environment is low, and the common feature extraction method is not strong enough to suppress non-stationary noise, the present invention proposes a feature extraction method that can improve its recognition rate, and the purpose is to improve The recognition rate of speech affected by additive noise such as burst noise and musical noise, and the speech recognition rate in a pure environment will not decrease.

(二)技术方案(2) Technical solution

本发明所基于的一种用于鲁棒语音识别的语音特征提取方法，包括以下步骤来实现：A kind of speech feature extraction method for robust speech recognition based on the present invention, comprises the following steps to realize:

步骤1、获取语音信号的功率谱；Step 1, obtaining the power spectrum of the speech signal;

步骤2、将所获得的功率谱通过滤波器组处理，获得含噪语音的短时功率谱；Step 2, processing the obtained power spectrum through a filter bank to obtain a short-term power spectrum of noisy speech;

步骤3、根据所获得的含噪语音的短时功率谱，采用帧平均的方式求取含噪语音的中等时长功率谱；Step 3, according to the obtained short-term power spectrum of the noisy speech, adopt the frame average mode to obtain the medium-duration power spectrum of the noisy speech;

步骤4、对所获得的含噪语音的中等时长功率谱进行不对称滤波和掩蔽抗噪，以得到纯净语音的中等时长功率谱；Step 4, performing asymmetric filtering and masking anti-noise on the obtained medium-duration power spectrum of the noisy speech to obtain the medium-duration power spectrum of the pure speech;

步骤5、根据所述纯净语音的中等时长功率谱、含噪语音的中等时长功率谱和含噪语音的短时功率谱获取纯净语音的短时功率谱；Step 5, obtaining the short-term power spectrum of the pure speech according to the medium-duration power spectrum of the pure speech, the medium-duration power spectrum of the noisy speech, and the short-term power spectrum of the noisy speech;

步骤6、对纯净语音的短时功率谱进行能量归一化处理，以消除乘性噪音；Step 6, performing energy normalization processing on the short-term power spectrum of the pure speech to eliminate multiplicative noise;

步骤7、对消除了乘性噪音的纯净语音的短时功率谱进行等响度加重；Step 7, carrying out equal loudness emphasis on the short-term power spectrum of the pure speech that has eliminated the multiplicative noise;

步骤8、对等响度加重后的纯净语音的短时功率谱进行指数非线性操作；Step 8, performing an exponential nonlinear operation on the short-term power spectrum of the pure speech after equal loudness emphasis;

步骤9、对进行了指数非线性操作后的纯净语音的短时功率谱进行傅立叶逆变换，以求取倒谱系数，对倒谱系数进行均值归一化处理，最终得到语音特征。Step 9. Inverse Fourier transform is performed on the short-term power spectrum of the pure speech after the exponential nonlinear operation to obtain cepstral coefficients, and the cepstral coefficients are subjected to mean value normalization processing to finally obtain speech features.

本发明从传统的语音特征提取方法入手，针对传统语音特征抗噪能力弱的缺点，提出了若干手段改进语音特征，最终形成一套新的语音特征提取方法。本发明针对噪声变化比语音慢的特点，利用帧平均的方式将短时功率谱转换为中等时长的功率谱，用于估计噪声；利用不对称滤波的方式，分别估计含噪语音中噪声和语音的频谱包络；在不对称滤波的基础上采用掩蔽的方式估计信噪比，并对其进行处理，将其转换为短时功率谱的信噪比进行抗噪；还通过能量归一化和指数非线性对功率谱进行处理。本发明提出的用于鲁棒语音识别的语音特征提取方法不仅能够对噪声进行更加准确的估计，也能使语音特征更加符合人耳的听觉特性。因此这种特征提取方法所求取的特征对噪声有很好的抑制作用。The present invention starts with the traditional speech feature extraction method, and aims at the disadvantage of weak anti-noise ability of the traditional speech feature, proposes several means to improve the speech feature, and finally forms a set of new speech feature extraction methods. Aiming at the characteristic that noise changes slower than speech, the present invention converts short-time power spectrum into medium-time power spectrum by means of frame average for estimating noise; uses asymmetric filtering to estimate noise and speech in noisy speech respectively. The spectral envelope of the spectrum; on the basis of asymmetric filtering, the signal-to-noise ratio is estimated by masking, and it is processed, and it is converted into the signal-to-noise ratio of the short-term power spectrum for anti-noise; also by energy normalization and Exponential nonlinearity processes the power spectrum. The speech feature extraction method for robust speech recognition proposed by the present invention can not only estimate the noise more accurately, but also make the speech features more in line with the auditory characteristics of human ears. Therefore, the features obtained by this feature extraction method have a good suppression effect on noise.

(三)有益效果(3) Beneficial effects

本发明从传统的语音特征提取方法入手，在传统的语音特征提取方法中加入了抗噪处理和符合人耳听觉的变换处理，使得这种特征提取方法不仅能够拟制各种加性噪声，并且在纯净环境下的识别率也高于传统的语音特征提取方法。The present invention starts with the traditional speech feature extraction method, and adds anti-noise processing and transformation processing in line with human hearing in the traditional speech feature extraction method, so that this feature extraction method can not only simulate various additive noises, but also The recognition rate in a pure environment is also higher than that of traditional speech feature extraction methods.

附图说明Description of drawings

图1为本发明用于鲁棒语音识别的语音特征提取方法的总流程框图；Fig. 1 is the general flowchart of the speech feature extraction method that the present invention is used for robust speech recognition;

图2为包含掩蔽的不对称低通滤波抗噪模块的结构流程图；Fig. 2 is the structure flowchart that comprises the asymmetric low-pass filtering anti-noise module of masking;

图3为图2中掩蔽模块的结构流程图。FIG. 3 is a structural flowchart of the masking module in FIG. 2 .

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚明白，以下结合具体实施例，并参照附图，对本发明进一步详细说明。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

图1为本发明用于鲁棒语音识别的语音特征提取方法的总流程框图。如图1所示，本发明提出的一种用于鲁棒语音识别的语音特征提取方法主要由以下几个流程组成：对语音信号进行预加重；对语音加窗并采用短时傅立叶变换求取语音频谱；对语音频谱平方，求取功率谱；采用滤波器组对功率谱进行处理，以获得含噪语音的短时功率谱；采用帧平均的方式求取含噪语音的中等时长功率谱；对所求取的含噪语音的中等时长功率谱进行不对称低通滤波处理，跟踪语音中的噪声，同时对所求取的中等时长功率谱进行掩蔽处理，得到纯净语音的短时功率谱；对纯净语音和带噪语音的功率谱的比值进行通道平均处理，以进行平滑；将平滑后的纯净语音和带噪语音的功率谱比值同滤波器组输出的含噪语音的短时功率谱相乘，得到纯净语音的短时功率谱；对纯净语音的短时功率谱进行能量归一化处理，以消除乘性噪声；将归一化后的短时功率谱进行等响度加重，使其符合人耳听觉效应；然后将等响度加重后的功率谱利用指数操作进行强度响度转换，使其符合人的生理特征；之后对经过强度响度转换后的功率谱进行傅立叶逆变换；再根据傅立叶逆变换得到的结果求取倒谱系数；最后对求取的倒谱系数进行均值归一化处理，最终得到本发明方法的语音特征。以下对发明的各步骤具体阐述。FIG. 1 is a block diagram of the general flow of the speech feature extraction method for robust speech recognition in the present invention. As shown in Figure 1, a speech feature extraction method for robust speech recognition proposed by the present invention is mainly composed of the following processes: pre-emphasizing the speech signal; windowing the speech and using short-time Fourier transform to obtain Speech spectrum; square the speech spectrum to obtain the power spectrum; use the filter bank to process the power spectrum to obtain the short-term power spectrum of the noisy speech; use frame averaging to obtain the medium-duration power spectrum of the noisy speech; Perform asymmetric low-pass filtering on the medium-duration power spectrum of the noisy speech to track the noise in the speech, and at the same time mask the obtained medium-duration power spectrum to obtain the short-term power spectrum of pure speech; Perform channel average processing on the ratio of the power spectrum of the pure speech and the noisy speech for smoothing; compare the power spectrum ratio of the smoothed pure speech and the noisy speech with the short-term power spectrum of the noisy speech output by the filter bank Multiply the short-term power spectrum of pure speech to obtain the short-term power spectrum of pure speech; perform energy normalization processing on the short-term power spectrum of pure speech to eliminate multiplicative noise; perform equal loudness emphasis on the normalized short-term power spectrum to make it conform to Human auditory effect; then the power spectrum after equal loudness aggravation is converted into intensity and loudness by exponential operation, so that it conforms to human physiological characteristics; after that, inverse Fourier transform is performed on the power spectrum after intensity and loudness conversion; then according to the inverse Fourier transform The cepstral coefficients are calculated from the obtained results; finally, the mean value normalization process is performed on the obtained cepstral coefficients, and finally the speech features of the method of the present invention are obtained. Each step of the invention is described in detail below.

一、对语音信号进行预加重1. Pre-emphasize the voice signal

预加重的目的是削弱低频干扰的影响，突出高频信号的主成份。通常使用如下公式对语音采样点进行预加重：The purpose of pre-emphasis is to weaken the influence of low-frequency interference and highlight the main components of high-frequency signals. Usually, the following formula is used to pre-emphasize the speech sampling points:

${y the y}_{t t} = = \{\begin{matrix} {x x}_{t t} - - α α \cdot \cdot {x x}_{t t - - 11} & ift if > > 00 \\ ((11 - - α α)) \cdot \cdot {x x}_{t t} & ift if = = 00 \end{matrix} - - - - - - ((11))$

其中α被称为预加重系数，x为语音采样点，y是经过预加重后的语音采样点值，t为采样点的索引。Among them, α is called the pre-emphasis coefficient, x is the voice sampling point, y is the value of the voice sampling point after pre-emphasis, and t is the index of the sampling point.

二、对所述预加重后的语音信号加窗并采用短时傅立叶变换求取语音频谱语音信号是一个连续的时变信号，为了对语音进行分析研究，通常截取一段语音，认为语音在这一段内是稳态信号，并将这一段语音叫做一帧。为了减少截断效应，通常给这段语音乘一个窗，常见的有汉宁窗和海明窗。将加窗后的一帧语音信号进行短时傅立叶变换就能得到这一帧语音的频谱。具体包括：对语音进行分帧，其中帧长范围是20ms～30ms，帧移范围取值是10ms～15ms；对语音的每一帧进行加窗，采用汉宁窗或海明窗；对加窗后的语音进行短时傅立叶变换，采用原始傅立叶变化公式，或是对加窗后的语音补零到2的整数次方利用快速傅立叶变换求解得到语音频谱。Two, windowing the speech signal after the pre-emphasis and adopting short-time Fourier transform to obtain the speech spectrum Inside is a steady-state signal, and this segment of speech is called a frame. In order to reduce the truncation effect, a window is usually added to this speech, and the common ones are Hanning window and Hamming window. The frequency spectrum of this frame of speech can be obtained by performing short-time Fourier transform on a frame of speech signal after windowing. Specifically include: framing the voice, where the frame length range is 20ms to 30ms, and the frame shift range is 10ms to 15ms; adding a window to each frame of the voice, using a Hanning window or a Hamming window; Short-time Fourier transform is performed on the final speech, and the original Fourier transform formula is used, or the speech spectrum is obtained by filling the windowed speech with zeros to the integer power of 2 by fast Fourier transform.

三、对语音频谱平方，求取功率谱3. Find the power spectrum by squaring the speech spectrum

为了得到语音信号的功率谱P(w)，我们将短时傅立叶变换后的实部和虚部分别求取平方并求和。公式如下所示：In order to obtain the power spectrum P(w) of the speech signal, we square and sum the real and imaginary parts after the short-time Fourier transform respectively. The formula looks like this:

P(w)＝Re[S(w)]²+Im[S(w)]² (2)P(w)＝Re[S(w)] ² +Im[S(w)] ² (2)

其中S(w)表示短时傅立叶谱，Re[S(w)]和Im[S(w)]分别表示短时傅立叶谱的实部和虚部。where S(w) represents the short-time Fourier spectrum, and Re[S(w)] and Im[S(w)] represent the real and imaginary parts of the short-time Fourier spectrum, respectively.

四、采用滤波器组对功率谱进行处理Fourth, the use of filter banks to process the power spectrum

人耳对不同频率的语音具有不同的感知能力，实验发现，在1000Hz以下，感知能力与频率成线性关系，而在1000Hz以上，感知能力则与频率成对数关系。为了模拟人耳对不同频率语音的感知特性，通常采用滤波器组对线性频谱进行转换。采用的滤波器组可以是梅尔滤波器组(Mel-filterbank)或Gamma-tone滤波器组，而且通道数可以根据不同的滤波器选取不同的数目；The human ear has different perceptual abilities to different frequencies of speech. Experiments have found that below 1000 Hz, the perceptual ability has a linear relationship with the frequency, while above 1000 Hz, the perceptual ability has a logarithmic relationship with the frequency. In order to simulate the human ear's perceptual characteristics of speech at different frequencies, a filter bank is usually used to transform the linear spectrum. The filter bank used can be a Mel-filter bank or a Gamma-tone filter bank, and the number of channels can be selected according to different filters;

在本发明一个优选实施例中采用了Gamma-tone滤波器组。它有若干通道，这些通道的中心频率线性的分布在等效矩形带宽中。In a preferred embodiment of the invention a Gamma-tone filter bank is used. It has several channels whose center frequencies are distributed linearly in the equivalent rectangular bandwidth.

那么利用Gamma-tone滤波器组求和得到含噪语音的短时功率谱就如下式：Then the short-term power spectrum of the noisy speech is obtained by summing the Gamma-tone filter bank as follows:

$P P [[m m,, l l]] = = {Σ Σ}_{k k = = 00}^{((K K / / 22)) - - 11} {| | X x [[m m,, {e e}^{j j {w w}_{k k}}]] {H h}_{l l} (({e e}^{{jw jw}_{k k}})) | |}^{22} - - - - - - ((33))$

其中m和l分别表示帧和通道的索引，K为傅立叶变换的点数，w_k＝2π/F_s，F_s表示语音信号的采样频率。表示语音第m帧第频率的幅度，则表示第l通道频率的Gamma-tone滤波值。Among them, m and l represent the index of frame and channel respectively, K is the point number of Fourier transform, w _k =2π/F _s , and F _s represents the sampling frequency of the speech signal. Indicates the mth frame of speech the magnitude of the frequency, then it means the lth channel Frequency gamma-tone filter value.

五、采用帧平均的方式求取中等时长功率谱5. Calculate the medium-duration power spectrum by means of frame averaging

由于噪声的变化往往比语音的变化缓慢，因此估计噪声的时候需要求取一个比普通窗更长的窗，在本发明的特征提取方法中，采用帧平均的方式获得几个窗的均值来描述一个更长的窗。但是不能够对所有的语音都用这么长的窗，是因为窗长太大的话会使语音识别率降低。帧平均求取含噪语音的中等时长功率谱的公式表示如下：Since the change of noise is often slower than the change of speech, it is necessary to obtain a longer window than the ordinary window when estimating the noise. In the feature extraction method of the present invention, the average value of several windows is obtained by frame averaging to describe A longer window. But it is not possible to use such a long window for all voices, because if the window length is too large, the voice recognition rate will be reduced. The formula for calculating the medium-duration power spectrum of noisy speech by frame averaging is expressed as follows:

$Q Q [[m m,, l l]] = = \frac{11}{22 M m + + 11} {Σ Σ}_{{m m}^{' '} = = m m - - M m}^{m m + + M m} P P [[{m m}^{' '},, l l]] - - - - - - ((44))$

其中m和l分别表示帧和通道的索引，M表示求取中等时长的时候，分别向前和向后所取的帧数之和。Among them, m and l represent the index of frame and channel respectively, and M represents the sum of the number of frames taken forward and backward when calculating the medium duration.

六、对含噪语音的中等时长功率谱进行不对称滤波和掩蔽抗噪6. Asymmetric filtering and masking anti-noise for medium-duration power spectrum of noisy speech

由于噪声在某些频率的变化较快，所以为了更加准确的跟踪噪声，需要对不同通道的噪声进行不相同的处理，因此在此处引入了包含掩蔽的不对称低通滤波抗噪模块。具体流程如图2所示。Since the noise changes rapidly at certain frequencies, in order to track the noise more accurately, the noise of different channels needs to be processed differently, so an asymmetric low-pass filter anti-noise module including masking is introduced here. The specific process is shown in Figure 2.

图2为本发明中包含掩蔽的不对称低低通滤波抗噪模块的结构流程图。Fig. 2 is a structural flowchart of the masked asymmetric low-pass filter anti-noise module in the present invention.

在图2中，第一个不对称低通滤波器可以用如下公式描述：In Figure 2, the first asymmetric low-pass filter can be described by the following formula:

${Q Q}_{le let's go} [[m m,, l l]] = = \{\begin{matrix} {λ λ}_{a a} {Q Q}_{le let's go} [[m m - - 11,, l l]] + + ((11 - - {λ λ}_{a a})) Q Q [[m m,, l l]],, ifQ if Q [[m m,, l l]] &GreaterEqual; &Greater Equal; {Q Q}_{le let's go} [[m m - - 11,, l l]] \\ {λ λ}_{b b} {Q Q}_{le let's go} [[m m - - 11,, l l]] + + ((11 - - {λ λ}_{b b})) Q Q [[m m,, l l]],, ifQ if Q [[m m,, l l]] < < {Q Q}_{le let's go} [[m m - - 11,, l l]] \end{matrix} - - - - - - ((55))$

其中λ_a和λ_b为可调参数，取值范围是(0～1)。通过上式获得的Q_le[m，l]同Q[m，l]相减之后，再经过一个半波整流模块得到Q_o[m，l]。相减和整流的具体操作如公式(6)所示。Among them, λ _a and λ _b are adjustable parameters, and the value range is (0-1). Q _le [m, l] obtained by the above formula is subtracted from Q [m, l], and then Q _o [m, l] is obtained through a half-wave rectification module. The specific operations of subtraction and rectification are shown in formula (6).

Q_o[m，l]＝max(Q[m，l]-Q_le[m，l]，0) (6)Q _o [m, l] = max(Q [m, l] - Q _le [m, l], 0) (6)

Q_o[m，l]分别送入到掩蔽模块和第二个不对称低通滤波器。第二个不对称低通滤波器和上述的第一个不对称低通滤波器相同，内部结构仍然可用公式(5)来获得，只是输入由Q[m，l]变成了Q_o[m，l]，输出由Q_le[m，l]变成了Q_f[m，l]。第二个不对称低通滤波器所产生的值Q_f[m，l]将作为谱底功率，即功率谱的最小值。第二个不对称低通滤波器是为了防止不对称滤波和掩蔽整体作用的输出值的值太小而引起不必要的音乐噪声。另一方面，Q_o[m，l]经过掩蔽模块得到Q_tm[m，l]，这一步骤在后面详细叙述。Q_tm[m，l]和Q_f[m，l]共同输入到最大值模块，利用如下公式得到R_sp[m，l]：Q _o [m, l] are respectively fed into the masking module and the second asymmetric low-pass filter. The second asymmetric low-pass filter is the same as the above-mentioned first asymmetric low-pass filter, and the internal structure can still be obtained by formula (5), but the input is changed from Q[m, l] to Q _o [m , l], the output changes from Q _le [m, l] to Q _f [m, l]. The value Q _f [m, l] produced by the second asymmetric low-pass filter will be used as the spectral floor power, that is, the minimum value of the power spectrum. The second asymmetric low-pass filter is to prevent the output value of asymmetric filtering and masking the overall effect from being too small and causing unnecessary musical noise. On the other hand, Q _o [m, l] obtains Q _tm [m, l] through the masking module, and this step will be described in detail later. Q _tm [m, l] and Q _f [m, l] are jointly input to the maximum value module, and R _sp [m, l] is obtained by using the following formula:

R_sp[m，l]＝max(Q_tm[m，l]，Q_f[m，l]) (7)R _sp [m, l] = max(Q _tm [m, l], Q _f [m, l]) (7)

最后经过一个选择性开关来确定结果输出R[m，l]的取值。这个选择性的开关由如下公式来确定：Finally, a selective switch is used to determine the value of the result output R[m, l]. This optional switch is determined by the following formula:

$R R [[m m,, l l]] = = \{\begin{matrix} {R R}_{sp sp} [[m m,, l l]],, & ifQ if Q [[m m,, l l]] &GreaterEqual; &Greater Equal; {cQ cQ}_{le let's go} [[m m,, l l]] \\ {Q Q}_{f f} [[m m,, l l]],, & ifQ if Q [[m m,, l l]] < < {cQ cQ}_{le let's go} [[m m,, l l]] \end{matrix} - - - - - - ((88))$

其中c是可调参数，如可以选择c＝2。这个公式的意义在于，如果一个语音片段的中等时长功率不能够大于c＝2倍的其自身谱底功率的话，就认为这段语音是静音状态，因此输出值应该为谱底功率。Where c is an adjustable parameter, for example, c=2 can be selected. The significance of this formula is that if the medium duration power of a speech segment cannot be greater than its own spectral bottom power of c=2 times, it is considered that this section of speech is silent, so the output value should be the spectral bottom power.

上面所述的处理过程都是由图2中描述的包含掩蔽的不对称低通滤波抗噪模块实现的。下面具体描述图2中的掩蔽模块。其结构如图3所示。首先输入Q_o[m，l]经过MAX模块得到Q_p[m，l]，公式如下：The above-mentioned processing procedures are all realized by the asymmetric low-pass filter anti-noise module including masking described in FIG. 2 . The masking module in Fig. 2 is described in detail below. Its structure is shown in Figure 3. First input Q _o [m, l] to get Q _p [m, l] through the MAX module, the formula is as follows:

Q_p[m，l]＝max(λ_tQ_p[m-1，l]，Q_o[m，l]) (9)Q _p [m, l] = max(λ _t Q _p [m-1, l], Q _o [m, l]) (9)

其中λ_t是遗忘系数，取值范围是(0～1)。掩蔽模块最后的输出值Q_tm[m，l]也是由选择性开关决定，其公式描述如下：Among them, λ _t is the forgetting coefficient, and the value range is (0~1). The final output value Q _tm [m, l] of the masking module is also determined by the selective switch, and its formula is described as follows:

${Q Q}_{tm tm} [[m m,, l l]] = = \{\begin{matrix} {Q Q}_{o o} [[m m,, l l]] & {Q Q}_{o o} [[m m,, l l]] &GreaterEqual; &Greater Equal; {λ λ}_{t t} {Q Q}_{p p} [[m m - - 11,, l l]] \\ {μ μ}_{t t} {Q Q}_{p p} [[m m - - 11,, l l]] & {Q Q}_{o o} [[m m,, l l]] < < {λ λ}_{t t} {Q Q}_{p p} [[m m - - 11,, l l]] \end{matrix} - - - - - - ((1010))$

其中μ_t为对应的参数，取值范围是(0～1)。掩蔽模块的输出值Q_tm[m，l]同第二个不对称滤波器的输出Q_f[m，l]经过公示(7)描述的最大值模块得到了R_sp[m，l]，R_sp[m，l]最后同谱底Q_f[m，l]经公式(8)描述的选择性开关后，最终得到对含噪语音的中等时长功率谱进行不对称滤波和掩蔽抗噪后的结果R[m，l]。Where μ _t is the corresponding parameter, and the value range is (0-1). The output value Q _tm [m, l] of the masking module is the same as the output Q _f [m, l] of the second asymmetric filter. After the maximum module described in the publicity (7), R sp [m, l] is obtained, and R _sp [m, l], R _sp [m, l] is finally the same as the spectral bottom Q _f [m, l] after the selective switch described by formula (8), and finally obtain the medium-duration power spectrum of noisy speech after asymmetric filtering and masking anti-noise Result R[m,l].

通过上面的描述，计算出了不对称滤波和掩蔽抗噪后的输出R[m，l]，这个值代表纯净语音的中等时长功率谱，它与帧平均后的含噪语音中等时长功率谱Q[m，l]的比值可以描述含噪语音功率谱中纯净语音功率所占比例，我们用H[m，l]来表示。用公式表示如下：Through the above description, the output R[m, l] after asymmetric filtering and masking anti-noise is calculated. This value represents the medium-duration power spectrum of pure speech, which is the same as the medium-duration power spectrum Q of noisy speech after frame averaging. The ratio of [m, l] can describe the proportion of pure speech power in the noisy speech power spectrum, and we use H[m, l] to represent it. The formula is as follows:

$H h ((m m,, l l)) = = \frac{R R [[m m,, l l]]}{Q Q [[m m,, l l]]} - - - - - - ((1111))$

七、通道平均和抗噪整合7. Channel averaging and anti-noise integration

由于通道和通道之间的阈值是不同的，并且处理也常常是基于一个语音片段，因此通道之间的平滑是很有必要的。我们用如下公式来进行通道平均，得到通道平均权重H_s[m，l]：Since the thresholds are different from channel to channel, and processing is often based on a speech segment, smoothing between channels is necessary. We use the following formula for channel averaging to obtain the channel average weight H _s [m, l]:

${H h}_{s the s} [[m m,, l l]] = = ((\frac{11}{{l l}_{22} - - {l l}_{11} + + 11} {Σ Σ}_{{l l}^{' '} = = {l l}_{11}}^{{l l}_{22}} H h [[m m,, l l]])) - - - - - - ((1212))$

其中l₂＝min(l+N，L)，l₁＝max(l-N，1)，L表示滤波器通道的个数，N表示求取通道平均的时候，向前和向后观望的通道总数。所述经过通道平均的权重H_s[m，l]用来调制含噪语音的短时功率谱，以得到纯净语音的短时功率谱，公式如下：Wherein l ₂ =min(l+N, L), l ₁ =max(lN, 1), L represents the number of filter channels, N represents the total number of channels looking forward and backward when calculating the channel average . The channel-averaged weight H _s [m, l] is used to modulate the short-term power spectrum of noisy speech to obtain the short-term power spectrum of pure speech, the formula is as follows:

T[m，l]＝P[m，l]H_s[m，l] (13)T[m,l]=P[m,l]H _s [m,l] (13)

八、对纯净语音的短时功率谱进行能量归一化处理，以消除乘性噪声8. Perform energy normalization on the short-term power spectrum of pure speech to eliminate multiplicative noise

在MFCC等传统的特征提取算法中，为了拟合人的生理特性，采用了对数操作，这样就将特征提取算法中的乘性操作带来的噪声变为了加性的信息，最后可以通过均值归一化去掉。但是在本发明的特征提取方法中，采用指数的操作来拟合人的生理特性，这样乘性操作带来的噪声是不能通过均值归一化消除掉的，因此添加此步骤，为的是能够消除这一乘性噪声。In traditional feature extraction algorithms such as MFCC, in order to fit the physiological characteristics of people, logarithmic operations are used, so that the noise caused by the multiplicative operation in the feature extraction algorithm is changed into additive information, and finally through the mean Normalization removed. However, in the feature extraction method of the present invention, exponential operations are used to fit human physiological characteristics, so that the noise caused by multiplicative operations cannot be eliminated by mean value normalization, so this step is added in order to be able to Remove this multiplicative noise.

由于本发明的特征提取方法是在线的特征，因此不能获得所有帧的均值。本发明中采用动态更新的均值来代替整条语音的均值，公式如下所示：Since the feature extraction method of the present invention is an online feature, the mean value of all frames cannot be obtained. In the present invention, the mean value of the dynamic update is adopted to replace the mean value of the whole voice, and the formula is as follows:

$μ μ [[m m]] = = {λ λ}_{μ μ} μ μ [[m m - - 11]] + + \frac{11 - - {λ λ}_{μ μ}}{L L} {Σ Σ}_{l l = = 00}^{L L - - 11} T T [[m m,, l l]] - - - - - - ((1414))$

其中L表示滤波器通道的个数，λ_μ表示遗忘系数，取值范围是(0～1)。利用这个均值来对每个通道的纯净语音短时功率谱进行归一化就能够消除掉乘性噪声的影响。这一步骤的公式如下所示：Among them, L represents the number of filter channels, and λ _μ represents the forgetting coefficient, and the value range is (0～1). Using this mean value to normalize the short-term power spectrum of pure speech in each channel can eliminate the influence of multiplicative noise. The formula for this step is as follows:

$U u [[m m,, l l]] = = k k \frac{T T [[m m,, l l]]}{μ μ [[m m]]} - - - - - - ((1515))$

其中k是任意常数，利用这样的在线处理，能够使在线特征达到离线的效果。Where k is an arbitrary constant, using such online processing, the online feature can achieve the effect of offline.

九、对能量归一化处理后的纯净语音的短时功率谱进行等响度加重9. Perform equal loudness emphasis on the short-term power spectrum of pure speech after energy normalization processing

不同频率等响时它们的声压强度是不同的。为了补偿人耳对频率的这种偏差，需要对功率谱进行等响度预加重处理。通常用每个通道的中心频率作为这一通道的频率来对这一通道进行补偿，而补偿的公式有多种多样，本发明中采用的等响度权重公式如下所示：Their sound pressure intensity is different when different frequencies are equal. In order to compensate for this deviation of the human ear to frequency, equal loudness pre-emphasis is required on the power spectrum. Usually the center frequency of each channel is used as the frequency of this channel to compensate this channel, and there are various compensation formulas. The equal loudness weight formula adopted in the present invention is as follows:

$E E. (({w w}_{l l})) = = \frac{(({w w}_{l l}^{22} + + 1.44 1.44 \times \times 1010^{66})) {w w}_{l l}^{44}}{{(({w w}_{l l}^{22} + + 1.6 1.6 \times \times 1010^{55}))}^{22} \times \times (({w w}_{l l}^{22} + + 9.61 9.61 \times \times 1010^{66}))} - - - - - - ((1616))$

其中w表示频率，而l表示通道的索引，w_l是第l通道的频率，即第l通道的中心频率。Where w represents the frequency, and l represents the index of the channel, w _l is the frequency of the lth channel, that is, the center frequency of the lth channel.

对能量归一化处理后的纯净语音的短时功率谱进行等响度加重采用下面的公式：The following formula is used for equal loudness emphasis on the short-term power spectrum of pure speech after energy normalization processing:

O[m，l]＝U[m，l]·E(w_l) (17)O[m,l]=U[m,l]·E(w _l ) (17)

其中m，l分别是帧和通道的索引。where m, l are the indices of frame and channel respectively.

十、对等响度加重后的纯净语音短时功率谱进行指数操作10. Exponential operation on the short-term power spectrum of pure speech after equal loudness emphasis

为了更好的拟合人的听觉模型，将强度转化为响度，需要对功率谱进行非线性的压缩，在传统的PLP特征中，采用了立方根的非线性；而在传统的MFCC中，采用了对数非线性的方式。在本发明的特征提取方法中，采用了指数非线性的方式。公式如下：In order to better fit the human auditory model and convert the intensity into loudness, the power spectrum needs to be compressed nonlinearly. In the traditional PLP feature, the cubic root nonlinearity is used; in the traditional MFCC, the in a logarithmic non-linear fashion. In the feature extraction method of the present invention, an exponential nonlinear method is adopted. The formula is as follows:

L[m，l]＝O[m，l]^θ (18)L[m,l]=O[m,l] ^θ (18)

其中θ为指数非线性的参数。where θ is an exponential nonlinear parameter.

十一、对指数非线性变换后的纯净语音短时功率谱进行傅立叶逆变换11. Inverse Fourier transform of the pure speech short-term power spectrum after exponential nonlinear transformation

对指数非线性变换后的纯净语音短时功率谱进行傅立叶逆变换是为了求取语音信号的倒谱系数，进而获得语音特征。这里的傅立叶逆变化采用了基本的傅立叶逆变换方法。The purpose of performing inverse Fourier transform on the pure speech short-time power spectrum after exponential nonlinear transformation is to obtain the cepstral coefficient of the speech signal, and then obtain the speech features. The inverse Fourier transform here uses the basic inverse Fourier transform method.

十二、求取信号的倒谱系数12. Find the cepstrum coefficient of the signal

为了获得倒谱系数，本发明的方法中，首先采用了Durbin递推算法来求取线性预测系数，然后利用所求取的线性预测系数，根据递推公式获得相应的倒谱系数。递推公式如下：In order to obtain the cepstral coefficients, in the method of the present invention, the Durbin recursive algorithm is first used to obtain the linear prediction coefficients, and then the corresponding cepstral coefficients are obtained according to the recursive formula by using the obtained linear prediction coefficients. The recursion formula is as follows:

${c c}_{n no} = = \{\begin{matrix} {a a}_{n no} + + {Σ Σ}_{m m = = 11}^{n no - - 11} {kc kc}_{m m} {a a}_{n no - - m m} / / n no,, if if 11 \leq \leq n no \leq \leq p p + + 11 \\ {a a}_{n no} + + {Σ Σ}_{m m = = n no - - p p}^{n no - - 11} {kc kc}_{m m} {a a}_{n no - - m m} / / n no,, ifn ifn > > p p + + 11 \end{matrix} - - - - - - ((1919))$

其中a是线性预测系数，k是反射系数，它们都是由Durbin递推算法根据步骤十一中所述傅立叶逆变换的自相关方程求得，另外n是到谱系数的索引，p是模型的阶数。Where a is the linear prediction coefficient, k is the reflection coefficient, they are all obtained by the Durbin recursive algorithm according to the autocorrelation equation of the Fourier inverse transform described in step 11, and n is the index to the spectral coefficient, p is the model Order.

十三、对倒谱系数进行均值归一化13. Normalize the mean value of the cepstral coefficient

虽然在步骤八中进行了能量归一化，但是均值归一化还是有必要的，至少均值归一化不会带来负面的影响。均值归一化是将倒谱系数的所有维分别求取所有帧的平均值，然后将每一帧倒谱系数的每一维都减去相应维的到谱系数的均值。由于本发明的特征提取方法是在线的，因此均值也是将当前帧之前的所有帧求平均。Although the energy normalization is performed in step 8, the mean normalization is still necessary, at least the mean normalization will not bring negative effects. Mean normalization is to calculate the average value of all frames for all dimensions of the cepstral coefficient, and then subtract the mean value of the spectral coefficient of the corresponding dimension from each dimension of the cepstral coefficient of each frame. Since the feature extraction method of the present invention is online, the mean is also the average of all frames before the current frame.

下面结合附图，说明本发明所述的一种用于鲁棒语音识别的语音特征提取方法的实例，针对16KHz采样频率的语音，具体介绍如下。In the following, an example of a speech feature extraction method for robust speech recognition according to the present invention will be described in conjunction with the accompanying drawings. For speech with a sampling frequency of 16KHz, the specific introduction is as follows.

1.对语音信号进行预加重，加重系数α采用0.97。系统函数如公式(1)所示。1. Pre-emphasize the speech signal, and the emphasis coefficient α is 0.97. The system function is shown in formula (1).

2.语音帧长采用25ms，帧移采用10ms，加海明窗，并将一帧语音的400点尾部补零至512点，然后采用快速傅立叶变换求取语音频谱。2. The voice frame length is 25ms, the frame shift is 10ms, add a Hamming window, and fill the tail of 400 points of a frame of voice with zeros to 512 points, and then use the fast Fourier transform to obtain the voice spectrum.

3.利用所求取的语音频谱，根据公式(2)求取语音功率谱。3. Utilize the obtained speech frequency spectrum to obtain the speech power spectrum according to the formula (2).

4.采用Gamma-tone滤波器组对功率谱进行处理，通道个数采用40，采用的公式如(3)所示。4. Use the Gamma-tone filter bank to process the power spectrum, the number of channels is 40, and the formula used is shown in (3).

5.采用帧平均的方式求取中等时长的功率谱，计算公式如上所述公式(4)，其中M＝2，即利用当前帧与其前面两帧和后面两帧的平均功率作为含噪语音的中等时长功率来代替原单个帧的中等时长功率，这个时长为[(2M+1)-1]*10ms+25ms＝65ms。5. Adopt the mode of frame average to ask for the power spectrum of medium time length, calculation formula is as above-mentioned formula (4), wherein M=2, promptly utilize the average power of current frame and its preceding two frames and following two frames as noise-containing speech The medium-duration power is used to replace the original medium-duration power of a single frame, and the duration is [(2M+1)-1]*10ms+25ms=65ms.

6.对功率谱进行不对称滤波处理，跟踪语音中的噪声，同时对功率谱进行掩蔽处理，得到纯净语音功率谱。在这一步骤中，对包含掩蔽的不对称低通滤波抗噪模块按照实施方法中的公式进行计算。所用到的公式有(5，6，7，8，9，10，11)，其中具体参数的取值描述如下：6. Perform asymmetric filter processing on the power spectrum to track the noise in the voice, and at the same time mask the power spectrum to obtain a pure voice power spectrum. In this step, the asymmetric low-pass filter anti-noise module including masking is calculated according to the formula in the implementation method. The formulas used are (5, 6, 7, 8, 9, 10, 11), and the values of the specific parameters are described as follows:

λ_a＝0.999，λ_b＝0.5λ _a =0.999, λ _b =0.5

c＝2c=2

λ_t＝0.85，μ_t＝0.2λ _t = 0.85, μ _t = 0.2

7.对纯净语音和带噪语音功率谱的比值进行通道平均，以进行平滑，采用的公式为(12)，其中N＝4(即向前和向后分别观望的通道数为4)，即将9个通道的值进行平滑。将平滑后的纯净语音和带噪语音的功率谱比值同滤波器组输出的含噪语音的短时功率谱相乘，即公式(13)，得到纯净语音的短时功率谱。7. Carry out channel average to the ratio of pure speech and noisy speech power spectrum, to carry out smoothing, the formula that adopts is (12), and wherein N=4 (the channel number that promptly waits and sees forward and backward is 4 respectively), is about to The values of 9 channels are smoothed. Multiply the power spectrum ratio of the smoothed pure speech to the noisy speech with the short-term power spectrum of the noisy speech output by the filter bank, that is, formula (13), to obtain the short-term power spectrum of the pure speech.

8.对纯净语音的短时功率谱进行能量归一化处理，以消除乘性噪声，如公式(15)所示。其中均值采用动态估计，如公式(14)所示，初始值从数据集中统计得到。8. Perform energy normalization processing on the short-term power spectrum of pure speech to eliminate multiplicative noise, as shown in formula (15). The mean value is dynamically estimated, as shown in formula (14), and the initial value is obtained from the statistics of the data set.

9.功率谱进行等响度加重，使其符合人耳听觉效应。9. The equal loudness of the power spectrum is emphasized to make it conform to the auditory effect of the human ear.

10.对功率谱进行指数操作，使其符合人的生理特征，这里指数非线性参数θ选取1/15。10. Perform an exponential operation on the power spectrum to make it conform to human physiological characteristics, where the exponential nonlinear parameter θ is selected as 1/15.

11.对功率谱进行傅立叶逆变换，可以利用傅立叶逆变换的基本公式进行计算，因为本身点数就很少，计算量不大。11. The inverse Fourier transform of the power spectrum can be calculated using the basic formula of the inverse Fourier transform, because the number of points itself is small and the amount of calculation is not large.

12.求取信号的倒谱系数时，选取的线性预测系数为12个，倒谱系数也是12个，公式如(17)所示。12. When calculating the cepstral coefficient of the signal, the selected linear prediction coefficient is 12, and the cepstral coefficient is also 12. The formula is shown in (17).

13.对倒谱系数进行均值归一化处理，最终得到本发明方法的语音特征。13. Carry out mean value normalization processing to cepstral coefficient, finally obtain the speech feature of the method of the present invention.

本发明提出的所述特征提取方法与常用特征提取方法的效果对比：The feature extraction method proposed by the present invention is compared with the effects of commonly used feature extraction methods:

利用本发明所述特征提取方法在863桌面语音集上提取语音的特征，同时使用PLP特征提取方法和欧洲电信标准协会(ETSI)的高级抗噪前段(AFE)特征提取863桌面语音集的语音特征。利用这三个特征集，分别在相同的条件下采用HTK工具训练声学模型。然后，选取1000条纯净朗读录音，加入模拟白噪声，然后利用上述的三种特征提取方法分别提取特征。此外，对一个随机谈话录音集进行标注，得到7072条纯净录音和360条带噪声录音，仍然使用上述的三种特征提取方法分别提取语音特征。Utilize feature extraction method described in the present invention to extract the feature of voice on 863 desktop voice collections, use the advanced anti-noise front end (AFE) feature extraction of PLP feature extraction method and European Telecommunications Standards Institute (ETSI) simultaneously the voice feature of 863 desktop voice collections . Using these three feature sets, the acoustic model is trained using HTK tools under the same conditions. Then, select 1000 pure reading recordings, add simulated white noise, and then use the above three feature extraction methods to extract features respectively. In addition, a set of random conversation recordings is marked, and 7072 clean recordings and 360 noise recordings are obtained, and the above three feature extraction methods are still used to extract speech features respectively.

利用上述声学模型和其对应的特征进行语音识别，语言模型都采用同一个3元语言模型，识别器采用HTK工具中的解码器。这里采用词错误率(WER)来评估语音识别性能，其中，PNPLP是本发明的特征提取算法的名称。WER的计算公式如下：The above acoustic model and its corresponding features are used for speech recognition, the language model uses the same 3-gram language model, and the recognizer uses the decoder in the HTK tool. Here, word error rate (WER) is used to evaluate speech recognition performance, wherein PNPLP is the name of the feature extraction algorithm of the present invention. The calculation formula of WER is as follows:

在模拟白噪声的测试条件下，各种特征的性能如表1所示。表1中可以看出，在没有噪声的纯净语音情况下，PLP特征表现出很好的性能，但随着噪声的加大，PLP性能逐渐变差。欧洲电信标准协会的抗噪特征(AFE)在噪声上能够表现出一定的效果，但是本发明的特征提取方法抗噪性能要远远优于欧洲电信标准协会的抗噪算法。Under the test conditions of simulated white noise, the performance of various features is shown in Table 1. It can be seen from Table 1 that in the case of pure speech without noise, the PLP feature shows good performance, but as the noise increases, the PLP performance gradually deteriorates. The anti-noise feature (AFE) of the European Telecommunications Standards Institute can show certain effects on noise, but the anti-noise performance of the feature extraction method of the present invention is far better than the anti-noise algorithm of the European Telecommunications Standards Institute.

表2是各种特征提取算法在真实的测试集上的实验结果。从附表中可以看出，本发明所述特征提取方法抗噪性能突出，比欧洲电信标准协会的抗噪性能要好出很多。另外，本发明所述抗噪特征提取算法在纯净语音集上相对于经典的PLP算法有略微下降，但同欧洲电信标准协会的抗噪算法对比，本发明所述的抗噪特征提取算法仍然要好很多。Table 2 is the experimental results of various feature extraction algorithms on the real test set. It can be seen from the attached table that the feature extraction method of the present invention has outstanding anti-noise performance, which is much better than the anti-noise performance of the European Telecommunications Standards Institute. In addition, the anti-noise feature extraction algorithm of the present invention has a slight decline compared with the classic PLP algorithm on the pure speech set, but compared with the anti-noise algorithm of the European Telecommunications Standards Institute, the anti-noise feature extraction algorithm of the present invention is still better a lot of.

表1Table 1

表2Table 2

WERWER PLPPLP AFEAFE PNPLPPNPLP clearclear 11.64％11.64% 13.68％13.68% 12.07％12.07% noisenoise 35.89％35.89% 34.21％34.21% 33.36％33.36%

以上所述的具体实施例，对本发明的目的、技术方案和有益效果进行了进一步详细说明，所应理解的是，以上所述仅为本发明的具体实施例而已，并不用于限制本发明，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The specific embodiments described above have further described the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above descriptions are only specific embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims

1. for a Speech Feature Extraction for robust speech identification, it is characterized in that, the method comprises:

Step 1, obtain the power spectrum of voice signal;

Step 2, by obtained power spectrum by bank of filters processing, obtain the short-time rating spectrum of noisy speech;

Step 3, according to the short-time rating spectrum of obtained noisy speech, adopt the average mode of frame to ask for the medium duration power spectrum of noisy speech;

Step 4, the medium duration power spectrum of obtained noisy speech is carried out asymmetric filtering and shelters anti-noise, to obtain the medium duration power spectrum of clean speech;

Step 5, obtain the short-time rating spectrum of clean speech according to the short-time rating spectrum of the medium duration power spectrum of the medium duration power spectrum of described clean speech, noisy speech and noisy speech;

Step 6, the short-time rating spectrum of clean speech is carried out to energy normalized processing, to eliminate the property taken advantage of noise;

Step 7, offset and the loudness such as carry out except the short-time rating spectrum of the clean speech of the property taken advantage of noise and increase the weight of;

The short-time rating spectrum of the clean speech after step 8, reciprocity loudness increase the weight of is carried out index nonlinear operation;

Step 9, the short-time rating spectrum of having carried out the clean speech after index nonlinear operation is carried out to inverse fourier transform, to ask for cepstrum coefficient, cepstrum coefficient is carried out to average normalized, finally obtain phonetic feature;

Wherein, the frequency spectrum that obtains voice signal described in step 1 further comprises following content:

Step 11, to voice signal adopt formula (1) carry out pre-emphasis:

y_{t} = \{\begin{matrix} x_{t} - α \cdot x_{t - 1} & ift > 0 \\ (1 - α) \cdot x_{t} & ift = 0 \end{matrix} - - - (1)

Wherein α is called as pre emphasis factor, and x is speech sample point, and y is the speech sample point value after pre-emphasis, the index that t is sampled point;

Step 12, each frame of the voice after pre-emphasis is carried out to windowing, adopt Hanning window or hamming window, voice after windowing are carried out to short time discrete Fourier transform, comprise and adopt original Fourier to change formula, or utilize fast fourier transform to solve to the integer power of the voice zero padding to 2 after windowing;

Step 13, the real part after Short Time Fourier Transform and imaginary part are asked for respectively square and summation, to obtain the power spectrum of voice signal, as shown in formula (2):

P(w)＝Re[S(w)] ²+Im[S(w)] ² (2)

The power spectrum that wherein P (w) is voice signal, S (w) represents fourier spectra in short-term, Re[S (w)] and Im[S (w)] real part and the imaginary part of fourier spectra in short-term represented respectively.

2. the Speech Feature Extraction for robust speech identification according to claim 1, it is characterized in that, the bank of filters adopting in described step 2 is Mel bank of filters Mel-filter bank or Gamma-tone bank of filters, port number is chosen different numbers according to different wave filters, wherein, the short-time rating that utilizes the summation of Gamma-tone bank of filters to obtain noisy speech is composed, as shown in formula (3):

P [m, l] = Σ_{k = 0}^{(K / 2) - 1} {| X [m, e^{{jw}_{k}}] H_{l} (e^{{jw}_{k}}) |}^{2} - - - (3)

Wherein p[m, l] be the short-time rating spectrum of noisy speech, m and l represent respectively the index of frame and bank of filters passage, what K was Fourier transform counts, w _k=2 π/F _s, F _srepresent the sample frequency of voice signal, represent voice signal m frame the the amplitude of frequency, represent l passage the Gamma-tone filter value of frequency.

3. the Speech Feature Extraction for robust speech identification according to claim 1, is characterized in that, adopts the average mode of frame to ask for the medium duration power spectrum of noisy speech in step 3, as shown in formula (4):

Q [m, l] = \frac{1}{2 M + 1} Σ_{m^{'} = m - M}^{m + M} P [m^{'}, l] - - - (4)

Wherein Q[m, l] be the medium duration power spectrum of noisy speech, m and l represent respectively the index of frame and bank of filters passage, when M represents to ask for medium duration, the frame number sum got forward and backward respectively, P[m ', l] be the noisy speech short-time rating spectrum of the m ' frame.

4. the Speech Feature Extraction for robust speech identification according to claim 1, is characterized in that, described in step 4, the medium duration power spectrum of obtained noisy speech is carried out asymmetric filtering and shelter anti-noise specifically comprising following step:

Step 41, medium obtained noisy speech duration power spectrum is carried out to filtering through first asymmetric low-pass filter, and the Output rusults that medium described noisy speech duration power spectrum is deducted to described first asymmetric low-pass filter is to integrate; Wherein said first asymmetric low-pass filter as formula (5) represent:

Q_{le} [m, l] = \{\begin{matrix} λ_{a} Q_{le} [m - 1, l] + (1 - λ_{a}) Q [m, l], ifQ [m, l] &GreaterEqual; Q_{le} [m - 1, l] \\ λ_{b} Q_{le} [m - 1, l] + (1 - λ_{b}) Q [m, l], ifQ [m, l] < Q_{le} [m - 1, l] \end{matrix} - - - (5)

Wherein m and l represent respectively the index of frame and bank of filters passage, Q _le[m, l] is the output of described first asymmetric low-pass filter; Q[m, l] be the medium duration power spectrum of described noisy speech, λ _aand λ _bfor adjustable parameter, span is (0～1);

Step 42, the result after described integration is obtained to Q through a half-wave rectification block _o[m, l], by Q _o[m, l] sends into respectively masking block and second asymmetric low-pass filter processed, and described second asymmetric low-pass filter is identical with first asymmetric low-pass filter, and the output of second dissymetrical filter is as spectrum end power; The Q that wherein half-wave rectification block obtains _othe formula of [m, l] is expressed as follows:

Q _o[m，l]＝max(Q[m，l]-Q _le[m，l]，0) (6)

Step 43, the described Q obtaining through half-wave rectification block _o[m, l] obtains result Q after masking block is processed _tm[m, l], and Q _o[m, l] obtains result Q after second asymmetric low-pass filter processed _f[m, l], afterwards, described Q _tm[m, l] and described Q _f[m, l] is input to maximal value module and obtains result R _sp[m, l]; Wherein maximal value module is as shown in formula (7):

R _sp[m，l]＝max(Q _tm[m，l]，Q _f[m，l]) (7)

Step 44, determined the medium duration power spectrum R[m of clean speech by the first selective switch, l], described the first selective switch is as shown in formula (8):

R [m, l] = \{\begin{matrix} R_{sp} [m, l], ifQ [m, l] &GreaterEqual; {cQ}_{le} [m, l] \\ Q_{f} [m, l], ifQ [m, l] < {cQ}_{le} [m, l] \end{matrix} - - - (8)

Wherein c is adjustable parameter.

5. the Speech Feature Extraction for robust speech identification according to claim 4, is characterized in that, the operating process of described masking block comprises following content:

The described Q obtaining through half-wave rectification block _o[m, l] obtains Q through the MAX module of masking block _p[m, l], as shown in formula (9):

Q _p[m，l]＝max(λ _tQ _p[m-1，l]，Q _o[m，l]) (9)

Wherein λ _tbe Forgetting coefficient, span is (0～1), the output valve Q that masking block is last _tm[m, l] determined by the second selective switch, and described the second selective switch is as shown in formula (10):

Q_{tm} [m, l] = \{\begin{matrix} Q_{o} [m, l] & Q_{o} [m, l] &GreaterEqual; λ_{t} Q_{p} [m - 1, l] \\ μ_{t} Q_{p} [m - 1, l] & Q_{o} [m, l] < λ_{t} Q_{p} [m - 1, l] \end{matrix} - - - (10)

Wherein μ _tfor corresponding parameter, span is (0～1), the output valve Q of masking block _tmafter the selective switch that [m, l] describes through formula (8), final, the medium duration power spectrum of noisy speech is carried out to asymmetric filtering and the result of sheltering after anti-noise is R[m, l].

6. the Speech Feature Extraction for robust speech identification according to claim 1, is characterized in that, the short-time rating spectrum of obtaining clean speech in described step 5 comprises following content:

Step 51, calculates the ratio H[m of the medium duration power spectrum of obtained clean speech and the medium duration power spectrum of noisy speech, l], as shown in formula (11):

H [m, l] = \frac{R [m, l]}{Q [m, l]} - - - (11);

Wherein said R[m, l] be the medium duration power spectrum of clean speech, Q[m, l] be the medium duration power spectrum of noisy speech;

Step 52, carries out passage average, to obtain passage average weight H _s[m, l], as shown in formula (12):

H_{s} [m, l] = (\frac{1}{l_{2} - l_{1} + 1} Σ_{i^{'} = l_{1}}^{l_{2}} H [m, l]) - - - (12)

Wherein l ₂=min (l+N, L), l ₁=max (l-N, 1), L represents the number of filter channel, N represents to ask for passage average time, the total number of channels of looking around forward and backward;

Step 53, utilizes described passage average weight H _sthe short-time rating spectrum P[m of [m, l] modulation noisy speech, l], and obtain the short-time rating spectrum T[m of clean speech, l], as shown in formula (13):

T[m，l]＝P[m，l]H _s[m，l] (13)。

7. the Speech Feature Extraction for robust speech identification according to claim 1, is characterized in that, the short-time rating spectrum to clean speech in described step 6 is carried out energy normalized processing, as shown in formula (15):

U [m, l] = k \frac{T [m, l]}{μ [m]} - - - (15)

Wherein k is arbitrary constant, T[m, l] be the short-time rating spectrum of clean speech, μ [m] is as shown in formula (14):

μ [m] = λ_{μ} μ [m - 1] + \frac{1 - λ_{μ}}{L} Σ_{l = 0}^{L - 1} T [m, l] - - - (14)

Wherein L represents the number of filter channel, λ _μrepresent Forgetting coefficient, span is (0～1).

8. the Speech Feature Extraction for robust speech identification according to claim 1, is characterized in that, in described step 7, the described spectrum of the short-time rating to clean speech the loudness such as is carried out and increased the weight of as shown in formula (17):

O[m，l]＝U[m，l]·E(w _l) (17)

Wherein m, l is respectively the index of frame and passage, U[m, l] for voice are through anti-noise short-time rating spectrum after treatment, E (w _l) as shown in formula (16):

E (w_{l}) = \frac{(w_{l}^{2} + 1.44 \times 10^{6}) w_{l}^{4}}{{(w_{l}^{2} + 1.6 \times 10^{5})}^{2} \times (w_{l}^{2} + 9.61 \times 10^{6})} - - - (16)

Wherein w represents frequency, w _lit is the frequency of l passage.

9. the Speech Feature Extraction for robust speech identification according to claim 1, it is characterized in that, in described step 8, the short-time rating spectrum of the clean speech after described reciprocity loudness increases the weight of is carried out index nonlinear operation as shown in formula (18):

L[m，l]＝O[m，l] ^θ (18)

Wherein θ is the nonlinear parameter of index, O[m, l] be the short-time rating spectrum of the clean speech of the loudness such as described after increasing the weight of.