CN107103913B - Speech recognition method based on power spectrum Gabor characteristic sequence recursion model - Google Patents
- Publication number: CN107103913B (application CN201710292486.4A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- feature
- sequence
- speech
- voice
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
Description
Technical Field
The invention belongs to the technical field of speech recognition and relates to a speech recognition method for complex acoustic backgrounds, in particular to a speech recognition method based on a recursive model of power spectrum Gabor feature sequences.
Background Art
As the most natural and convenient way to communicate, speech has long been one of the most important research topics in human-computer communication and interaction, and automatic speech recognition (ASR) is a particularly critical technology for realizing human-computer interaction. After years of research, ASR has entered everyday life; voice transcription, automatic translation and phone assistants are typical examples. However, most of these systems depend on the acoustic environment in which they operate and are not very robust.
Existing speech recognition involves two stages: speech feature extraction and classifier design. Traditional feature extraction methods such as Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) struggle to extract effective features in complex environments, and traditional classification algorithms such as dynamic time warping (DTW) distance and support vector machines (SVM) struggle to achieve ideal recognition results. A key problem in speech recognition matching is that two utterances of the same word by the same speaker are never identical, and the feature dimensions extracted from the two utterances also differ, so traditional recognition methods find it hard to achieve good results. Neural networks, in turn, require large amounts of labeled training data and tend to overfit on small samples, making it difficult for them to provide an effective solution.
Summary of the Invention
In view of the problems in the prior art described above, the purpose of the present invention is to provide a speech recognition method based on a recursive model of power spectrum Gabor feature sequences. The speech signal is preprocessed, a speech feature sequence is obtained through feature extraction, and the sequence is then converted into a recurrence plot for similarity detection. This effectively addresses the problem that current automatic speech recognition systems achieve unsatisfactory recognition rates and degrade easily under complex conditions such as non-stationary noise and low signal-to-noise ratio, thereby improving the robustness of the recognition algorithm.
To accomplish the above task, the present invention adopts the following technical solution:
A speech recognition method based on a power spectrum Gabor feature sequence recursive model, comprising the following steps:
Step 1: preprocessing of the speech signal
Perform endpoint detection on the acquired speech signal, separate and remove the noise in the signal, obtain its effective part, and compute its power spectrum.
Step 2: extract the power spectrum Gabor feature sequence
Step 2.1: from the power spectrum, extract the power-normalized spectral features of the effective part.
Step 2.2: arrange the power-normalized spectral features into a sequence C in frame-index order, then apply first-order and second-order differencing to obtain the Delta and Double Delta spectral features, respectively.
Step 2.3: combine the power-normalized spectral features, Delta features and Double Delta features into a power-normalized spectral feature set; then apply spectro-temporal Gabor filtering, representing the temporal modulation filter as a row vector convolved independently with each channel of the feature set, and the spectral modulation filter as a column vector convolved independently with each frame of the feature set.
Step 2.4: perform histogram equalization on the power-normalized spectral feature set, then project the high-dimensional features onto a low-dimensional space via PCA to obtain the power spectrum Gabor feature set, and assemble it into the speech feature sequence X.
Step 3: construct the recurrence plot of the speech feature sequence
Compute the recurrence plot r of the speech feature sequence X according to the following formula:
r = θ(ε − ||x(i_k) − x(i_m)||),  i_k, i_m = 1…n
where n is the number of states the speech feature sequence passes through, k and m are frame indices, x(i_k) and x(i_m) are the values of the speech feature sequence observed at sequence positions i_k and i_m, ||·|| denotes the Euclidean norm, ε denotes the critical distance with ε < 1, and θ denotes the Heaviside function, defined as θ(z) = 1 for z ≥ 0 and θ(z) = 0 for z < 0.
Step 4: speech signal similarity detection
Compute the distance between the recurrence plot r and the recurrence plot r_i of each class of signal in the template library; the speech signal in the template library corresponding to the smallest distance is the recognition result.
The template library means that, before recognition, standard speech signals of each class are collected and processed according to steps 1 to 3 to obtain their recurrence plots r_i, which are stored in the template library.
Further, the specific procedure of step 1 comprises:
Step 1.1: perform endpoint detection on the acquired speech signal using short-time energy or zero-crossing rate, separating the effective part of the signal from the noise.
Step 1.2: complete the pre-emphasis of the effective part with a high-pass filter.
Step 1.3: apply windowing and framing to the pre-emphasized effective part.
Step 1.4: perform a fast Fourier transform to obtain the energy distribution over the spectrum, and take the squared magnitude of the spectrum to obtain the power spectrum.
Further, in step 2.3, the convolution uses the following formula:
output(k, n) = Σ_i Σ_j PNSpec(k − i, n − j) · filterfunction(i, j)
where k and n are the spectral and temporal indices of the power-normalized spectral features, i and j are the offsets relative to the spectral and temporal centers, PNSpec(k, n) is the input speech feature set, and filterfunction(i, j) is the spectro-temporal Gabor filter function.
Further, the distance in step 4 is computed with the following formula:
d_mpeg(r_i, r_j) = (C(r_i|r_j) + C(r_j|r_i)) / (C(r_i|r_i) + C(r_j|r_j)) − 1
where C(r_i|r_j) is the compressed size obtained by first training the MPEG-1 compressor on image r_j and then compressing image r_i, giving the minimal approximation of r_i after the redundant information it shares with r_j has been removed.
Compared with the prior art, the present invention has the following technical features:
1. In feature extraction, relative to traditional static spectral features such as MFCC and PLP, the algorithm introduces dynamic Delta spectral features on top of the PNS spectrum, fuses the static and dynamic spectral features, and obtains more robust speech features through spectro-temporal Gabor filtering, overcoming the distortion of speech signals under complex conditions such as strong noise interference.
2. In recognition, the algorithm focuses on the speech feature sequence, compressing it into a recurrence plot and then using the MPEG-1 compression algorithm to compute the CK-1 distance between recurrence plots, so that the robust matching of speech feature sequences of different lengths under complex backgrounds is solved effectively.
Brief Description of the Drawings
Figure 1 shows the original speech signal (abscissa: time; ordinate: amplitude);
Figure 2 shows the effective part of the speech signal after windowing and framing (abscissa: time; ordinate: amplitude);
Figure 3 shows the PNSpec power spectrogram of the speech signal (abscissa: frame index; ordinate: amplitude);
Figure 4 is a comparison of Delta dynamic spectra;
Figure 5 shows the recognition performance of the dynamic spectra;
Figure 6 is the overall block diagram of speech feature extraction;
Figure 7 shows recurrence plots of speech feature sequences, where (a), (b) and (c) are recurrence plots of three different speech feature sequences, and (b) and (d) are recurrence plots of feature sequences generated from the same speech signal in different environments;
Figure 8 is a schematic diagram of speech similarity detection based on the CK distance;
Figure 9 is a schematic diagram of the speech signal similarity detection in step 4;
Figure 10 is the overall flow chart of the method of the present invention.
Detailed Description of the Embodiments
Following the above technical solution, as shown in Figures 1 to 10, the present invention discloses a speech recognition method based on a recursive model of power spectrum Gabor feature sequences. The detailed steps are as follows:
Step 1: preprocessing of the speech signal
Step 1.1: perform endpoint detection on the acquired speech signal using indicators such as short-time energy or zero-crossing rate, separating and removing the noise to obtain the effective part of the signal.
Step 1.2: pass the effective part obtained in step 1.1 through a high-pass filter to complete pre-emphasis. Pre-emphasis boosts the high-frequency components, flattening the spectrum of the signal across the whole band from low to high frequency so that the spectrum can be computed with the same signal-to-noise ratio. The formula is:
H(z) = 1 − μz⁻¹
where z represents the effective speech signal after endpoint detection, H(z) represents the pre-emphasized speech signal, and μ is a constant, typically 0.97.
Step 1.3: apply windowing and framing to the pre-emphasized effective part to ensure the stationarity of the signal. Speech is short-time stationary (over 10–30 ms it can be regarded as approximately unchanged), so it can be divided into short segments for processing; this is framing, and it is implemented by weighting the signal with a movable window of finite length.
Step 1.4: after windowing, each frame is passed through a fast Fourier transform to obtain its energy distribution over the spectrum, and the squared magnitude of the spectrum gives the power spectrum of the speech signal.
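As an illustrative sketch of steps 1.2 to 1.4 (pre-emphasis, windowed framing, per-frame power spectrum), the fragment below is one possible implementation; the frame length, hop size and the synthetic input tone are assumptions chosen for demonstration, not values specified by the invention:

```python
import numpy as np

def preprocess(signal, mu=0.97, frame_len=256, hop=128):
    """Pre-emphasis, Hamming-windowed framing, and per-frame power spectrum."""
    # Pre-emphasis: y[t] = x[t] - mu * x[t-1]  (i.e. H(z) = 1 - mu * z^-1)
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # Split into overlapping short frames (speech is quasi-stationary over 10-30 ms)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # FFT magnitude squared gives the power spectrum, one row per frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power

fs = 8000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 440 * t)   # 1 s synthetic tone standing in for speech
spec = preprocess(sig)
print(spec.shape)                    # (n_frames, frame_len // 2 + 1)
```

Endpoint detection is omitted here; a real pipeline would first trim the signal to its effective part.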
Step 2: extract the power spectrum Gabor feature sequence
Step 2.1: from the power spectrum obtained in step 1.4, extract the power-normalized spectral (PNSpec) features of the effective part, using nonlinear compression with a power-law exponent of 1/15 and a Gammatone filter bank in place of a triangular filter bank, as shown in Figure 3. This makes PNSpec more robust to the acoustic variability of speech signals under steady-state noise and similar conditions.
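The effect of the 1/15 power-law nonlinearity can be illustrated with a small sketch (the sample energies are arbitrary, and the Gammatone filter bank itself is omitted):

```python
import numpy as np

# PNCC-style power-law compression: y = x ** (1/15).
# Unlike log compression, small energies stay bounded and positive,
# one reason PNSpec features degrade gracefully under steady-state noise.
energies = np.array([1e-6, 1e-3, 1.0, 1e3])  # arbitrary filter-bank energies
power_law = energies ** (1 / 15)
log_comp = np.log(energies)
print(power_law)   # stays in a narrow positive range
print(log_comp)    # spans a huge range for the same inputs
```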
Step 2.2: arrange the power-normalized spectral features into a sequence C = (c_1, c_2, …, c_n) in frame-index order, where c_n denotes the spectral features of the n-th frame. Then apply first-order and second-order differencing to obtain the Delta and Double Delta spectral features, respectively. The difference parameters are computed as:
D[n] = C[n + m] − C[n − m]
where D[n] is the n-th first-order difference, C[n] is the n-th PNSpec coefficient, n is the index of the analysis frame, and in practice m is usually 2 or 3. Similarly, Double Delta is defined as the first-order difference of the Delta features.
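A minimal sketch of the difference computation D[n] = C[n+m] − C[n−m]; the edge frames are repeated so the formula is defined at the sequence boundaries (this padding strategy is an assumption, since the patent does not specify boundary handling):

```python
import numpy as np

def delta(C, m=2):
    """First-order difference D[n] = C[n+m] - C[n-m] with edge padding."""
    # Repeat the first/last frame m times so every index n is defined
    padded = np.pad(C, ((m, m), (0, 0)), mode="edge")
    return padded[2 * m:] - padded[:-2 * m]

C = np.arange(20, dtype=float).reshape(10, 2)  # 10 frames, 2 coefficients each
d = delta(C)    # Delta spectral features
dd = delta(d)   # Double Delta: first-order difference of the Delta features
print(d.shape, dd.shape)
```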
Step 2.3: combine the power-normalized spectral features (13-dimensional), Delta features and Double Delta features into a power-normalized spectral feature set (39-dimensional); then apply spectro-temporal Gabor filtering, representing the temporal modulation filter as a row vector convolved independently with each channel of the feature set, and the spectral modulation filter as a column vector convolved independently with each frame of the feature set. PNSpec is convolved with the row or column vector according to:
output(k, n) = Σ_i Σ_j PNSpec(k − i, n − j) · filterfunction(i, j)
where k and n are the spectral and temporal indices of PNSpec, i and j are the offsets relative to the spectral and temporal centers, PNSpec(k, n) is the input speech feature set, and filterfunction(i, j) is the spectro-temporal Gabor filter function.
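The separable row/column filtering described above can be sketched as follows; the 1-D Gabor kernel shape and its parameters are illustrative assumptions, not the filter bank specified by the invention:

```python
import numpy as np

def gabor_1d(length, omega, sigma):
    """Hypothetical 1-D Gabor kernel: Gaussian-windowed cosine (real part)."""
    t = np.arange(length) - length // 2
    return np.exp(-t**2 / (2 * sigma**2)) * np.cos(omega * t)

def filter_spectrogram(pnspec, temporal_kernel, spectral_kernel):
    """Apply a temporal (row-vector) and a spectral (column-vector) filter.

    pnspec: (channels, frames) feature set. The temporal kernel slides along
    time independently in each channel; the spectral kernel slides along
    frequency independently in each frame, as in step 2.3.
    """
    temporal = np.apply_along_axis(
        lambda row: np.convolve(row, temporal_kernel, mode="same"), 1, pnspec)
    spectral = np.apply_along_axis(
        lambda col: np.convolve(col, spectral_kernel, mode="same"), 0, pnspec)
    return temporal, spectral

pnspec = np.random.default_rng(0).random((39, 100))  # 39-dim set, 100 frames
t_out, s_out = filter_spectrogram(pnspec, gabor_1d(9, 0.5, 2.0), gabor_1d(7, 1.0, 1.5))
print(t_out.shape, s_out.shape)
```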
Step 2.4: perform histogram equalization on the power-normalized spectral feature set, then project the high-dimensional features onto a low-dimensional space via PCA to obtain the robust power spectrum Gabor (PNSG) feature set, and assemble it into the speech feature sequence X = {x(i_1), x(i_2), …, x(i_n)}, where x(i_n) denotes the speech features of the n-th frame and X denotes the feature set of the whole speech signal. This sequence contains the effective information of the speech signal, and all subsequent processing operates on it.
Figure 4 shows different feature spectrograms of one sequence, and Figure 5 compares the classification performance of these feature spectrograms, indirectly verifying the effectiveness of adding dynamic features to the algorithm.
Step 3: construct the recurrence plot of the speech feature sequence
Compute the recurrence plot r of the speech feature sequence X according to the following formula:
r = θ(ε − ||x(i_k) − x(i_m)||),  i_k, i_m = 1…n
where n is the number of states the speech feature sequence passes through, k and m are frame indices, x(i_k) and x(i_m) are the values of the speech feature sequence observed at sequence positions i_k and i_m, ||·|| denotes the Euclidean norm, ε denotes the critical distance with ε < 1, and θ denotes the Heaviside function, defined as:
θ(z) = 1 for z ≥ 0 and θ(z) = 0 for z < 0, where z corresponds to (ε − ||x(i_k) − x(i_m)||) in the recurrence plot formula.
This step applies the recurrence plot principle to convert the speech feature sequence into a recurrence plot: during the computation, if the values of the n-dimensional feature sequence at sequence positions i and j are very close (within the threshold ε), a 1 is marked at coordinate (i_k, i_m) of the recurrence matrix; otherwise a 0 is marked at that position. The recurrence plot depicts the time series as a pattern of black and white dots.
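The recurrence plot construction above can be sketched directly; the toy feature sequence and threshold are illustrative assumptions:

```python
import numpy as np

def recurrence_plot(X, eps=0.3):
    """Binary recurrence matrix r[k, m] = 1 iff ||x_k - x_m|| <= eps.

    X: (n_frames, dim) feature sequence. theta(eps - d) marks
    near-coincident states with 1 (black dot), everything else 0.
    """
    # Pairwise Euclidean distances between all frame pairs
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).astype(np.uint8)

X = np.array([[0.0], [0.1], [1.0], [0.05]])  # toy 1-D feature sequence
r = recurrence_plot(X, eps=0.2)
print(r)
```

The matrix is symmetric with an all-ones diagonal, matching the symmetry about the main diagonal visible in Figure 7.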
Figure 7 shows recurrence plots of three different speech feature sequences; the internal structure of each sequence can be observed directly, and each plot is distributed symmetrically about the main diagonal. Figures 7(b) and 7(d) are recurrence plots of feature sequences generated from the same speech signal in different environments; as the figures show, the two share a similar internal spatial structure.
Step 4: speech signal similarity detection
Compute the distance between the recurrence plot r and the recurrence plot r_i of each class of signal in the template library. In this scheme, the compression coefficient d_mpeg between speech feature sequences represents the distance between models:
d_mpeg(r_i, r_j) = (C(r_i|r_j) + C(r_j|r_i)) / (C(r_i|r_i) + C(r_j|r_j)) − 1
where r_i is the recurrence plot of the speech feature sequence of each class of signal in the template library, r_j is the recurrence plot of the speech feature sequence under test (i.e., the r obtained in step 3), and C(r_i|r_j) is the compressed size obtained by first training the MPEG-1 compressor on image r_j and then compressing image r_i, giving the minimal approximation of r_i after the redundant information it shares with r_j has been removed.
When r_i and r_j are two very similar images, computing the MPEG-1 inter-frame compression coefficient yields a small value d_mpeg, indicating that r_i and r_j are significantly similar; the similarity between the recurrence plots r_i and r_j is the part of their structure they have in common, as shown in Figure 9.
By performing this similarity computation against the recursive models of the speech signals in the template library, the distances between the recursive model of the speech under test and the recurrence plot of each class of speech signal in the template library are obtained; these distances are sorted, and the speech signal in the template library corresponding to the smallest distance is taken as the recognition result.
The template library mentioned in this step means that, before recognition, standard speech signals of each class are collected and processed according to steps 1 to 3 to obtain their recursive models r_i, which are stored in a template library. In subsequent similarity detection, the recursive model of the speech signal under test is compared with the recursive model of each standard speech signal in the template library; the smaller the distance between the two, the higher their similarity, and the speech under test is identified as the template with the highest similarity, which is output as the recognition result.
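The template-matching step can be sketched as follows. The patent's distance relies on MPEG-1 inter-frame compression; here zlib applied to concatenated bitmaps is a crude stand-in in the spirit of the CK-1 idea, and the helper names and the zlib substitution are illustrative assumptions, not the claimed method:

```python
import zlib
import numpy as np

def compressed_size(a):
    """Size of a bitmap after generic (zlib) compression."""
    return len(zlib.compress(a.tobytes(), 9))

def ck1_like(ri, rj):
    """Compression-based dissimilarity in the spirit of CK-1.

    C(ri|rj) is approximated by how many extra bytes ri adds when
    compressed after rj: shared structure compresses away.
    """
    c_i, c_j = compressed_size(ri), compressed_size(rj)
    c_ij = len(zlib.compress(np.concatenate([rj, ri]).tobytes(), 9)) - c_j
    c_ji = len(zlib.compress(np.concatenate([ri, rj]).tobytes(), 9)) - c_i
    return (c_ij + c_ji) / (c_i + c_j)

def recognize(r, templates):
    """Return the label of the template recurrence plot closest to r."""
    return min(templates, key=lambda label: ck1_like(r, templates[label]))

rng = np.random.default_rng(1)
a = (rng.random((64, 64)) < 0.2).astype(np.uint8)   # query recurrence plot
b = a.copy(); b[:4] ^= 1                            # near-duplicate template
c = (rng.random((64, 64)) < 0.2).astype(np.uint8)   # unrelated template
print(recognize(a, {"same": b, "other": c}))
```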
Figure 8 shows the discriminant analysis process of the algorithm: the acquired speech data is preprocessed, the corresponding feature sequence samples are extracted, and the corresponding recurrence plot ④ is obtained. By comparison with the recurrence plots of the speech sequences in the training set, the training sample ② with the smallest distance to ④ is found, giving the final recognition result.
To verify the effectiveness of the method, experimental validation was carried out on public corpora:
The experiments mix six kinds of non-stationary noise from the NOISEX-92 corpus with the TIMIT dataset to construct the experimental data. The noise types are factory noise (Factory), babble noise (Babble), tank interior noise (Tank), car noise (Volvo), F-16 cockpit noise (F16) and machine-gun noise (Machinegun).
To create the training set, 400 sentences from the TIMIT dataset were mixed in turn with the first half of each of the six non-stationary noises at signal-to-noise ratios from −15 dB to 3 dB in steps of 3 dB, yielding mixed speech signals at different signal-to-noise ratios.
For the test set, 100 sentences were mixed with the second half of the six non-stationary noises. Using different portions of the non-stationary noise ensures that the noise segments used in the test set differ from those in the training set, better highlighting the robustness of the system.
The average recognition accuracy of the proposed algorithm reaches 90.36%, which is 25.49%, 20.36% and 3.9% higher than MFCC, PNCC and SGBFB respectively, showing that the algorithm maintains high recognition accuracy and good performance even in complex environments with low signal-to-noise ratio and strong noise interference.
Claims (2)
Priority Applications (1)
- CN201710292486.4A (CN107103913B): priority date 2017-04-28, filing date 2017-04-28, title: Speech recognition method based on power spectrum Gabor characteristic sequence recursion model
Publications (2)
- CN107103913A, published 2017-08-29
- CN107103913B, granted 2020-02-04
Family
- ID=59657802
Families Citing this family (4)
- CN109785825B (priority 2018-12-29, published 2021-07-30): A speech recognition algorithm, storage medium, and electrical appliances using the same
- CN112542167B (priority 2020-12-02, published 2021-10-22): Non-contact voice question-answering method and system
- CN113940638B (priority 2021-10-22, published 2023-09-19): Pulse wave signal identification and classification method based on frequency domain dual-feature fusion
- CN114759991B (priority 2022-03-28, published 2023-09-22): Cyclostationary signal detection and modulation identification method based on visibility graph
Citations (5)
- CN102567708A (2010-12-27): Method and device for extracting biological feature, biological identification method and system
- EP3105865A1 (2014-02-10): Communications systems, methods and devices having improved noise immunity
- CN104637497A (2015-01-16): Speech spectrum characteristic extracting method facing speech emotion identification
- CN104835507A (2015-03-30): Serial-parallel combined multi-mode emotion information fusion and identification method
- CN105047194A (2015-07-28): Self-learning spectrogram feature extraction method for speech emotion recognition
Also Published As
- CN107103913A, published 2017-08-29
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant
- TR01: Transfer of patent right
Effective date of registration: 2021-11-18
Address after: 12/F, Block A, Gaoke Shangdu One Shangcheng, No. 43 Zhangbawu Road, Zhangbajie Office, High-tech Zone, Xi'an City, Shaanxi Province, 710077
Patentee after: XI'AN MITE ELECTRONIC TECHNOLOGY CO., LTD.
Address before: No. 229 Taibai North Road, Xi'an, Shaanxi, 710069
Patentee before: NORTHWEST UNIVERSITY