
CN112735456A - Speech enhancement method based on DNN-CLSTM network - Google Patents

Speech enhancement method based on DNN-CLSTM network

Info

Publication number
CN112735456A
Authority
CN
China
Prior art keywords
speech
network
amplitude
speech signal
mfcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011323987.2A
Other languages
Chinese (zh)
Other versions
CN112735456B (en)
Inventor
汪友明
张天琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications filed Critical Xian University of Posts and Telecommunications
Priority to CN202011323987.2A priority Critical patent/CN112735456B/en
Publication of CN112735456A publication Critical patent/CN112735456A/en
Application granted granted Critical
Publication of CN112735456B publication Critical patent/CN112735456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention is a speech enhancement method based on a deep neural network combined with a residual long short-term memory network (DNN-CLSTM). The method feeds the speech amplitude feature obtained by spectral subtraction and the Mel-frequency cepstral coefficient (MFCC) feature obtained via the fast Fourier transform into the DNN-CLSTM network model to achieve speech enhancement. First, time-frequency masking, windowing, and framing are applied to the noisy speech; the fast Fourier transform yields the amplitude and phase features of the noisy speech, and the noise amplitude of the noisy speech is estimated. The estimated noise amplitude is then subtracted from the noisy speech amplitude, and the resulting spectrally subtracted speech amplitude serves as the first feature input to the neural network. Second, a fast Fourier transform (FFT) is applied to the noisy speech to compute the spectral line energy, from which the MFCC feature of the noisy speech is obtained as the second feature of the speech signal. The two features are fed into the DNN-CLSTM network for training to obtain the network model, and a minimum mean square error (MMSE) loss is used to evaluate the model's effectiveness. Finally, the actual noisy speech set is fed into the trained speech enhancement network model to predict the enhanced estimated amplitude and MFCC, and the inverse Fourier transform yields the final enhanced speech signal. The invention preserves speech with high fidelity.

Description

A speech enhancement method based on a DNN-CLSTM network

Technical Field

The invention belongs to the technical field of speech enhancement, and in particular relates to a speech enhancement method based on a DNN-CLSTM network.

Background Art

As one of the main means of information transmission, speech is used extensively in everyday life. With the development of technology, speech not only conveys information between people but is also widely used in human-computer interaction. In practice, however, speech signals are often accompanied by a great deal of noise, such as factory noise, car noise, or the background babble of a restaurant. A speech signal containing heavy noise interferes substantially with the receiver's ability to extract the useful information it carries. Speech signal enhancement technology has therefore received extensive attention.

Speech enhancement refers to the process of separating noise from the speech signal when real-world speech is corrupted by noise. Speech enhancement is now widely used in fields such as mobile communication and speech recognition. Its main purpose is to improve speech quality and speech intelligibility. Current speech enhancement methods fall mainly into three categories: spectral subtraction, subspace algorithms, and algorithms based on statistical models. With the development of deep learning, neural networks have also been applied to speech enhancement.

Spectral subtraction, shown in Figure 1, is one of the earliest denoising techniques in speech enhancement. It is based on the following principle: the noise is assumed to be additive, i.e., y(m) = x(m) + n(m), where y(m) is the noisy signal, x(m) is the clean speech signal, and n(m) is the additive noise. The clean speech signal is recovered by subtracting an estimate of the noise spectrum from the noisy speech signal. The premise of this assumption is that the noise is stationary, so that the noise estimate can be computed and updated during speech segments in which the target signal is absent.

Spectral subtraction is a relatively simple speech enhancement algorithm. Its principle is to subtract the estimated noise magnitude spectrum from the magnitude spectrum of the input mixed speech signal and, exploiting the human ear's insensitivity to phase, reuse the phase of the signal before subtraction to synthesize the final spectrally subtracted speech signal. Because spectral subtraction involves only one Fourier transform and one inverse Fourier transform, it is computationally cheap and easy to implement. In practice, however, many noises are non-stationary, so speech enhanced by spectral subtraction often contains a large amount of musical noise, which distorts the speech signal and degrades its intelligibility and quality.
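For orientation, the flow in Figure 1 can be condensed into the following NumPy/SciPy sketch, which estimates the noise spectrum from a few leading noise-only frames, subtracts it from the noisy magnitude, and resynthesizes the signal with the noisy phase; the frame length, hop, and noise-frame count are illustrative choices, not values prescribed by this document.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs=16000, frame_len=320, hop=160, noise_frames=5):
    """Minimal magnitude-spectrum subtraction: estimate the noise spectrum from
    the first few frames, subtract it, and rebuild the signal with the noisy phase."""
    # Short-time Fourier transform (Hamming window)
    _, _, Y = stft(noisy, fs=fs, window='hamming',
                   nperseg=frame_len, noverlap=frame_len - hop)
    mag, phase = np.abs(Y), np.angle(Y)

    # Average magnitude of the leading noise-only frames
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract and half-wave rectify (no negative magnitudes)
    enhanced_mag = np.maximum(mag - noise_mag, 0.0)

    # Reuse the noisy phase and invert the STFT
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs, window='hamming',
                        nperseg=frame_len, noverlap=frame_len - hop)
    return enhanced
```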

Summary of the Invention

The purpose of the present invention is to solve the problems of speech-signal distortion and poor intelligibility and speech quality that arise in spectral-subtraction-based speech enhancement. To this end, the present invention provides a speech enhancement method based on a DNN-CLSTM network, characterized by comprising the following steps:

Step 1: Acquire at least two noisy speech signals. A noisy speech signal is formed by adding a clean speech signal and a noise signal:

y(m) = x(m) + n(m)

where y(m) is the noisy speech signal, x(m) is the clean speech signal, n(m) is the noise signal, and m is the discrete time index;

Step 2: Frame and window the signal, and obtain the amplitude and phase of the clean and noisy speech signals as the first feature. The noisy speech signal is windowed and framed, and the discrete Fourier transform gives its amplitude and phase. Meanwhile, in speech segments that contain only noise and no target signal, the noise is estimated to obtain the noise amplitude;

Step 3: Subtract the noise amplitude from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech amplitude as the second feature;

Step 4: Compute the MFCC as the third feature;

Step 5: Build and train the DNN-CLSTM network model;

The spectrally subtracted speech amplitude and the MFCC are fed into the DNN-CLSTM network for training to obtain the enhanced estimated amplitude and MFCC. The minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize the network, yielding the trained network.

The specific process of Step 4 is:

(1) Preprocessing: preprocessing includes pre-emphasis, framing, and windowing;

Pre-emphasis: implemented with a first-order high-pass filter whose transfer function is:

H(z) = 1 - a·z^(-1)

where a is the pre-emphasis coefficient, typically 0.98;

The result of pre-emphasizing the speech signal x(n) is:

y(n) = x(n) - a·x(n-1)

Framing and windowing: adjacent frames overlap; the overlap, the frame shift, is set to 10 ms. Windowing: each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) becomes y_i(n), defined as:

y_i(n) = ω(n)·y((i-1)·inc + n),  0 ≤ n ≤ L-1

where inc is the frame shift in samples and ω(n) is the Hamming window, given by

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1

where y_i(n) is the i-th frame of the speech signal, n is the sample index, and L is the frame length;

(2) Fast Fourier transform (FFT)

A fast Fourier transform is applied to each frame of the speech signal y_i(n) to obtain the spectrum of each frame:

Y(i,k) = FFT[y_i(n)]

where k denotes the k-th spectral line in the frequency domain;

(3) Compute the spectral line energy

The energy E(i,k) of the spectral line of each speech frame in the frequency domain is:

E(i,k) = [Y(i,k)]²

(4) Compute the energy passing through the Mel filters

The energy S(i,m) obtained by passing each frame's spectral line energy through the Mel filterbank is defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M

where N is the number of FFT points;

The transfer function H_m(k) of each filter is

H_m(k) = 0                                 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))    for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))    for f(m) < k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)

where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters;

(5) Compute the MFCC

Take the logarithm of the Mel-filter energies and apply the discrete cosine transform to obtain the MFCC feature parameters:

mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))

where j is the spectral line index after the DCT.
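As a reading aid, steps (1) through (5) can be chained as in the following NumPy sketch, using the stated settings (pre-emphasis coefficient 0.98, Hamming window, 24 Mel filters) plus illustrative 20 ms frames with a 10 ms shift; it is a generic MFCC recipe for illustration, not the patented implementation, and the mel_filterbank helper is an assumed name.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular Mel filters H_m(k) spaced evenly on the Mel scale."""
    mel = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_filters + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mfcc(x, fs=16000, frame_len=320, hop=160, n_filters=24, n_ceps=13, a=0.98):
    # (1) Pre-emphasis: y(n) = x(n) - a*x(n-1)
    y = np.append(x[0], x[1:] - a * x[:-1])
    # (1) Framing (20 ms frames, 10 ms shift) and Hamming windowing
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)
    # (2) FFT and (3) spectral line energy E(i,k)
    E = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # (4) Energy through the Mel filterbank S(i,m)
    S = E @ mel_filterbank(n_filters, frame_len, fs).T
    # (5) Logarithm and DCT give the MFCCs
    return dct(np.log(S + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```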

The specific process of Step 5 is:

(1) DNN network construction

Input layer: the spectrally subtracted speech amplitude and the MFCC feature are taken as input to the DNN network; the input layer has 128 neurons;

Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;

(2) Multi-target feature fusion:

The amplitude and MFCC features enhanced by the DNN network are combined with the amplitude and MFCC features of the original noisy speech;

(feature-fusion formula, shown as an image in the original)

where the first pair of symbols (shown as images in the original) denotes the DNN-predicted MFCC feature and speech amplitude in the k-th feature space, and the second pair denotes the MFCC feature and speech amplitude of the original noisy speech in the k-th feature space;

(3) C-LSTM network:

(a) CNN:

Convolutional layer: convolves the output of the DNN network; 64 nodes, stride 1, 5×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 64 nodes, stride 1, 3×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 128 nodes, stride 1, 5×1 kernel;

(b) Residual network

The output of the DNN network is convolved with 128 nodes, stride 1, and a 5×1 kernel;

The data from the residual branch is combined with the data from the CNN branch, and the SELU activation function is applied;

Max pooling layer: stride 1, pooling size 2;

(c) LSTM network:

Both directions of the bidirectional long short-term memory network use 128 nodes, with the sigmoid activation function;

(4) Output layer:

Two feedforward neural networks are used as output layers to produce the predicted speech amplitude and MFCC; the network parameters are optimized with the Adam optimizer, and all convolutional layers use edge padding. A code sketch of this topology is given after item (5) below.

(5) Compute the minimum mean square error objective function

(MMSE objective-function formula, shown as an image in the original)

where T = 2; the first pair of symbols (shown as images in the original) denotes the predicted MFCC feature vector and predicted amplitude feature in the k-th acoustic feature space, and the second pair denotes the clean MFCC feature vector and clean amplitude feature in the k-th acoustic feature space.
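To make the layer listing in items (1) through (5) concrete, the following is a rough PyTorch sketch of the described topology together with an equal-weight MMSE loss over the two feature spaces. It is an illustrative interpretation only: the input dimension, the way the fused feature stream is shaped into the 1-D convolutions, the channel pooling before the LSTM, and the equal loss weighting are assumptions not spelled out in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNNCLSTM(nn.Module):
    """Rough sketch of the described DNN + CNN/residual + BiLSTM topology."""
    def __init__(self, in_dim=128, mag_dim=128, mfcc_dim=24):
        super().__init__()
        # (1) DNN part: fully connected layers, dropout 0.5, ReLU
        self.dnn = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 512), nn.ReLU(), nn.Dropout(0.5),
        )
        # (3a) CNN branch: 5x1 / 3x1 / 5x1 kernels, stride 1, same padding
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2), nn.SELU(), nn.BatchNorm1d(64),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.SELU(), nn.BatchNorm1d(64),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
        )
        # (3b) Residual branch: one 5x1 convolution to 128 channels
        self.res = nn.Conv1d(1, 128, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1)
        # (3c) Bidirectional LSTM, 128 units per direction
        self.lstm = nn.LSTM(input_size=128, hidden_size=128,
                            batch_first=True, bidirectional=True)
        # (4) Two feed-forward output heads: amplitude and MFCC
        self.mag_head = nn.Linear(2 * 128, mag_dim)
        self.mfcc_head = nn.Linear(2 * 128, mfcc_dim)

    def forward(self, x):                      # x: (batch, frames, in_dim) fused features
        b, t, _ = x.shape
        h = self.dnn(x)                        # (batch, frames, 512)
        h = h.reshape(b * t, 1, -1)            # each frame as a 1-channel sequence
        h = F.selu(self.cnn(h) + self.res(h))  # combine CNN and residual branches
        h = self.pool(h).mean(dim=2)           # (batch*frames, 128)
        h, _ = self.lstm(h.reshape(b, t, 128)) # (batch, frames, 256)
        return self.mag_head(h), self.mfcc_head(h)

def multi_target_mmse(pred_mag, pred_mfcc, clean_mag, clean_mfcc):
    """(5) Equal-weight MSE over the T = 2 feature spaces (amplitude and MFCC)."""
    return 0.5 * (F.mse_loss(pred_mag, clean_mag) + F.mse_loss(pred_mfcc, clean_mfcc))
```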

The advantages of the present invention are as follows: the enhanced speech signal is smooth and has high fidelity and good speech quality.

The present invention is described in detail below with reference to the accompanying drawings and embodiments.

Brief Description of the Drawings

Figure 1 is a flow chart of conventional spectral subtraction.

Figure 2 is a flow chart of the DNN-CLSTM-based speech enhancement method in the training phase.

Figure 3 is a flow chart of the DNN-CLSTM-based speech enhancement method in the testing phase.

Figure 4 is a flow chart of the MFCC computation.

Figure 5 is an architecture diagram of the process of building the DNN-CLSTM neural network.

Figure 6 is the spectrogram of clean speech.

Figure 7 is the spectrogram of noisy speech.

Figure 8 is the spectrogram after processing with the DNN speech enhancement method.

Figure 9 is the spectrogram after processing with the CNN speech enhancement method.

Figure 10 is the spectrogram after processing with the LSTM speech enhancement method.

Figure 11 is the spectrogram after processing with the GRU speech enhancement method.

Figure 12 is the spectrogram after processing with the DNN-CLSTM speech enhancement method.

Detailed Description of the Embodiments

To overcome the defect that speech enhanced by spectral subtraction often contains a large amount of musical noise, which distorts the speech signal and degrades its intelligibility and quality, this embodiment provides a speech enhancement method based on a DNN-CLSTM network (as shown in Figures 2 and 3), comprising the following steps:

Acquire two noisy speech signals (more than two noisy speech signals may also be acquired, as actual needs dictate); a noisy speech signal consists of a clean speech signal and a noise signal:

y(m) = x(m) + n(m)

where y(m) is the noisy speech signal, x(m) is the clean speech signal, and n(m) is the noise signal.

Training phase:

1. Framing and windowing

Obtain the amplitude and phase of the clean speech signal and the noisy speech signal; taking the processing of the noisy speech as an example:

The noisy speech signal is windowed and framed, and the discrete Fourier transform gives the amplitude and phase of the noisy speech signal as the first feature;

The noisy speech signal is framed and windowed with a Hamming window:

y_w(m) = w(m)·y(m) = w(m)·[x(m) + n(m)] = x_w(m) + n_w(m)

The windowing operation is expressed in the frequency domain as:

Y_w(f) = W(f) * Y(f) = X_w(f) + N_w(f)

The signal is assumed to have been windowed; for simplicity, the subscript w is omitted below.

Y_w(f) is expressed in polar form:

Y(f) = |Y(f)|·e^(jφ_y(f))

where |Y(f)| is the amplitude spectrum and φ_y(f) is the phase signal Phase[Y(f)].

2. Noise estimation:

In speech segments that contain no target signal and only noise, the noise is estimated to obtain the noise amplitude; the present invention takes the first five frames of the speech signal as the noise segment. The magnitude spectrum |N(f)| of the noise is unknown, but it can be replaced by an estimate of the average magnitude spectrum during periods without speech activity, and the noise phase φ_n(f) can be replaced by the phase φ_y(f) of the noisy speech. When no speech is present and only noise exists, the average noise magnitude spectrum |N̄(f)| is obtained as follows:

|N̄(f)| = (1/k)·Σ_{i=0}^{k-1} |N_i(f)|

where |N_i(f)| is the spectrum of the i-th noise frame and k is the number of frames in the noise-only period.

3. Spectral subtraction

The noise amplitude is subtracted from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech amplitude as the second feature (the enhanced speech signal obtained by spectral subtraction is referred to as the spectrally subtracted speech signal, and its amplitude as the spectrally subtracted speech amplitude, to distinguish it from the enhanced speech signal finally obtained in this embodiment);

The amplitude of the noise signal is subtracted from the amplitude of the noisy speech signal as follows:

|X̂(f)|^b = |Y(f)|^b - α·|N̄(f)|^b

where |X̂(f)| represents the amplitude of the spectrally subtracted speech signal, |Y(f)|^b represents the amplitude of the noisy speech signal, and |N̄(f)| represents the average of the noise statistics over the noise segment. α is the spectral subtraction noise coefficient. b is the power exponent: b = 1 gives amplitude-spectrum subtraction and b = 2 gives power-spectrum subtraction.

Because the estimate of the noise signal may contain errors, the estimated amplitude spectrum |X̂(f)|^b may become negative. An amplitude spectrum should not normally take negative values, so to avoid negative values of |X̂(f)|^b, half-wave rectification is applied to the difference spectrum as follows:

|X̂(f)|^b = |Y(f)|^b - α·|N̄(f)|^b  if |Y(f)|^b - α·|N̄(f)|^b > 0, and 0 otherwise

After spectral subtraction, the power of the spectrally subtracted signal must be reduced to obtain the speech amplitude after the spectral subtraction stage, i.e. the spectrally subtracted speech amplitude |X̂(f)|, computed as follows:

|X̂(f)| = (|X̂(f)|^b)^(1/b)
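These three steps (subtraction with the coefficient α and exponent b, half-wave rectification, and the 1/b power reduction) can be written compactly as in the sketch below; α and b are tunable parameters, and the values α = 2 and b = 2 (power-spectrum subtraction) shown as defaults are illustrative only.

```python
import numpy as np

def over_subtract(noisy_mag, noise_mag, alpha=2.0, b=2.0):
    """Spectral subtraction with over-subtraction coefficient alpha and power exponent b,
    followed by half-wave rectification and the 1/b power reduction."""
    diff = noisy_mag ** b - alpha * noise_mag ** b   # |Y(f)|^b - alpha*|N(f)|^b
    diff = np.maximum(diff, 0.0)                     # half-wave rectification
    return diff ** (1.0 / b)                         # back to an amplitude |X(f)|
```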

4. Extract the MFCC as the third feature;

(1) Preprocessing

Preprocessing includes pre-emphasis, framing, and windowing. Pre-emphasis is implemented with a first-order high-pass filter whose transfer function is:

H(z) = 1 - a·z^(-1)

where a is the pre-emphasis coefficient, typically 0.98.

The result of pre-emphasizing the speech signal x(n) is:

y(n) = x(n) - a·x(n-1)

Framing: the speech production process shows that the speech signal is a non-stationary, time-varying signal. Over a short interval, however, it can be regarded as a stationary, time-invariant signal; "short" usually means 10 to 30 ms, and 20 ms is used here. Short-time analysis is therefore used to analyze and process the speech signal, splitting it into many frames whose characteristic parameters are analyzed. So that adjacent frames transition smoothly, consecutive frames overlap; the overlap, the frame shift, is set to 10 ms. The purpose of the window function is to reduce spectral leakage in the frequency domain. Each frame of the speech signal is windowed, usually with a Hamming window, which leaks less spectral energy than a rectangular window. After framing and windowing, y(n) becomes y_i(n), defined as:

y_i(n) = ω(n)·y((i-1)·inc + n),  0 ≤ n ≤ L-1

where inc is the frame shift in samples and ω(n) is the Hamming window, given by

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1

where y_i(n) is the i-th frame of the speech signal, n is the sample index, and L is the frame length.

(2) Fast Fourier transform (FFT)

Because the characteristics of a speech signal are generally hard to see in the time domain, it is usually transformed into the frequency domain for analysis; the spectra at different frequencies represent different characteristics of the speech signal. A fast Fourier transform is therefore applied to each frame y_i(n) to obtain the spectrum of each frame:

Y(i,k) = FFT[y_i(n)]

where k denotes the k-th spectral line in the frequency domain.

(3) Compute the spectral line energy

The energy E(i,k) of the spectral line of each speech frame in the frequency domain can be expressed as:

E(i,k) = [Y(i,k)]²

(4) Compute the energy passing through the Mel filters

The energy S(i,m) obtained by passing each frame's spectral line energy through the Mel filterbank can be defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M

where N is the number of FFT points.

The transfer function H_m(k) of each filter is

H_m(k) = 0                                 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))    for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))    for f(m) < k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)

where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters, usually set to 24.

(5) Compute the MFCC; see Figure 4. Take the logarithm of the Mel-filter energies and compute the discrete cosine transform (DCT) to obtain the MFCC feature parameters:

mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))

where j is the spectral line index after the DCT.

8. Training of the DNN-CLSTM network model

The spectrally subtracted speech amplitude and the MFCC are fed into the DNN-CLSTM network for training to obtain the enhanced estimated amplitude and MFCC. The minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize the network, yielding the trained network.
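A training loop matching this paragraph might look roughly like the sketch below, assuming a model with two output heads (for example the DNNCLSTM sketch shown earlier) and a data loader yielding fused input features together with the clean amplitude and clean MFCC targets; batch size, learning rate, and epoch count are placeholder choices.

```python
import torch

def train(model, loader, epochs=50, lr=1e-3, device="cpu"):
    """Optimize the network with Adam against the two MMSE targets (amplitude, MFCC)."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for epoch in range(epochs):
        for feats, clean_mag, clean_mfcc in loader:   # fused inputs and clean targets
            feats = feats.to(device)
            clean_mag, clean_mfcc = clean_mag.to(device), clean_mfcc.to(device)
            pred_mag, pred_mfcc = model(feats)
            # Errors on both targets act as the adjustment signal for optimization
            loss = 0.5 * (mse(pred_mag, clean_mag) + mse(pred_mfcc, clean_mfcc))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```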

Testing phase:

1. The noisy speech signal is windowed and framed, and the discrete Fourier transform gives the amplitude and phase of the noisy speech signal as the first feature;

2. After framing and windowing the noisy speech signal, obtain the speech amplitude after the spectral subtraction stage, i.e. the spectrally subtracted speech amplitude, as the second feature;

3. After framing and windowing the noisy speech signal, obtain the MFCC as the third feature;

4. Perform time-frequency decomposition of the noisy speech signal and extract E_MRACC(j,m) and ΔE_MRACC(j,m) as the fourth feature input to the deep neural network.

5. Feed the four features, namely the noisy speech amplitude, the spectrally subtracted speech amplitude, the MFCC, and the E_MRACC(j,m) and ΔE_MRACC(j,m) signals obtained by feature extraction, into the trained DNN-CLSTM network to obtain the enhanced estimated amplitude, MFCC, and masking threshold;

6. Combine the enhanced speech amplitude with the phase of the noisy speech signal obtained in step 1, and apply the inverse Fourier transform to obtain the final enhanced speech signal. The preliminary enhanced speech amplitude produced by the trained neural network must be converted back to a time-domain signal to obtain the final enhanced speech: the preliminary enhanced amplitude is combined with the phase φ_y(f) of the noisy speech extracted in step 1, and the inverse Fourier transform converts the result into a time-domain signal, yielding the final enhanced speech signal. The enhanced MFCC and masking threshold obtained here do not take part in waveform recovery; they are used to optimize the network during processing.

7. Network construction

The DNN-CLSTM network comprises a deep neural network (DNN), a convolutional neural network (CNN), a residual network, and a bidirectional long short-term memory network. The specific process of building the DNN-CLSTM neural network is shown in Figure 5:

(1) DNN network construction

Input layer: the spectrally subtracted speech amplitude obtained after spectral subtraction is taken as input; the input layer has 128 neurons;

Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;

(2) Multi-target feature fusion:

The amplitude and MFCC features enhanced by the DNN network are combined with the amplitude and MFCC features of the original noisy speech;

(feature-fusion formula, shown as an image in the original)

where the first pair of symbols (shown as images in the original) denotes the DNN-predicted MFCC feature and speech amplitude in the k-th feature space, and the second pair denotes the MFCC feature and speech amplitude of the original noisy speech in the k-th feature space;

(a) CNN:

Convolutional layer: convolves the output of the DNN network; 64 nodes, stride 1, 5×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 64 nodes, stride 1, 3×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 128 nodes, stride 1, 5×1 kernel;

(b) Residual network

The output of the DNN network is convolved with 128 nodes, stride 1, and a 5×1 kernel;

The data from the residual branch is combined with the data from the CNN branch, and the SELU activation function is applied;

Max pooling layer: stride 1, pooling size 2;

(3) LSTM network:

Both directions of the bidirectional long short-term memory network use 128 nodes, with the sigmoid activation function;

(4) Output layer:

Two feedforward neural networks are used as output layers to produce the enhanced speech amplitude and MFCC; the network parameters are optimized with the Adam optimizer, and all convolutional layers use edge padding.

(5) Compute the minimum mean square error objective function

(MMSE objective-function formula, shown as an image in the original)

where T = 2; the first pair of symbols (shown as images in the original) denotes the predicted MFCC feature vector and predicted amplitude feature in the k-th acoustic feature space, and the second pair denotes the clean MFCC feature vector and clean amplitude feature in the k-th acoustic feature space.

[Test Example]

The speech data used in the experiments come from the TIMIT dataset, and the noise data come from the Nonspeech noise library and the Noise-15 noise library. The TIMIT dataset contains a total of 6,300 utterances; about 80% of the utterances are used as the training set and the remaining 20% as the test set. All speech is resampled to 16 kHz. Several typical neural-network speech enhancement models are selected for comparison with the proposed method: (a) DNN, (b) CNN, (c) LSTM, and (d) GRU, which are speech enhancement algorithms based on a deep neural network, a convolutional neural network, a long short-term memory network, and a gated recurrent neural network, respectively.

All models are trained under SNR conditions of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB, and performance is evaluated at matched SNRs. To test the robustness of the speech enhancement models, performance is also evaluated under mismatched SNR conditions. PESQ and LSD are two important metrics for evaluating speech: PESQ is a subjective speech quality measure, and a higher PESQ score indicates better speech quality; LSD is the log-spectral distance, and a lower LSD score indicates better speech quality. Table 1 gives the test results compared with the other four algorithms (DNN, CNN, LSTM, GRU) under matched noise conditions, with the best-performing results marked in bold. Table 2 gives the test results compared with the other four algorithms under mismatched noise conditions, with the best-performing results marked in bold.
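For reference, PESQ and LSD scores of the kind reported in Tables 1 and 2 could be computed, for example, with the third-party pesq package and a NumPy log-spectral distance as sketched below; this is an assumed evaluation recipe, not necessarily the one used to produce the tables.

```python
import numpy as np
from pesq import pesq          # pip install pesq (ITU-T P.862 implementation)

def evaluate(clean, enhanced, fs=16000, frame_len=320, hop=160):
    """PESQ (higher is better) and log-spectral distance (lower is better).
    The clean and enhanced signals are assumed time-aligned and of equal length."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')   # wide-band PESQ at 16 kHz

    def power_spec(x):
        n = 1 + (len(x) - frame_len) // hop
        idx = np.arange(frame_len) + hop * np.arange(n)[:, None]
        return np.abs(np.fft.rfft(x[idx] * np.hamming(frame_len), axis=1)) ** 2

    S_clean = power_spec(clean) + 1e-10
    S_enh = power_spec(enhanced) + 1e-10
    lsd = np.mean(np.sqrt(np.mean((10 * np.log10(S_clean / S_enh)) ** 2, axis=1)))
    return pesq_score, lsd
```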

Table 1: Test results under matched noise conditions; the best performance is marked in bold.

(Table 1 is shown as an image in the original.)

Table 2: Test results under mismatched noise conditions; the best performance is marked in bold.

(Table 2 is shown as an image in the original.)

Claims (3)

1. A speech enhancement method based on a DNN-CLSTM network, characterized by comprising the following steps:

Step 1: Acquire a noisy speech signal. The noisy speech signal is formed by adding a clean speech signal and a noise signal:

y(m) = x(m) + n(m)

where y(m) is the noisy speech signal, x(m) is the clean speech signal, n(m) is the noise signal, and m is the discrete time index;

Step 2: Frame and window the signal, and obtain the amplitude and phase of the clean and noisy speech signals. The noisy speech signal is windowed and framed, and the discrete Fourier transform gives its amplitude and phase. The first five frames of the speech segment are used as the noise estimate to obtain the noise amplitude;

Step 3: Subtract the noise amplitude from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech amplitude as the first feature;

Step 4: Compute the MFCC of the speech signal as the second feature;

Step 5: Build and train the DNN-CLSTM network model;

The spectrally subtracted speech amplitude and the MFCC are fed into the DNN-CLSTM network for training to obtain the predicted amplitude and MFCC. The minimum mean square errors (MMSE) between the predicted amplitude and the clean amplitude and between the predicted MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize the network, yielding the trained network.

2. The speech enhancement method based on a DNN-CLSTM network according to claim 1, characterized in that the specific process of Step 4 is:

(1) Preprocessing: preprocessing includes pre-emphasis, framing, and windowing;

Pre-emphasis: implemented with a first-order high-pass filter whose transfer function is:

H(z) = 1 - a·z^(-1)

where a is the pre-emphasis coefficient, typically 0.98; the result of pre-emphasizing the speech signal x(n) is:

y(n) = x(n) - a·x(n-1)

Framing and windowing: adjacent frames overlap; the overlap, the frame shift, is set to 10 ms. Each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) becomes y_i(n), defined as:

y_i(n) = ω(n)·y((i-1)·inc + n),  0 ≤ n ≤ L-1

where inc is the frame shift in samples and ω(n) is the Hamming window, given by

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1

where y_i(n) is the i-th frame of the speech signal, n is the sample index, and L is the frame length;

(2) Fast Fourier transform (FFT)

A fast Fourier transform is applied to each frame of the speech signal y_i(n) to obtain the spectrum of each frame:

Y(i,k) = FFT[y_i(n)]

where k denotes the k-th spectral line in the frequency domain;

(3) Compute the spectral line energy

The energy E(i,k) of the spectral line of each speech frame in the frequency domain is:

E(i,k) = [Y(i,k)]²

(4) Compute the energy passing through the Mel filters

The energy S(i,m) obtained by passing each frame's spectral line energy through the Mel filterbank is defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M

where N is the number of FFT points and M is the number of filters;

The transfer function H_m(k) of each filter is

H_m(k) = 0                                 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))    for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))    for f(m) < k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)

where f(m) is the center frequency of the m-th filter and m indexes the filters;

(5) Compute the MFCC

Take the logarithm of the Mel-filter energies and apply the discrete cosine transform to obtain the MFCC feature parameters:

mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))

where j is the spectral line index after the discrete cosine transform (DCT).
3. The speech enhancement method based on a DNN-CLSTM network according to claim 1, characterized in that the specific process of Step 5 is:

(1) DNN network construction

Input layer: the spectrally subtracted speech amplitude and the MFCC feature are taken as input to the DNN network; the input layer has 128 neurons;

Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;

(2) Multi-target feature fusion:

The amplitude and MFCC features enhanced by the DNN network are combined with the amplitude and MFCC features of the original noisy speech;

(feature-fusion formula, shown as an image in the original)

where the first pair of symbols (shown as images in the original) denotes the DNN-predicted MFCC feature and speech amplitude in the k-th feature space, and the second pair denotes the MFCC feature and speech amplitude of the original noisy speech in the k-th feature space;

(3) C-LSTM network:

(a) CNN:

Convolutional layer: convolves the output of the DNN network; 64 nodes, stride 1, 5×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 64 nodes, stride 1, 3×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 128 nodes, stride 1, 5×1 kernel;

(b) Residual network

The output of the DNN network is convolved with 128 nodes, stride 1, and a 5×1 kernel;

The data from the residual branch is combined with the data from the CNN branch, and the SELU activation function is applied;

Max pooling layer: stride 1, pooling size 2;

(c) LSTM network:

Both directions of the bidirectional long short-term memory network use 128 nodes, with the sigmoid activation function;

(4) Output layer:

Two feedforward neural networks are used as output layers to produce the predicted speech amplitude and MFCC; the network parameters are optimized with the Adam optimizer, and all convolutional layers use edge padding.

(5) Compute the minimum mean square error objective function

(MMSE objective-function formula, shown as an image in the original)

where T = 2; the first pair of symbols (shown as images in the original) denotes the predicted MFCC feature vector and predicted amplitude feature in the k-th acoustic feature space, and the second pair denotes the clean MFCC feature vector and clean amplitude feature in the k-th acoustic feature space.
CN202011323987.2A 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network Active CN112735456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Publications (2)

Publication Number Publication Date
CN112735456A 2021-04-30
CN112735456B (en) 2024-01-16

Family

ID=75597716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011323987.2A Active CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Country Status (1)

Country Link
CN (1) CN112735456B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192520A (en) * 2021-07-01 2021-07-30 腾讯科技(深圳)有限公司 Audio information processing method and device, electronic equipment and storage medium
CN113269305A (en) * 2021-05-20 2021-08-17 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN114093379A (en) * 2021-12-15 2022-02-25 荣耀终端有限公司 Noise elimination method and device
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method, device, computer equipment and storage medium
CN114267368A (en) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method of audio noise reduction model, and audio noise reduction method and device
CN114283829A (en) * 2021-12-13 2022-04-05 电子科技大学 Voice enhancement method based on dynamic gate control convolution cyclic network
CN114582000A (en) * 2022-03-18 2022-06-03 南京工业大学 A fusion model of multimodal elderly emotion recognition based on facial expressions and speech in video images and its establishment method
CN114582352A (en) * 2022-02-24 2022-06-03 广州方硅信息技术有限公司 Training method of speech enhancement model, speech enhancement method, device and equipment
CN115240699A (en) * 2022-07-21 2022-10-25 电信科学技术第五研究所有限公司 Noise estimation and voice noise reduction method and system based on deep learning
CN115756376A (en) * 2022-10-21 2023-03-07 中电智恒信息科技服务有限公司 A conference volume control method, device and system based on LSTM
CN117193391A (en) * 2023-11-07 2023-12-08 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN119418712A (en) * 2025-01-07 2025-02-11 西安赛普特信息科技有限公司 A noise reduction method for real-time speech at the edge
WO2025035975A1 (en) * 2023-08-17 2025-02-20 腾讯科技(深圳)有限公司 Training method for speech enhancement network, speech enhancement method, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study
WO2020024452A1 (en) * 2018-07-31 2020-02-06 平安科技(深圳)有限公司 Deep learning-based answering method and apparatus, and readable storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024452A1 (en) * 2018-07-31 2020-02-06 平安科技(深圳)有限公司 Deep learning-based answering method and apparatus, and readable storage medium
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚远; 王秋菊; 周伟; 鲍程毅; 彭磊: "Research on speech enhancement combining improved spectral subtraction with a neural network", Electronic Measurement Technology, no. 07

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN113611323B (en) * 2021-05-07 2024-02-20 北京至芯开源科技有限责任公司 Voice enhancement method and system based on double-channel convolution attention network
CN113269305A (en) * 2021-05-20 2021-08-17 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113269305B (en) * 2021-05-20 2024-05-03 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113192520A (en) * 2021-07-01 2021-07-30 腾讯科技(深圳)有限公司 Audio information processing method and device, electronic equipment and storage medium
CN114283829A (en) * 2021-12-13 2022-04-05 电子科技大学 Voice enhancement method based on dynamic gate control convolution cyclic network
CN114283829B (en) * 2021-12-13 2023-06-16 电子科技大学 Voice enhancement method based on dynamic gating convolution circulation network
CN114093379B (en) * 2021-12-15 2022-06-21 北京荣耀终端有限公司 Noise elimination method and device
CN114093379A (en) * 2021-12-15 2022-02-25 荣耀终端有限公司 Noise elimination method and device
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method, device, computer equipment and storage medium
CN114267368A (en) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method of audio noise reduction model, and audio noise reduction method and device
CN114582352A (en) * 2022-02-24 2022-06-03 广州方硅信息技术有限公司 Training method of speech enhancement model, speech enhancement method, device and equipment
CN114582000A (en) * 2022-03-18 2022-06-03 南京工业大学 A fusion model of multimodal elderly emotion recognition based on facial expressions and speech in video images and its establishment method
CN115240699A (en) * 2022-07-21 2022-10-25 电信科学技术第五研究所有限公司 Noise estimation and voice noise reduction method and system based on deep learning
CN115756376A (en) * 2022-10-21 2023-03-07 中电智恒信息科技服务有限公司 A conference volume control method, device and system based on LSTM
WO2025035975A1 (en) * 2023-08-17 2025-02-20 腾讯科技(深圳)有限公司 Training method for speech enhancement network, speech enhancement method, and electronic device
CN117193391A (en) * 2023-11-07 2023-12-08 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN117193391B (en) * 2023-11-07 2024-01-23 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN119418712A (en) * 2025-01-07 2025-02-11 西安赛普特信息科技有限公司 A noise reduction method for real-time speech at the edge

Also Published As

Publication number Publication date
CN112735456B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN105957520B (en) A Speech State Detection Method Applicable to Echo Cancellation System
CN110867181B (en) Multi-target speech enhancement method based on joint estimation of SCNN and TCNN
WO2020177371A1 (en) Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
CN110767244B (en) Speech enhancement method
CN110085249A (en) The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN105023572A (en) Noised voice end point robustness detection method
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
CN111899750B (en) Speech Enhancement Algorithm Combined with Cochlear Speech Features and Jump Deep Neural Networks
CN108335702A (en) A kind of audio defeat method based on deep neural network
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
Hou et al. Domain adversarial training for speech enhancement
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
CN112634926B (en) Short wave channel voice anti-fading auxiliary enhancement method based on convolutional neural network
CN115132215A (en) A single-channel speech enhancement method
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN118899005B (en) Audio signal processing method, device, computer equipment and storage medium
CN116013339A (en) A single-channel speech enhancement method based on improved CRN
Sivapatham et al. Gammatone filter bank-deep neural network-based monaural speech enhancement for unseen conditions
Schmidt et al. Reduction of non-stationary noise using a non-negative latent variable decomposition
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant