
CN112735456A - Speech enhancement method based on DNN-CLSTM network - Google Patents

Speech enhancement method based on DNN-CLSTM network

Info

Publication number
CN112735456A
Authority
CN
China
Prior art keywords
speech
network
amplitude
speech signal
mfcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011323987.2A
Other languages
Chinese (zh)
Other versions
CN112735456B (en)
Inventor
汪友明
张天琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications filed Critical Xian University of Posts and Telecommunications
Priority to CN202011323987.2A priority Critical patent/CN112735456B/en
Publication of CN112735456A publication Critical patent/CN112735456A/en
Application granted granted Critical
Publication of CN112735456B publication Critical patent/CN112735456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention is a speech enhancement method based on a deep neural network combined with a residual long short-term memory network (DNN-CLSTM). The method feeds the speech amplitude feature obtained by spectral subtraction and the Mel-frequency cepstral coefficient (MFCC) feature obtained via the fast Fourier transform into the DNN-CLSTM network model to achieve speech enhancement. First, time-frequency masking, windowing, and framing are applied to the noisy speech; the fast Fourier transform yields the amplitude and phase features of the noisy speech, and the noise amplitude of the noisy speech is estimated. The estimated noise amplitude is then subtracted from the noisy speech amplitude, and the resulting spectrally subtracted speech amplitude serves as the first feature input to the neural network. Second, a fast Fourier transform (FFT) is applied to the noisy speech to compute the spectral line energy, from which the MFCC feature of the noisy speech is obtained as the second feature of the speech signal. The two features are fed into the DNN-CLSTM network for training to obtain the network model, and a minimum mean square error (MMSE) loss is used to evaluate the model's effectiveness. Finally, the actual noisy speech set is fed into the trained speech enhancement network model to predict the enhanced estimated amplitude and MFCC, and the inverse Fourier transform yields the final enhanced speech signal. The invention preserves speech with high fidelity.

Description

A speech enhancement method based on a DNN-CLSTM network

Technical Field

The invention belongs to the technical field of speech enhancement, and in particular relates to a speech enhancement method based on a DNN-CLSTM network.

Background Art

As one of the main means of information transmission, speech is used extensively in everyday life. With the development of technology, speech not only conveys information between people but is also widely used in human-computer interaction. In practice, however, speech signals are often accompanied by a great deal of noise, such as factory noise, car noise, or the background babble of a restaurant. A speech signal containing heavy noise interferes substantially with the receiver's ability to extract the useful information it carries. Speech signal enhancement technology has therefore received extensive attention.

Speech enhancement refers to the process of separating noise from the speech signal when real-world speech is corrupted by noise. Speech enhancement is now widely used in fields such as mobile communication and speech recognition. Its main purpose is to improve speech quality and speech intelligibility. Current speech enhancement methods fall mainly into three categories: spectral subtraction, subspace algorithms, and algorithms based on statistical models. With the development of deep learning, neural networks have also been applied to speech enhancement.

Spectral subtraction, shown in Figure 1, is one of the earliest denoising techniques in speech enhancement. It is based on the following principle: the noise is assumed to be additive, i.e., y(m) = x(m) + n(m), where y(m) is the noisy signal, x(m) is the clean speech signal, and n(m) is the additive noise. The clean speech signal is recovered by subtracting an estimate of the noise spectrum from the noisy speech signal. The premise of this assumption is that the noise is stationary, so that the noise estimate can be computed and updated during speech segments in which the target signal is absent.

Spectral subtraction is a relatively simple speech enhancement algorithm. Its principle is to subtract the estimated noise magnitude spectrum from the magnitude spectrum of the input mixed speech signal and, exploiting the human ear's insensitivity to phase, reuse the phase of the signal before subtraction to synthesize the final spectrally subtracted speech signal. Because spectral subtraction involves only one Fourier transform and one inverse Fourier transform, it is computationally cheap and easy to implement. In practice, however, many noises are non-stationary, so speech enhanced by spectral subtraction often contains a large amount of musical noise, which distorts the speech signal and degrades its intelligibility and quality.
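For orientation, the flow in Figure 1 can be condensed into the following NumPy/SciPy sketch, which estimates the noise spectrum from a few leading noise-only frames, subtracts it from the noisy magnitude, and resynthesizes the signal with the noisy phase; the frame length, hop, and noise-frame count are illustrative choices, not values prescribed by this document.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs=16000, frame_len=320, hop=160, noise_frames=5):
    """Minimal magnitude-spectrum subtraction: estimate the noise spectrum from
    the first few frames, subtract it, and rebuild the signal with the noisy phase."""
    # Short-time Fourier transform (Hamming window)
    _, _, Y = stft(noisy, fs=fs, window='hamming',
                   nperseg=frame_len, noverlap=frame_len - hop)
    mag, phase = np.abs(Y), np.angle(Y)

    # Average magnitude of the leading noise-only frames
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract and half-wave rectify (no negative magnitudes)
    enhanced_mag = np.maximum(mag - noise_mag, 0.0)

    # Reuse the noisy phase and invert the STFT
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs, window='hamming',
                        nperseg=frame_len, noverlap=frame_len - hop)
    return enhanced
```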

Summary of the Invention

The purpose of the present invention is to solve the problems of speech-signal distortion and poor intelligibility and speech quality that arise in spectral-subtraction-based speech enhancement. To this end, the present invention provides a speech enhancement method based on a DNN-CLSTM network, characterized by comprising the following steps:

Step 1: Acquire at least two noisy speech signals. A noisy speech signal is formed by adding a clean speech signal and a noise signal:

y(m) = x(m) + n(m)

where y(m) is the noisy speech signal, x(m) is the clean speech signal, n(m) is the noise signal, and m is the discrete time index;

Step 2: Frame and window the signal, and obtain the amplitude and phase of the clean and noisy speech signals as the first feature. The noisy speech signal is windowed and framed, and the discrete Fourier transform gives its amplitude and phase. Meanwhile, in speech segments that contain only noise and no target signal, the noise is estimated to obtain the noise amplitude;

Step 3: Subtract the noise amplitude from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech amplitude as the second feature;

Step 4: Compute the MFCC as the third feature;

Step 5: Build and train the DNN-CLSTM network model;

The spectrally subtracted speech amplitude and the MFCC are fed into the DNN-CLSTM network for training to obtain the enhanced estimated amplitude and MFCC. The minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize the network, yielding the trained network.

The specific process of Step 4 is:

(1) Preprocessing: preprocessing includes pre-emphasis, framing, and windowing;

Pre-emphasis: implemented with a first-order high-pass filter whose transfer function is:

H(z) = 1 - a·z^(-1)

where a is the pre-emphasis coefficient, typically 0.98;

The result of pre-emphasizing the speech signal x(n) is:

y(n) = x(n) - a·x(n-1)

Framing and windowing: adjacent frames overlap; the overlap, the frame shift, is set to 10 ms. Windowing: each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) becomes y_i(n), defined as:

y_i(n) = ω(n)·y((i-1)·inc + n),  0 ≤ n ≤ L-1

where inc is the frame shift in samples and ω(n) is the Hamming window, given by

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1

where y_i(n) is the i-th frame of the speech signal, n is the sample index, and L is the frame length;

(2) Fast Fourier transform (FFT)

A fast Fourier transform is applied to each frame of the speech signal y_i(n) to obtain the spectrum of each frame:

Y(i,k) = FFT[y_i(n)]

where k denotes the k-th spectral line in the frequency domain;

(3) Compute the spectral line energy

The energy E(i,k) of the spectral line of each speech frame in the frequency domain is:

E(i,k) = [Y(i,k)]²

(4) Compute the energy passing through the Mel filters

The energy S(i,m) obtained by passing each frame's spectral line energy through the Mel filterbank is defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M

where N is the number of FFT points;

The transfer function H_m(k) of each filter is

H_m(k) = 0                                 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))    for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))    for f(m) < k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)

where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters;

(5) Compute the MFCC

Take the logarithm of the Mel-filter energies and apply the discrete cosine transform to obtain the MFCC feature parameters:

mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))

where j is the spectral line index after the DCT.
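As a reading aid, steps (1) through (5) can be chained as in the following NumPy sketch, using the stated settings (pre-emphasis coefficient 0.98, Hamming window, 24 Mel filters) plus illustrative 20 ms frames with a 10 ms shift; it is a generic MFCC recipe for illustration, not the patented implementation, and the mel_filterbank helper is an assumed name.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular Mel filters H_m(k) spaced evenly on the Mel scale."""
    mel = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_filters + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mfcc(x, fs=16000, frame_len=320, hop=160, n_filters=24, n_ceps=13, a=0.98):
    # (1) Pre-emphasis: y(n) = x(n) - a*x(n-1)
    y = np.append(x[0], x[1:] - a * x[:-1])
    # (1) Framing (20 ms frames, 10 ms shift) and Hamming windowing
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)
    # (2) FFT and (3) spectral line energy E(i,k)
    E = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # (4) Energy through the Mel filterbank S(i,m)
    S = E @ mel_filterbank(n_filters, frame_len, fs).T
    # (5) Logarithm and DCT give the MFCCs
    return dct(np.log(S + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```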

The specific process of Step 5 is:

(1) DNN network construction

Input layer: the spectrally subtracted speech amplitude and the MFCC feature are taken as input to the DNN network; the input layer has 128 neurons;

Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;

(2) Multi-target feature fusion:

The amplitude and MFCC features enhanced by the DNN network are combined with the amplitude and MFCC features of the original noisy speech;

(feature-fusion formula, shown as an image in the original)

where the first pair of symbols (shown as images in the original) denotes the DNN-predicted MFCC feature and speech amplitude in the k-th feature space, and the second pair denotes the MFCC feature and speech amplitude of the original noisy speech in the k-th feature space;

(3) C-LSTM network:

(a) CNN:

Convolutional layer: convolves the output of the DNN network; 64 nodes, stride 1, 5×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 64 nodes, stride 1, 3×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 128 nodes, stride 1, 5×1 kernel;

(b) Residual network

The output of the DNN network is convolved with 128 nodes, stride 1, and a 5×1 kernel;

The data from the residual branch is combined with the data from the CNN branch, and the SELU activation function is applied;

Max pooling layer: stride 1, pooling size 2;

(c) LSTM network:

Both directions of the bidirectional long short-term memory network use 128 nodes, with the sigmoid activation function;

(4) Output layer:

Two feedforward neural networks are used as output layers to produce the predicted speech amplitude and MFCC; the network parameters are optimized with the Adam optimizer, and all convolutional layers use edge padding. A code sketch of this topology is given after item (5) below.

(5) Compute the minimum mean square error objective function

(MMSE objective-function formula, shown as an image in the original)

where T = 2; the first pair of symbols (shown as images in the original) denotes the predicted MFCC feature vector and predicted amplitude feature in the k-th acoustic feature space, and the second pair denotes the clean MFCC feature vector and clean amplitude feature in the k-th acoustic feature space.
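To make the layer listing in items (1) through (5) concrete, the following is a rough PyTorch sketch of the described topology together with an equal-weight MMSE loss over the two feature spaces. It is an illustrative interpretation only: the input dimension, the way the fused feature stream is shaped into the 1-D convolutions, the channel pooling before the LSTM, and the equal loss weighting are assumptions not spelled out in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DNNCLSTM(nn.Module):
    """Rough sketch of the described DNN + CNN/residual + BiLSTM topology."""
    def __init__(self, in_dim=128, mag_dim=128, mfcc_dim=24):
        super().__init__()
        # (1) DNN part: fully connected layers, dropout 0.5, ReLU
        self.dnn = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 512), nn.ReLU(), nn.Dropout(0.5),
        )
        # (3a) CNN branch: 5x1 / 3x1 / 5x1 kernels, stride 1, same padding
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2), nn.SELU(), nn.BatchNorm1d(64),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.SELU(), nn.BatchNorm1d(64),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
        )
        # (3b) Residual branch: one 5x1 convolution to 128 channels
        self.res = nn.Conv1d(1, 128, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1)
        # (3c) Bidirectional LSTM, 128 units per direction
        self.lstm = nn.LSTM(input_size=128, hidden_size=128,
                            batch_first=True, bidirectional=True)
        # (4) Two feed-forward output heads: amplitude and MFCC
        self.mag_head = nn.Linear(2 * 128, mag_dim)
        self.mfcc_head = nn.Linear(2 * 128, mfcc_dim)

    def forward(self, x):                      # x: (batch, frames, in_dim) fused features
        b, t, _ = x.shape
        h = self.dnn(x)                        # (batch, frames, 512)
        h = h.reshape(b * t, 1, -1)            # each frame as a 1-channel sequence
        h = F.selu(self.cnn(h) + self.res(h))  # combine CNN and residual branches
        h = self.pool(h).mean(dim=2)           # (batch*frames, 128)
        h, _ = self.lstm(h.reshape(b, t, 128)) # (batch, frames, 256)
        return self.mag_head(h), self.mfcc_head(h)

def multi_target_mmse(pred_mag, pred_mfcc, clean_mag, clean_mfcc):
    """(5) Equal-weight MSE over the T = 2 feature spaces (amplitude and MFCC)."""
    return 0.5 * (F.mse_loss(pred_mag, clean_mag) + F.mse_loss(pred_mfcc, clean_mfcc))
```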

The advantages of the present invention are as follows: the enhanced speech signal is smooth and has high fidelity and good speech quality.

The present invention is described in detail below with reference to the accompanying drawings and embodiments.

Brief Description of the Drawings

Figure 1 is a flow chart of conventional spectral subtraction.

Figure 2 is a flow chart of the DNN-CLSTM-based speech enhancement method in the training phase.

Figure 3 is a flow chart of the DNN-CLSTM-based speech enhancement method in the testing phase.

Figure 4 is a flow chart of the MFCC computation.

Figure 5 is an architecture diagram of the process of building the DNN-CLSTM neural network.

Figure 6 is the spectrogram of clean speech.

Figure 7 is the spectrogram of noisy speech.

Figure 8 is the spectrogram after processing with the DNN speech enhancement method.

Figure 9 is the spectrogram after processing with the CNN speech enhancement method.

Figure 10 is the spectrogram after processing with the LSTM speech enhancement method.

Figure 11 is the spectrogram after processing with the GRU speech enhancement method.

Figure 12 is the spectrogram after processing with the DNN-CLSTM speech enhancement method.

Detailed Description of the Embodiments

To overcome the defect that speech enhanced by spectral subtraction often contains a large amount of musical noise, which distorts the speech signal and degrades its intelligibility and quality, this embodiment provides a speech enhancement method based on a DNN-CLSTM network (as shown in Figures 2 and 3), comprising the following steps:

Acquire two noisy speech signals (more than two noisy speech signals may also be acquired, as actual needs dictate); a noisy speech signal consists of a clean speech signal and a noise signal:

y(m) = x(m) + n(m)

where y(m) is the noisy speech signal, x(m) is the clean speech signal, and n(m) is the noise signal.

Training phase:

1. Framing and windowing

Obtain the amplitude and phase of the clean speech signal and the noisy speech signal; taking the processing of the noisy speech as an example:

The noisy speech signal is windowed and framed, and the discrete Fourier transform gives the amplitude and phase of the noisy speech signal as the first feature;

The noisy speech signal is framed and windowed with a Hamming window:

y_w(m) = w(m)·y(m) = w(m)·[x(m) + n(m)] = x_w(m) + n_w(m)

The windowing operation is expressed in the frequency domain as:

Y_w(f) = W(f) * Y(f) = X_w(f) + N_w(f)

The signal is assumed to have been windowed; for simplicity, the subscript w is omitted below.

Y_w(f) is expressed in polar form:

Y(f) = |Y(f)|·e^(jφ_y(f))

where |Y(f)| is the amplitude spectrum and φ_y(f) is the phase signal Phase[Y(f)].

2. Noise estimation:

In speech segments that contain no target signal and only noise, the noise is estimated to obtain the noise amplitude; the present invention takes the first five frames of the speech signal as the noise segment. The magnitude spectrum |N(f)| of the noise is unknown, but it can be replaced by an estimate of the average magnitude spectrum during periods without speech activity, and the noise phase φ_n(f) can be replaced by the phase φ_y(f) of the noisy speech. When no speech is present and only noise exists, the average noise magnitude spectrum |N̄(f)| is obtained as follows:

|N̄(f)| = (1/k)·Σ_{i=0}^{k-1} |N_i(f)|

where |N_i(f)| is the spectrum of the i-th noise frame and k is the number of frames in the noise-only period.

3. Spectral subtraction

The noise amplitude is subtracted from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech amplitude as the second feature (the enhanced speech signal obtained by spectral subtraction is referred to as the spectrally subtracted speech signal, and its amplitude as the spectrally subtracted speech amplitude, to distinguish it from the enhanced speech signal finally obtained in this embodiment);

The amplitude of the noise signal is subtracted from the amplitude of the noisy speech signal as follows:

|X̂(f)|^b = |Y(f)|^b - α·|N̄(f)|^b

where |X̂(f)| represents the amplitude of the spectrally subtracted speech signal, |Y(f)|^b represents the amplitude of the noisy speech signal, and |N̄(f)| represents the average of the noise statistics over the noise segment. α is the spectral subtraction noise coefficient. b is the power exponent: b = 1 gives amplitude-spectrum subtraction and b = 2 gives power-spectrum subtraction.

Because the estimate of the noise signal may contain errors, the estimated amplitude spectrum |X̂(f)|^b may become negative. An amplitude spectrum should not normally take negative values, so to avoid negative values of |X̂(f)|^b, half-wave rectification is applied to the difference spectrum as follows:

|X̂(f)|^b = |Y(f)|^b - α·|N̄(f)|^b  if |Y(f)|^b - α·|N̄(f)|^b > 0, and 0 otherwise

After spectral subtraction, the power of the spectrally subtracted signal must be reduced to obtain the speech amplitude after the spectral subtraction stage, i.e. the spectrally subtracted speech amplitude |X̂(f)|, computed as follows:

|X̂(f)| = (|X̂(f)|^b)^(1/b)
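These three steps (subtraction with the coefficient α and exponent b, half-wave rectification, and the 1/b power reduction) can be written compactly as in the sketch below; α and b are tunable parameters, and the values α = 2 and b = 2 (power-spectrum subtraction) shown as defaults are illustrative only.

```python
import numpy as np

def over_subtract(noisy_mag, noise_mag, alpha=2.0, b=2.0):
    """Spectral subtraction with over-subtraction coefficient alpha and power exponent b,
    followed by half-wave rectification and the 1/b power reduction."""
    diff = noisy_mag ** b - alpha * noise_mag ** b   # |Y(f)|^b - alpha*|N(f)|^b
    diff = np.maximum(diff, 0.0)                     # half-wave rectification
    return diff ** (1.0 / b)                         # back to an amplitude |X(f)|
```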

4. Extract the MFCC as the third feature;

(1) Preprocessing

Preprocessing includes pre-emphasis, framing, and windowing. Pre-emphasis is implemented with a first-order high-pass filter whose transfer function is:

H(z) = 1 - a·z^(-1)

where a is the pre-emphasis coefficient, typically 0.98.

The result of pre-emphasizing the speech signal x(n) is:

y(n) = x(n) - a·x(n-1)

Framing: the speech production process shows that the speech signal is a non-stationary, time-varying signal. Over a short interval, however, it can be regarded as a stationary, time-invariant signal; "short" usually means 10 to 30 ms, and 20 ms is used here. Short-time analysis is therefore used to analyze and process the speech signal, splitting it into many frames whose characteristic parameters are analyzed. So that adjacent frames transition smoothly, consecutive frames overlap; the overlap, the frame shift, is set to 10 ms. The purpose of the window function is to reduce spectral leakage in the frequency domain. Each frame of the speech signal is windowed, usually with a Hamming window, which leaks less spectral energy than a rectangular window. After framing and windowing, y(n) becomes y_i(n), defined as:

y_i(n) = ω(n)·y((i-1)·inc + n),  0 ≤ n ≤ L-1

where inc is the frame shift in samples and ω(n) is the Hamming window, given by

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1

where y_i(n) is the i-th frame of the speech signal, n is the sample index, and L is the frame length.

(2) Fast Fourier transform (FFT)

Because the characteristics of a speech signal are generally hard to see in the time domain, it is usually transformed into the frequency domain for analysis; the spectra at different frequencies represent different characteristics of the speech signal. A fast Fourier transform is therefore applied to each frame y_i(n) to obtain the spectrum of each frame:

Y(i,k) = FFT[y_i(n)]

where k denotes the k-th spectral line in the frequency domain.

(3) Compute the spectral line energy

The energy E(i,k) of the spectral line of each speech frame in the frequency domain can be expressed as:

E(i,k) = [Y(i,k)]²

(4) Compute the energy passing through the Mel filters

The energy S(i,m) obtained by passing each frame's spectral line energy through the Mel filterbank can be defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M

where N is the number of FFT points.

The transfer function H_m(k) of each filter is

H_m(k) = 0                                 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))    for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))    for f(m) < k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)

where f(m) is the center frequency of the m-th filter, m indexes the filters, and M is the number of filters, usually set to 24.

(5) Compute the MFCC; see Figure 4. Take the logarithm of the Mel-filter energies and compute the discrete cosine transform (DCT) to obtain the MFCC feature parameters:

mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))

where j is the spectral line index after the DCT.

8. Training of the DNN-CLSTM network model

The spectrally subtracted speech amplitude and the MFCC are fed into the DNN-CLSTM network for training to obtain the enhanced estimated amplitude and MFCC. The minimum mean square errors between the estimated amplitude and the clean amplitude and between the estimated MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize the network, yielding the trained network.
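A training loop matching this paragraph might look roughly like the sketch below, assuming a model with two output heads (for example the DNNCLSTM sketch shown earlier) and a data loader yielding fused input features together with the clean amplitude and clean MFCC targets; batch size, learning rate, and epoch count are placeholder choices.

```python
import torch

def train(model, loader, epochs=50, lr=1e-3, device="cpu"):
    """Optimize the network with Adam against the two MMSE targets (amplitude, MFCC)."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for epoch in range(epochs):
        for feats, clean_mag, clean_mfcc in loader:   # fused inputs and clean targets
            feats = feats.to(device)
            clean_mag, clean_mfcc = clean_mag.to(device), clean_mfcc.to(device)
            pred_mag, pred_mfcc = model(feats)
            # Errors on both targets act as the adjustment signal for optimization
            loss = 0.5 * (mse(pred_mag, clean_mag) + mse(pred_mfcc, clean_mfcc))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```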

Testing phase:

1. The noisy speech signal is windowed and framed, and the discrete Fourier transform gives the amplitude and phase of the noisy speech signal as the first feature;

2. After framing and windowing the noisy speech signal, obtain the speech amplitude after the spectral subtraction stage, i.e. the spectrally subtracted speech amplitude, as the second feature;

3. After framing and windowing the noisy speech signal, obtain the MFCC as the third feature;

4. Perform time-frequency decomposition of the noisy speech signal and extract E_MRACC(j,m) and ΔE_MRACC(j,m) as the fourth feature input to the deep neural network.

5. Feed the four features, namely the noisy speech amplitude, the spectrally subtracted speech amplitude, the MFCC, and the E_MRACC(j,m) and ΔE_MRACC(j,m) signals obtained by feature extraction, into the trained DNN-CLSTM network to obtain the enhanced estimated amplitude, MFCC, and masking threshold;

6. Combine the enhanced speech amplitude with the phase of the noisy speech signal obtained in step 1, and apply the inverse Fourier transform to obtain the final enhanced speech signal. The preliminary enhanced speech amplitude produced by the trained neural network must be converted back to a time-domain signal to obtain the final enhanced speech: the preliminary enhanced amplitude is combined with the phase φ_y(f) of the noisy speech extracted in step 1, and the inverse Fourier transform converts the result into a time-domain signal, yielding the final enhanced speech signal. The enhanced MFCC and masking threshold obtained here do not take part in waveform recovery; they are used to optimize the network during processing.

7. Network construction

The DNN-CLSTM network comprises a deep neural network (DNN), a convolutional neural network (CNN), a residual network, and a bidirectional long short-term memory network. The specific process of building the DNN-CLSTM neural network is shown in Figure 5:

(1) DNN network construction

Input layer: the spectrally subtracted speech amplitude obtained after spectral subtraction is taken as input; the input layer has 128 neurons;

Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;

(2) Multi-target feature fusion:

The amplitude and MFCC features enhanced by the DNN network are combined with the amplitude and MFCC features of the original noisy speech;

(feature-fusion formula, shown as an image in the original)

where the first pair of symbols (shown as images in the original) denotes the DNN-predicted MFCC feature and speech amplitude in the k-th feature space, and the second pair denotes the MFCC feature and speech amplitude of the original noisy speech in the k-th feature space;

(a) CNN:

Convolutional layer: convolves the output of the DNN network; 64 nodes, stride 1, 5×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 64 nodes, stride 1, 3×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 128 nodes, stride 1, 5×1 kernel;

(b) Residual network

The output of the DNN network is convolved with 128 nodes, stride 1, and a 5×1 kernel;

The data from the residual branch is combined with the data from the CNN branch, and the SELU activation function is applied;

Max pooling layer: stride 1, pooling size 2;

(3) LSTM network:

Both directions of the bidirectional long short-term memory network use 128 nodes, with the sigmoid activation function;

(4) Output layer:

Two feedforward neural networks are used as output layers to produce the enhanced speech amplitude and MFCC; the network parameters are optimized with the Adam optimizer, and all convolutional layers use edge padding.

(5) Compute the minimum mean square error objective function

(MMSE objective-function formula, shown as an image in the original)

where T = 2; the first pair of symbols (shown as images in the original) denotes the predicted MFCC feature vector and predicted amplitude feature in the k-th acoustic feature space, and the second pair denotes the clean MFCC feature vector and clean amplitude feature in the k-th acoustic feature space.

[Test Example]

The speech data used in the experiments come from the TIMIT dataset, and the noise data come from the Nonspeech noise library and the Noise-15 noise library. The TIMIT dataset contains a total of 6,300 utterances; about 80% of the utterances are used as the training set and the remaining 20% as the test set. All speech is resampled to 16 kHz. Several typical neural-network speech enhancement models are selected for comparison with the proposed method: (a) DNN, (b) CNN, (c) LSTM, and (d) GRU, which are speech enhancement algorithms based on a deep neural network, a convolutional neural network, a long short-term memory network, and a gated recurrent neural network, respectively.

All models are trained under SNR conditions of -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB, and performance is evaluated at matched SNRs. To test the robustness of the speech enhancement models, performance is also evaluated under mismatched SNR conditions. PESQ and LSD are two important metrics for evaluating speech: PESQ is a subjective speech quality measure, and a higher PESQ score indicates better speech quality; LSD is the log-spectral distance, and a lower LSD score indicates better speech quality. Table 1 gives the test results compared with the other four algorithms (DNN, CNN, LSTM, GRU) under matched noise conditions, with the best-performing results marked in bold. Table 2 gives the test results compared with the other four algorithms under mismatched noise conditions, with the best-performing results marked in bold.
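For reference, PESQ and LSD scores of the kind reported in Tables 1 and 2 could be computed, for example, with the third-party pesq package and a NumPy log-spectral distance as sketched below; this is an assumed evaluation recipe, not necessarily the one used to produce the tables.

```python
import numpy as np
from pesq import pesq          # pip install pesq (ITU-T P.862 implementation)

def evaluate(clean, enhanced, fs=16000, frame_len=320, hop=160):
    """PESQ (higher is better) and log-spectral distance (lower is better).
    The clean and enhanced signals are assumed time-aligned and of equal length."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')   # wide-band PESQ at 16 kHz

    def power_spec(x):
        n = 1 + (len(x) - frame_len) // hop
        idx = np.arange(frame_len) + hop * np.arange(n)[:, None]
        return np.abs(np.fft.rfft(x[idx] * np.hamming(frame_len), axis=1)) ** 2

    S_clean = power_spec(clean) + 1e-10
    S_enh = power_spec(enhanced) + 1e-10
    lsd = np.mean(np.sqrt(np.mean((10 * np.log10(S_clean / S_enh)) ** 2, axis=1)))
    return pesq_score, lsd
```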

Table 1: Test results under matched noise conditions; the best performance is marked in bold.

(Table 1 is shown as an image in the original.)

Table 2: Test results under mismatched noise conditions; the best performance is marked in bold.

(Table 2 is shown as an image in the original.)

Claims (3)

1. A speech enhancement method based on a DNN-CLSTM network, characterized by comprising the following steps:

Step 1: Acquire a noisy speech signal. The noisy speech signal is formed by adding a clean speech signal and a noise signal:

y(m) = x(m) + n(m)

where y(m) is the noisy speech signal, x(m) is the clean speech signal, n(m) is the noise signal, and m is the discrete time index;

Step 2: Frame and window the signal, and obtain the amplitude and phase of the clean and noisy speech signals. The noisy speech signal is windowed and framed, and the discrete Fourier transform gives its amplitude and phase. The first five frames of the speech segment are used as the noise estimate to obtain the noise amplitude;

Step 3: Subtract the noise amplitude from the amplitude of the noisy speech signal to obtain the spectrally subtracted speech amplitude as the first feature;

Step 4: Compute the MFCC of the speech signal as the second feature;

Step 5: Build and train the DNN-CLSTM network model;

The spectrally subtracted speech amplitude and the MFCC are fed into the DNN-CLSTM network for training to obtain the predicted amplitude and MFCC. The minimum mean square errors (MMSE) between the predicted amplitude and the clean amplitude and between the predicted MFCC and the clean MFCC are computed, and the resulting errors are fed back into the neural network as adjustment signals to optimize the network, yielding the trained network.

2. The speech enhancement method based on a DNN-CLSTM network according to claim 1, characterized in that the specific process of Step 4 is:

(1) Preprocessing: preprocessing includes pre-emphasis, framing, and windowing;

Pre-emphasis: implemented with a first-order high-pass filter whose transfer function is:

H(z) = 1 - a·z^(-1)

where a is the pre-emphasis coefficient, typically 0.98; the result of pre-emphasizing the speech signal x(n) is:

y(n) = x(n) - a·x(n-1)

Framing and windowing: adjacent frames overlap; the overlap, the frame shift, is set to 10 ms. Each frame of the speech signal is multiplied by a Hamming window. After framing and windowing, y(n) becomes y_i(n), defined as:

y_i(n) = ω(n)·y((i-1)·inc + n),  0 ≤ n ≤ L-1

where inc is the frame shift in samples and ω(n) is the Hamming window, given by

ω(n) = 0.54 - 0.46·cos(2πn/(L-1)),  0 ≤ n ≤ L-1

where y_i(n) is the i-th frame of the speech signal, n is the sample index, and L is the frame length;

(2) Fast Fourier transform (FFT)

A fast Fourier transform is applied to each frame of the speech signal y_i(n) to obtain the spectrum of each frame:

Y(i,k) = FFT[y_i(n)]

where k denotes the k-th spectral line in the frequency domain;

(3) Compute the spectral line energy

The energy E(i,k) of the spectral line of each speech frame in the frequency domain is:

E(i,k) = [Y(i,k)]²

(4) Compute the energy passing through the Mel filters

The energy S(i,m) obtained by passing each frame's spectral line energy through the Mel filterbank is defined as:

S(i,m) = Σ_{k=0}^{N-1} E(i,k)·H_m(k),  0 ≤ m < M

where N is the number of FFT points and M is the number of filters;

The transfer function H_m(k) of each filter is

H_m(k) = 0                                 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))    for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))    for f(m) < k ≤ f(m+1)
H_m(k) = 0                                 for k > f(m+1)

where f(m) is the center frequency of the m-th filter and m indexes the filters;

(5) Compute the MFCC

Take the logarithm of the Mel-filter energies and apply the discrete cosine transform to obtain the MFCC feature parameters:

mfcc(i,j) = sqrt(2/M) · Σ_{m=1}^{M} log[S(i,m)] · cos(π·j·(2m-1)/(2M))

where j is the spectral line index after the discrete cosine transform (DCT).
3. The speech enhancement method based on a DNN-CLSTM network according to claim 1, characterized in that the specific process of Step 5 is:

(1) DNN network construction

Input layer: the spectrally subtracted speech amplitude and the MFCC feature are taken as input to the DNN network; the input layer has 128 neurons;

Fully connected layer: 32 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 128 nodes, dropout rate 0.5, ReLU activation;

Fully connected layer: 512 nodes, dropout rate 0.5, ReLU activation;

(2) Multi-target feature fusion:

The amplitude and MFCC features enhanced by the DNN network are combined with the amplitude and MFCC features of the original noisy speech;

(feature-fusion formula, shown as an image in the original)

where the first pair of symbols (shown as images in the original) denotes the DNN-predicted MFCC feature and speech amplitude in the k-th feature space, and the second pair denotes the MFCC feature and speech amplitude of the original noisy speech in the k-th feature space;

(3) C-LSTM network:

(a) CNN:

Convolutional layer: convolves the output of the DNN network; 64 nodes, stride 1, 5×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 64 nodes, stride 1, 3×1 kernel, SELU activation;

BN layer: normalizes the data;

Convolutional layer: 128 nodes, stride 1, 5×1 kernel;

(b) Residual network

The output of the DNN network is convolved with 128 nodes, stride 1, and a 5×1 kernel;

The data from the residual branch is combined with the data from the CNN branch, and the SELU activation function is applied;

Max pooling layer: stride 1, pooling size 2;

(c) LSTM network:

Both directions of the bidirectional long short-term memory network use 128 nodes, with the sigmoid activation function;

(4) Output layer:

Two feedforward neural networks are used as output layers to produce the predicted speech amplitude and MFCC; the network parameters are optimized with the Adam optimizer, and all convolutional layers use edge padding.

(5) Compute the minimum mean square error objective function

(MMSE objective-function formula, shown as an image in the original)

where T = 2; the first pair of symbols (shown as images in the original) denotes the predicted MFCC feature vector and predicted amplitude feature in the k-th acoustic feature space, and the second pair denotes the clean MFCC feature vector and clean amplitude feature in the k-th acoustic feature space.
CN202011323987.2A 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network Active CN112735456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011323987.2A CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Publications (2)

Publication Number Publication Date
CN112735456A 2021-04-30
CN112735456B (en) 2024-01-16

Family

ID=75597716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011323987.2A Active CN112735456B (en) 2020-11-23 2020-11-23 Speech enhancement method based on DNN-CLSTM network

Country Status (1)

Country Link
CN (1) CN112735456B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192520A (en) * 2021-07-01 2021-07-30 腾讯科技(深圳)有限公司 Audio information processing method and device, electronic equipment and storage medium
CN113269305A (en) * 2021-05-20 2021-08-17 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN114093379A (en) * 2021-12-15 2022-02-25 荣耀终端有限公司 Noise elimination method and device
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method, device, computer equipment and storage medium
CN114267368A (en) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method of audio noise reduction model, and audio noise reduction method and device
CN114283829A (en) * 2021-12-13 2022-04-05 电子科技大学 Voice enhancement method based on dynamic gate control convolution cyclic network
CN114582000A (en) * 2022-03-18 2022-06-03 南京工业大学 A fusion model of multimodal elderly emotion recognition based on facial expressions and speech in video images and its establishment method
CN114582352A (en) * 2022-02-24 2022-06-03 广州方硅信息技术有限公司 Training method of speech enhancement model, speech enhancement method, device and equipment
CN115240699A (en) * 2022-07-21 2022-10-25 电信科学技术第五研究所有限公司 Noise estimation and voice noise reduction method and system based on deep learning
CN115756376A (en) * 2022-10-21 2023-03-07 中电智恒信息科技服务有限公司 A conference volume control method, device and system based on LSTM
CN117193391A (en) * 2023-11-07 2023-12-08 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN119418712A (en) * 2025-01-07 2025-02-11 西安赛普特信息科技有限公司 A noise reduction method for real-time speech at the edge
WO2025035975A1 (en) * 2023-08-17 2025-02-20 腾讯科技(深圳)有限公司 Training method for speech enhancement network, speech enhancement method, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study
WO2020024452A1 (en) * 2018-07-31 2020-02-06 平安科技(深圳)有限公司 Deep learning-based answering method and apparatus, and readable storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024452A1 (en) * 2018-07-31 2020-02-06 平安科技(深圳)有限公司 Deep learning-based answering method and apparatus, and readable storage medium
CN109410976A (en) * 2018-11-01 2019-03-01 北京工业大学 Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN110060704A (en) * 2019-03-26 2019-07-26 天津大学 A kind of sound enhancement method of improved multiple target criterion study
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚远; 王秋菊; 周伟; 鲍程毅; 彭磊: "Research on speech enhancement combining improved spectral subtraction with a neural network", Electronic Measurement Technology, no. 07

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN113611323B (en) * 2021-05-07 2024-02-20 北京至芯开源科技有限责任公司 Voice enhancement method and system based on double-channel convolution attention network
CN113269305A (en) * 2021-05-20 2021-08-17 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113269305B (en) * 2021-05-20 2024-05-03 郑州铁路职业技术学院 Feedback voice strengthening method for strengthening memory
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113192520A (en) * 2021-07-01 2021-07-30 腾讯科技(深圳)有限公司 Audio information processing method and device, electronic equipment and storage medium
CN114283829A (en) * 2021-12-13 2022-04-05 电子科技大学 Voice enhancement method based on dynamic gate control convolution cyclic network
CN114283829B (en) * 2021-12-13 2023-06-16 电子科技大学 Voice enhancement method based on dynamic gating convolution circulation network
CN114093379B (en) * 2021-12-15 2022-06-21 北京荣耀终端有限公司 Noise elimination method and device
CN114093379A (en) * 2021-12-15 2022-02-25 荣耀终端有限公司 Noise elimination method and device
CN114220448A (en) * 2021-12-16 2022-03-22 游密科技(深圳)有限公司 Voice signal generation method, device, computer equipment and storage medium
CN114267368A (en) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method of audio noise reduction model, and audio noise reduction method and device
CN114582352A (en) * 2022-02-24 2022-06-03 广州方硅信息技术有限公司 Training method of speech enhancement model, speech enhancement method, device and equipment
CN114582000A (en) * 2022-03-18 2022-06-03 南京工业大学 A fusion model of multimodal elderly emotion recognition based on facial expressions and speech in video images and its establishment method
CN115240699A (en) * 2022-07-21 2022-10-25 电信科学技术第五研究所有限公司 Noise estimation and voice noise reduction method and system based on deep learning
CN115756376A (en) * 2022-10-21 2023-03-07 中电智恒信息科技服务有限公司 A conference volume control method, device and system based on LSTM
WO2025035975A1 (en) * 2023-08-17 2025-02-20 腾讯科技(深圳)有限公司 Training method for speech enhancement network, speech enhancement method, and electronic device
CN117193391A (en) * 2023-11-07 2023-12-08 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN117193391B (en) * 2023-11-07 2024-01-23 北京铁力山科技股份有限公司 Intelligent control desk angle adjustment system
CN119418712A (en) * 2025-01-07 2025-02-11 西安赛普特信息科技有限公司 A noise reduction method for real-time speech at the edge

Also Published As

Publication number Publication date
CN112735456B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN105957520B (en) A Speech State Detection Method Applicable to Echo Cancellation System
CN110867181B (en) Multi-target speech enhancement method based on joint estimation of SCNN and TCNN
WO2020177371A1 (en) Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
CN110767244B (en) Speech enhancement method
CN110085249A (en) The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN105023572A (en) Noised voice end point robustness detection method
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
CN111899750B (en) Speech Enhancement Algorithm Combined with Cochlear Speech Features and Jump Deep Neural Networks
CN108335702A (en) A kind of audio defeat method based on deep neural network
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
Hou et al. Domain adversarial training for speech enhancement
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
CN112634926B (en) Short wave channel voice anti-fading auxiliary enhancement method based on convolutional neural network
CN115132215A (en) A single-channel speech enhancement method
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN118899005B (en) Audio signal processing method, device, computer equipment and storage medium
CN116013339A (en) A single-channel speech enhancement method based on improved CRN
Sivapatham et al. Gammatone filter bank-deep neural network-based monaural speech enhancement for unseen conditions
Schmidt et al. Reduction of non-stationary noise using a non-negative latent variable decomposition
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant