CN112530451A

CN112530451A - Speech enhancement method based on denoising autoencoder

Info

Publication number: CN112530451A
Application number: CN202011128458.7A
Authority: CN
Inventors: 张世强; 胡显秋; 张婷娟; 于乐; 顾雷
Original assignee: Yichun Power Supply Co Of State Grid Heilongjiang Electric Power Co ltd; State Grid Corp of China SGCC; Northeast Dianli University
Current assignee: Yichun Power Supply Co Of State Grid Heilongjiang Electric Power Co ltd; State Grid Corp of China SGCC; Northeast Electric Power University
Priority date: 2020-10-20
Filing date: 2020-10-20
Publication date: 2021-03-19

Abstract

A speech enhancement method based on denoising auto-encoder of the present invention is characterized in that the content includes: constructing a de-noising auto-encoder training model, obtaining time-domain difference values from a multi-microphone array, and reconstructing an original sound prediction model for de-noising Noise processing can effectively reduce the interference of noise on speech signals and significantly improve the signal-to-noise ratio of speech signals.

Description

Speech enhancement method based on denoising autoencoder

技术领域technical field

本发明属于语音信号处理技术领域，涉及一种基于去噪自编码器的语音增强方法。The invention belongs to the technical field of speech signal processing, and relates to a speech enhancement method based on a denoising autoencoder.

背景技术Background technique

语音降噪是语音处理系统的重要前端。背景噪音和人声干扰会降低语音信号的质量及可懂度，并在实际应用中导致性能下降，包括语音通信、助听器和语音识别。语音降噪的一个关键目标是提高干扰噪声存在时的质量和可懂度。Speech noise reduction is an important front-end for speech processing systems. Background noise and human-voice interference can reduce the quality and intelligibility of speech signals and cause performance degradation in real-world applications, including voice communications, hearing aids, and speech recognition. A key goal of speech noise reduction is to improve the quality and intelligibility in the presence of interfering noise.

在语音降噪算法中，最常用的方法就是谱减法。谱减法具有算法简单、运算量小的特点。该算法的不足之处是处理后会产生听上去类似音乐的“音乐噪声”。基于自适应滤波器法的语音降噪算法，可以利用前一帧的滤波器参数和滤波结果，自动调整当前帧的滤波器参数，它对干净语音信号和噪声的先验知识要求较低。从而适应干净语音信号和噪声未知的随机变化和统计性，所以降噪后的语音无论在信噪比和听感上都有明显的进步。但这类算法往往存在收敛速度慢、不适用于非平稳噪声问题。基于最小均方误差估计法(MMSE)语音降噪算法能够有效抑制残留的“音乐噪声”。但是这种方法在低信噪比的情况下，对语音帧和非语音帧的识别极容易出错，导致降噪后的语音严重失真。基于子空间的语音降噪算法是通过空间分解将整个空间分为纯噪声子空间和纯语音子空间。通过设计一个既要保证残差信号谱的同时，还要考虑使语音失真最小化的估计器，来去除噪声子空间并估计语音信号特征值从而实现语音降噪。其中一种最常用的基于最优约束估计器的子空间语音降噪，但该语音降噪算法复杂度很高，很难在嵌入式平台上实现。小波变换方法是一种新型变换分析方法，能够在时间或空间上进行频率的局部分析。通过伸缩和平移运算对信号进行逐步尺度细化，具有多分辨率分析的特性，能够自适应信号分析的要求，现已被广泛运用在音频、图像处理领域中。依据小波变换能够有效去除数据的相关性特点，使得干净语音信号能量集中在小波域中的较大的小波系数中，噪声能量则集中在较小的小波系数中。其本质上是一种小波域滤波算法，选择适当的阈值是关系到系统表现的关键所在。但是阈值获取困难且算法复杂度越来越高，较难用于实时通信。深度神经网络(DNN)应用于语音降噪工作变得越来越流行。基于深度神经网络的语音降噪算法是通过堆叠自动编码器，形成一个深层神经网络，输入为含噪语音的对数功率谱，对应输出干净语音信号的对数功率谱。该网络虽然较传统单通道语音算法有较好的降噪效果，但存在网络训练困难、在低信噪比条件下表现能力欠佳的问题。In speech noise reduction algorithms, the most commonly used method is spectral subtraction. Spectral subtraction has the characteristics of simple algorithm and small computational complexity. The downside of this algorithm is that it produces "musical noise" that sounds like music after processing. The speech noise reduction algorithm based on the adaptive filter method can use the filter parameters and filtering results of the previous frame to automatically adjust the filter parameters of the current frame, which requires less prior knowledge of clean speech signals and noise. Therefore, it can adapt to the unknown random changes and statistics of clean speech signals and noise, so the speech after noise reduction has obvious improvement in signal-to-noise ratio and hearing sense. However, such algorithms often have slow convergence speed and are not suitable for non-stationary noise problems. The speech noise reduction algorithm based on Minimum Mean Square Error Estimation (MMSE) can effectively suppress the residual "music noise". However, in the case of low signal-to-noise ratio, the recognition of speech frames and non-speech frames is extremely error-prone, resulting in severe distortion of speech after noise reduction. The speech noise reduction algorithm based on subspace divides the whole space into pure noise subspace and pure speech subspace through space decomposition. By designing an estimator that not only guarantees the residual signal spectrum, but also considers minimizing the speech distortion, the noise subspace is removed and the eigenvalues of the speech signal are estimated to achieve speech noise reduction. One of the most commonly used subspace speech noise reduction based on optimal constraint estimator, but this speech noise reduction algorithm is very complex and difficult to implement on embedded platforms. The wavelet transform method is a new transform analysis method, which can perform local analysis of frequency in time or space. The signal is gradually refined by scaling and translation operations, which has the characteristics of multi-resolution analysis and can adapt to the requirements of signal analysis. It has been widely used in the fields of audio and image processing. According to the wavelet transform can effectively remove the correlation characteristics of the data, the clean speech signal energy is concentrated in the larger wavelet coefficients in the wavelet domain, and the noise energy is concentrated in the smaller wavelet coefficients. It is essentially a wavelet domain filtering algorithm, and choosing an appropriate threshold is the key to the performance of the system. However, the threshold value is difficult to obtain and the algorithm complexity is getting higher and higher, and it is difficult to be used for real-time communication. Deep neural networks (DNNs) are becoming more and more popular for speech noise reduction work. The speech noise reduction algorithm based on deep neural network is to form a deep neural network by stacking auto-encoders. Although the network has better noise reduction effect than the traditional single-channel speech algorithm, it has the problems of difficulty in network training and poor performance under the condition of low signal-to-noise ratio.

发明内容SUMMARY OF THE INVENTION

本发明的目的是，为了降低噪声对语音信号的干扰，提高语音信号的信噪比，提出一种基于去噪自编码器的语音增强方法，实现语音信号的增强。The purpose of the present invention is to propose a speech enhancement method based on denoising autoencoder to realize the enhancement of speech signal in order to reduce the interference of noise on speech signal and improve the signal-to-noise ratio of speech signal.

本发明的目的是由以下技术方案来实现的：一种基于去噪自编码器的语音增强方法，其特征是，它包括的内容有：构建去噪自编码器训练模型，多麦克风阵列获取时域差值，重构原声预测模型进行去噪处理，The object of the present invention is achieved by the following technical solutions: a speech enhancement method based on denoising autoencoder, characterized in that the content it includes includes: constructing a denoising autoencoder training model, when multiple microphone arrays acquire Domain difference, reconstruct the original sound prediction model for denoising,

1)构建去噪自编码器训练模型1) Build a denoising autoencoder training model

去噪自编码器训练模型设计为三层网络模型，第一层为输入层，中间层为隐藏层，设计节点个数为1024个，第三层为输出层，将输出层与原始无损数据进行比对，最小化损失值：The denoising autoencoder training model is designed as a three-layer network model. The first layer is the input layer, the middle layer is the hidden layer, the number of design nodes is 1024, and the third layer is the output layer. The output layer is compared with the original lossless data. Alignment, minimize the loss value:

式中，

是样本x经过损坏过程

后得到的损坏样本，通常分布p_decoder是因子的分布，平局参数由前馈网络给出，这里对负对数释然

进行基于梯度下降法的近似最小化，

即是样本

的概率分布，这样构成了确定的自编码器，也就是一个前馈的网络，并且能够使用与其他前馈网络完全相同的方式进行训练，因此整个自动编码器就可类比为下一个期望的梯度下降：In the formula,

is the sample x that has undergone the damage process

The damaged samples obtained later, usually the distribution p _decoder is the distribution of factors, and the draw parameters are given by the feedforward network, where the negative logarithm is relieved

perform an approximate minimization based on gradient descent,

the sample

The probability distribution of , which constitutes a deterministic auto-encoder, which is a feed-forward network, and can be trained in exactly the same way as other feed-forward networks, so the entire auto-encoder can be analogized to the next expected gradient decline:

其中，

是训练数据的分布，

表示对

分布的期望值，

表示对

样本

在全量x上的下一个期望值；in,

is the distribution of training data,

express right

the expected value of the distribution,

express right

sample

the next expected value on the full x;

2)多麦克风阵列获取时域差值2) Multi-microphone array to obtain time domain difference

麦克风阵列的语音增强方法的优势在于考虑了声源的位置信息，能够实现空间滤波，所以对具有方向性的噪声具有优良的抑制效果，因此，将麦克风阵列的技术应用在抑制干扰语音中，具体实现是对期望方向的语音信号进行保留；The advantage of the voice enhancement method of the microphone array is that it considers the position information of the sound source and can realize spatial filtering, so it has an excellent suppression effect on directional noise. Therefore, the technology of the microphone array is applied to suppress the interfering voice. The realization is to reserve the speech signal in the desired direction;

首先，不同的麦克风由于位置不同，所以接收的语音信号必定存在着时间偏差，因此利用抽头延迟线结构(Tapped Delay-lines，TDLs)来实现对宽带语音信号的波束形成，TDLs结构的固定波束形成算法，通过多抽头的延迟来产生不同频率的分量，然后通过滤波系数描述来约束各麦克风的输入信号，使得期望方向上的信号得到保留，并在非期望方向上形成零陷，从而实现对固定声源方向的波束形成，TDLs结构的固定波束形成算法能够对固定噪声源方向的信号进行抑制，并且对相干和非相干噪声都能实现有效地抑制，其表达式为式(3)：First of all, due to the different positions of different microphones, there must be time deviations in the received voice signals. Therefore, tapped delay-lines (TDLs) are used to realize beamforming of broadband voice signals, and fixed beamforming of TDLs structure is used. The algorithm generates components of different frequencies through multi-tap delay, and then constrains the input signal of each microphone through the filter coefficient description, so that the signal in the desired direction is retained, and a null is formed in the undesired direction, so as to realize the fixed For beamforming in the direction of the sound source, the fixed beamforming algorithm of the TDLs structure can suppress the signal in the direction of the fixed noise source, and can effectively suppress both coherent and incoherent noise. Its expression is Equation (3):

F＝WD (3)F=WD (3)

式中，矩阵D为方向矩阵，用来对不同角度的语音信号进行频域对齐，W为不同入射角度的语音信号，ω₀，…,ω_J-1，分别代表了不同的频率分量，矩阵F是目标响应矩阵，同样地，每一个分量对应着不同入射角度信号的目标响应，通过设置目标响应矩阵F，就能够决定固定波束形成结构对哪些方向的语音信号进行保留，又对哪些方向的语音信号进行抑制，矩阵W是权重系数矩阵，也是TDLs结构需要设计的部分，通过求解式(3)，得到的矩阵系数解ω_i,j，便是最终需要的设计的滤波器系数；In the formula, the matrix D is the direction matrix, which is used to align the speech signals of different angles in the frequency domain, W is the speech signals of different incident angles, ω ₀ , ..., ω _J-1 , respectively represent different frequency components, the matrix F is the target response matrix. Similarly, each component corresponds to the target response of signals with different incident angles. By setting the target response matrix F, it can be determined which directions of speech signals are reserved by the fixed beamforming structure, and which directions are reserved. To suppress the speech signal, the matrix W is the weight coefficient matrix, which is also the part that needs to be designed in the TDLs structure. By solving the formula (3), the obtained matrix coefficient solution ω _i,j is the final designed filter coefficient;

然后利用信号的输出来自适应地调整类似TDLs结构中的权重系数ω_i,j，来达到对声学环境的变化具有一定鲁棒性的目的，在自适应的波束形成算法中，使用LCMV结构进行调整，LCMV结构是在式(3)的基础上进行调整，调整为式(4)：Then, the output of the signal is used to adaptively adjust the weight coefficient ω _i,j in the similar TDLs structure to achieve a certain robustness to changes in the acoustic environment. In the adaptive beamforming algorithm, the LCMV structure is used to adjust , the LCMV structure is adjusted on the basis of formula (3), and adjusted to formula (4):

其中，R_yy为输入信号Y的自相关矩阵的期望，用R_yy≈YY^H来进行估算，argmin_WW^HR_yyW表示通过最小化输出功率来自适应地调整权重系数W，从而使干扰目标方向的信号得到抑制，求解式(3)与式(4)，便得到系数矩阵W的值：Among them, R _yy is the expectation of the autocorrelation matrix of the input signal Y, which is estimated by R _yy ≈ YY ^H , and argmin _W W ^H R _yy W represents the adaptive adjustment of the weight coefficient W by minimizing the output power, so as to make the interference target The signal in the direction is suppressed, and equations (3) and (4) are solved to obtain the value of the coefficient matrix W:

根据上述解系数矩阵W的值，计算出时域上的差值；According to the value of the above-mentioned solution coefficient matrix W, the difference value in the time domain is calculated;

3)重构原声预测模型进行去噪处理3) Reconstruct the original sound prediction model for denoising processing

在计算出时域差值后，得出的语音信号为失真的语音信号，因为单独使用多麦克风阵列算法的结构，将存在同频语音相减低消的情况，同时对于不同域的语音信号，存在风噪声消除不彻底，导致“音乐噪声”的问题，处理到此处的模型并不具有良好的鲁棒性，因此需要对失真的语音信号进行重新预测，将失真语音作为输入层传入第一步的自编码器模型之前，还需要进行一步滤波去噪处理：After calculating the difference in the time domain, the obtained speech signal is a distorted speech signal, because the structure of the multi-microphone array algorithm is used alone, there will be a situation where the same frequency speech is reduced and canceled. At the same time, for speech signals in different domains, there are The wind noise is not completely eliminated, which leads to the problem of "music noise". The model processed here does not have good robustness, so it is necessary to re-predict the distorted speech signal, and the distorted speech is introduced into the first layer as the input layer. Before the autoencoder model of the first step, a further step of filtering and denoising is required:

是估计的先验信噪比(a prior SNR)，所以整个求解的过程都是围绕如何求解这个先验信噪比进行的，而在这之前，先要估计后验信噪比和语音存在概率，后验信噪比的定义如下：

is the estimated prior signal-to-noise ratio (a prior SNR), so the entire solution process revolves around how to solve this prior signal-to-noise ratio, and before that, the posterior signal-to-noise ratio and the probability of speech existence must be estimated , the posterior signal-to-noise ratio is defined as follows:

是噪声的功率谱，是通过Cohen提出的OMLSA方法求得的，对比γ(t,d)和预先设定的阈值Tr，如果大于这个阈值，则语音的存在的索引I(d)设为1，否则为0，其实这有点类似理想二值掩蔽的概念，即如果是语音主导的就设定为1，否则就是设定为0，那么语音存在概率就能够通过以下方式进行估计：

is the power spectrum of the noise, which is obtained by the OMLSA method proposed by Cohen. Compare γ(t, d) with the preset threshold Tr. If it is greater than this threshold, the index I(d) of the existence of speech is set to 1 , otherwise it is 0. In fact, this is somewhat similar to the concept of ideal binary masking, that is, if it is voice-dominant, it is set to 1, otherwise it is set to 0, then the probability of voice existence can be estimated in the following ways:

p(t,d)＝0.95p(t-1,d)+0.05I(d) (8)p(t,d)=0.95p(t-1,d)+0.05I(d) (8)

能够看出语音存在概率是通过前一时刻的语音存在概率和当前频段的语音存在索引的迭代平均结果，最终先验信噪比能够通过如下方式进行估计：It can be seen that the speech existence probability is the iterative average result of the speech existence probability of the previous moment and the speech existence index of the current frequency band, and the final prior signal-to-noise ratio can be estimated in the following way:

先验信噪比有三部分构成，第一部分是前一时刻的先验信噪比，第二部分是通过DNN估计得到的语音和通过OMLSA方法估计得到的噪声谱而算得的先验信噪比，最后一部分是利用后验信噪比对先验信噪比的最大似然估计，得到结果后再重新输入第一步的自编码器模型，结果为最终的降噪语音。The prior signal-to-noise ratio consists of three parts. The first part is the prior signal-to-noise ratio at the previous moment. The second part is the prior signal-to-noise ratio calculated from the speech estimated by DNN and the noise spectrum estimated by the OMLSA method. The last part is to use the maximum likelihood estimation of the prior signal-to-noise ratio by the posterior signal-to-noise ratio, and then re-input the auto-encoder model of the first step after obtaining the result, and the result is the final noise-reduced speech.

本发明的一种基于去噪自编码器的语音增强方法，它包括的内容有：构建去噪自编码器训练模型，多麦克风阵列获取时域差值，重构原声预测模型进行去噪处理等步骤，能够有效的降低噪声对语音信号的干扰，提高语音信号的信噪比，具有科学合理、结构简单、去噪效果好、适用范围广等优点。A speech enhancement method based on denoising auto-encoder of the present invention includes the following steps: constructing a de-noising auto-encoder training model, acquiring time-domain difference values from a multi-microphone array, reconstructing an original sound prediction model for de-noising processing, etc. The steps can effectively reduce the interference of noise on the speech signal, improve the signal-to-noise ratio of the speech signal, and have the advantages of scientific and reasonable, simple structure, good denoising effect, wide application range and the like.

附图说明Description of drawings

图1为一种基于去噪自编码器的语音增强方法流程图。FIG. 1 is a flow chart of a speech enhancement method based on denoising autoencoder.

具体实施方式Detailed ways

下面利用附图和具体实施方式对本发明作进一步说明。The present invention will be further described below with reference to the accompanying drawings and specific embodiments.

参照图1，本发明的基于去噪自编码器的语音增强方法，它包括的内容有：构建去噪自编码器训练模型，多麦克风阵列获取时域差值，重构原声预测模型进行去噪处理。Referring to Fig. 1, the speech enhancement method based on the denoising autoencoder of the present invention includes the following: constructing a denoising autoencoder training model, obtaining a time domain difference with a multi-microphone array, reconstructing the original sound prediction model for denoising deal with.

式中，

是样本x经过损坏过程

进行基于梯度下降法的近似最小化，

即是样本

is the sample x that has undergone the damage process

perform an approximate minimization based on gradient descent,

the sample

其中，

是训练数据的分布，

表示对

分布的期望值，

表示对

样本

在全量x上的下一个期望值。in,

is the distribution of training data,

express right

the expected value of the distribution,

express right

sample

The next expected value on full x.

F＝WD (3)F=WD (3)

根据上述解系数矩阵W的值，计算出时域上的差值。According to the value of the above solution coefficient matrix W, the difference value in the time domain is calculated.

在计算出时域差值后，得出的语音信号为失真的语音信号，因为单独使用多麦克风阵列算法的结构，将存在同频语音相减低消的情况，同时对于不同域的语音信号，存在风噪声消除不彻底，导致“音乐噪声”的问题，处理到此处的模型并不具有良好的鲁棒性，因此需要对失真的语音信号进行重新预测，将失真语音作为输入层传乳第一步的自编码器模型之前，还需要进行一步滤波去噪处理：After calculating the difference in the time domain, the obtained speech signal is a distorted speech signal, because the structure of the multi-microphone array algorithm is used alone, there will be a situation where the same frequency speech is reduced and canceled. At the same time, for speech signals in different domains, there are The wind noise is not completely eliminated, which leads to the problem of "music noise". The model processed here does not have good robustness. Therefore, it is necessary to re-predict the distorted speech signal, and the distorted speech is used as the input layer to transmit milk first. Before the autoencoder model of the first step, a further step of filtering and denoising is required:

这里的

是估计的先验信噪比(a prior SNR)，所以整个求解的过程都是围绕如何求解这个先验信噪比进行的，而在这之前，先要估计后验信噪比和(aposteriorSNR)和语音存在概率，后验信噪比的定义如下：here

is the estimated prior signal-to-noise ratio (a prior SNR), so the entire solution process revolves around how to solve this prior signal-to-noise ratio, and before this, the posterior signal-to-noise ratio sum (aposteriorSNR) must be estimated and the probability of speech existence, the posterior signal-to-noise ratio is defined as follows:

这里的

是噪声的功率谱，是通过Cohen提出的OMLSA方法求得的(Cohen,2003)，对比γ(t,d)和预先设定的阈值Tr，如果大于这个阈值，则语音的存在的索引I(d)设为1，否则为0，其实这有点类似理想二值掩蔽的概念，即如果是语音主导的就设定为1，否则就是设定为0，那么语音存在概率就能够通过以下方式进行估计：here

is the power spectrum of the noise, which is obtained by the OMLSA method proposed by Cohen (Cohen, 2003). Compare γ(t, d) with the preset threshold Tr. If it is greater than this threshold, the index I ( d) is set to 1, otherwise it is 0. In fact, this is a bit similar to the concept of ideal binary masking, that is, if it is voice-dominant, it is set to 1, otherwise it is set to 0, then the probability of voice existence can be carried out in the following ways. estimate:

p(t,d)＝0.95p(t-1,d)+0.05I(d) (8)p(t,d)=0.95p(t-1,d)+0.05I(d) (8)

可以看出语音存在概率是通过前一时刻的语音存在概率和当前频段的语音存在索引的迭代平均结果，最终先验信噪比能够通过如下方式进行估计：It can be seen that the speech existence probability is the iterative average result of the speech existence probability of the previous moment and the speech existence index of the current frequency band, and the final prior signal-to-noise ratio can be estimated in the following way:

这里的先验信噪比有三部分构成，第一部分是前一时刻的先验信噪比，第二部分是通过DNN估计得到的语音和通过OMLSA方法估计得到的噪声谱而算得的先验信噪比，最后一部分是利用后验信噪比对先验信噪比的最大似然估计，得到结果后再重新输入第一步的自编码器模型，结果为最终的降噪语音。The prior signal-to-noise ratio here consists of three parts, the first is the prior signal-to-noise ratio at the previous moment, and the second is the prior signal-to-noise calculated from the speech estimated by DNN and the noise spectrum estimated by the OMLSA method The last part is the maximum likelihood estimation of the prior signal-to-noise ratio using the posterior signal-to-noise ratio, and then re-input the auto-encoder model of the first step after obtaining the result, and the result is the final noise reduction speech.

本发明的软件程序依据自动化、网络和计算机处理技术编制，是本领域技术人员所熟悉的技术。The software program of the present invention is compiled according to automation, network and computer processing technology, and is a technology familiar to those skilled in the art.

本发明实施例仅用于对本发明作进一步的说明，并非穷举，并不构成对权利要求保护范围的限定，本领域技术人员根据本发明实施例获得的启示，不经过创造性劳动就能够想到其它实质上等同的替代，均在本发明保护范围内。The embodiments of the present invention are only used to further illustrate the present invention, are not exhaustive, and do not constitute a limitation on the protection scope of the claims. Those skilled in the art can think of other Substantially equivalent substitutions are all within the protection scope of the present invention.

Claims

1. a speech enhancement method based on denoising self-encoder, it is characterized in that, the content it includes is: build denoising self-encoder training model, multi-microphone array obtains time domain difference, reconstructs the original sound prediction model to remove noise processing,

1) Build a denoising autoencoder training model

The denoising autoencoder training model is designed as a three-layer network model. The first layer is the input layer, the middle layer is the hidden layer, the number of design nodes is 1024, and the third layer is the output layer. The output layer is compared with the original lossless data. Alignment, minimize the loss value:

In the formula,

is the sample x that has undergone the damage process

perform an approximate minimization based on gradient descent,

the sample

in,

is the distribution of training data,

express right

the expected value of the distribution,

express right

sample

the next expected value on the full x;

2) Multi-microphone array to obtain time domain difference

The advantage of the voice enhancement method of the microphone array is that it considers the position information of the sound source and can realize spatial filtering, so it has an excellent suppression effect on directional noise. Therefore, the technology of the microphone array is applied to suppress the interfering voice. The realization is to reserve the speech signal in the desired direction;

First of all, due to the different positions of different microphones, there must be time deviations in the received voice signals. Therefore, tapped delay-lines (TDLs) are used to realize beamforming of broadband voice signals, and fixed beamforming of TDLs structure is used. The algorithm generates components of different frequencies through multi-tap delay, and then constrains the input signal of each microphone through the filter coefficient description, so that the signal in the desired direction is retained, and a null is formed in the undesired direction, so as to realize the fixed For beamforming in the direction of the sound source, the fixed beamforming algorithm of the TDLs structure can suppress the signal in the direction of the fixed noise source, and can effectively suppress both coherent and incoherent noise. Its expression is Equation (3):

F=WD (3)

In the formula, the matrix D is the direction matrix, which is used to align the speech signals of different angles in the frequency domain, W is the speech signals of different incident angles, ω ₀ , ..., ω _J-1 , respectively represent different frequency components, the matrix F is the target response matrix. Similarly, each component corresponds to the target response of signals with different incident angles. By setting the target response matrix F, it can be determined which directions of speech signals are reserved by the fixed beamforming structure, and which directions are reserved. The speech signal is suppressed, and the matrix W is the weight coefficient matrix, which is also the part of the TDLs structure that needs to be designed. By solving the formula (3), the obtained matrix coefficient solution ω _{i, j} is the final designed filter coefficient;

Then, the output of the signal is used to adaptively adjust the weight coefficient ω _i,j in the structure similar to TDLs, so as to achieve a certain robustness to the changes of the acoustic environment. In the adaptive beamforming algorithm, the LCMV structure is used to adjust , the LCMV structure is adjusted on the basis of formula (3), and adjusted to formula (4):

Among them, R _yy is the expectation of the autocorrelation matrix of the input signal Y, which is estimated by R _yy ≈ YY ^H , and argmin _w W ^H R _yy W represents the adaptive adjustment of the weight coefficient W by minimizing the output power, so as to make the interference target The signal in the direction is suppressed, and equations (3) and (4) are solved to obtain the value of the coefficient matrix W:

According to the value of the above-mentioned solution coefficient matrix W, the difference value in the time domain is calculated;

3) Reconstruct the original sound prediction model for denoising processing

After calculating the difference in the time domain, the obtained speech signal is a distorted speech signal, because the structure of the multi-microphone array algorithm is used alone, there will be a situation where the same frequency speech is reduced and canceled. At the same time, for speech signals in different domains, there are The wind noise is not completely eliminated, which leads to the problem of "music noise". The model processed here does not have good robustness, so it is necessary to re-predict the distorted speech signal, and the distorted speech is introduced into the first layer as the input layer. Before the autoencoder model of the first step, a further step of filtering and denoising is required:

p(t,d)=0.95p(t-1,d)+0.05I(d) (8)

It can be seen that the speech existence probability is the iterative average result of the speech existence probability of the previous moment and the speech existence index of the current frequency band, and the final prior signal-to-noise ratio can be estimated in the following way:

The prior signal-to-noise ratio consists of three parts. The first part is the prior signal-to-noise ratio at the previous moment. The second part is the prior signal-to-noise ratio calculated from the speech estimated by DNN and the noise spectrum estimated by the OMLSA method. The last part is to use the maximum likelihood estimation of the prior signal-to-noise ratio by the posterior signal-to-noise ratio, and then re-input the auto-encoder model of the first step after obtaining the result, and the result is the final noise-reduced speech.