
CN112530451A - Speech enhancement method based on denoising autoencoder - Google Patents


Info

Publication number
CN112530451A
Authority
CN
China
Prior art keywords
speech
signal
noise
matrix
noise ratio
Prior art date
Legal status
Pending
Application number
CN202011128458.7A
Other languages
Chinese (zh)
Inventor
张世强
胡显秋
张婷娟
于乐
顾雷
Current Assignee
Yichun Power Supply Co Of State Grid Heilongjiang Electric Power Co ltd
State Grid Corp of China SGCC
Northeast Electric Power University
Original Assignee
Yichun Power Supply Co Of State Grid Heilongjiang Electric Power Co ltd
State Grid Corp of China SGCC
Northeast Dianli University
Priority date
Filing date
Publication date
Application filed by Yichun Power Supply Co Of State Grid Heilongjiang Electric Power Co ltd, State Grid Corp of China SGCC, Northeast Dianli University filed Critical Yichun Power Supply Co Of State Grid Heilongjiang Electric Power Co ltd
Priority to CN202011128458.7A
Publication of CN112530451A
Legal status: Pending


Classifications

    • G10L21/0208 Noise filtering (under G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L19/04 Speech or audio analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/60 Speech or voice analysis specially adapted for measuring the quality of voice signals
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • G10L25/30 Speech or voice analysis characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The speech enhancement method based on a denoising autoencoder of the present invention comprises: constructing a denoising autoencoder training model, obtaining time-domain differences with a multi-microphone array, and reconstructing an original-speech prediction model for denoising. The method effectively reduces the interference of noise with the speech signal and significantly improves its signal-to-noise ratio, and it has the advantages of being scientific and rational, simple in structure, effective at denoising, and widely applicable.

Description

Speech enhancement method based on denoising autoencoder

Technical field

The invention belongs to the technical field of speech signal processing and relates to a speech enhancement method based on a denoising autoencoder.

Background

Speech noise reduction is an important front end of speech processing systems. Background noise and interfering talkers degrade the quality and intelligibility of speech signals and cause performance losses in practical applications, including voice communication, hearing aids, and speech recognition. A key goal of speech noise reduction is to improve quality and intelligibility in the presence of interfering noise.

The most commonly used method in speech noise reduction is spectral subtraction, which has the virtues of a simple algorithm and a small computational load. Its drawback is that the processed output contains residual "musical noise". Noise reduction based on adaptive filtering uses the filter parameters and filtering result of the previous frame to adjust the filter parameters of the current frame automatically; it requires little prior knowledge of the clean speech and the noise and can therefore adapt to their unknown random variations and statistics, so the denoised speech shows a clear improvement in both signal-to-noise ratio and perceived quality. However, such algorithms often converge slowly and are unsuited to non-stationary noise. Noise reduction based on minimum mean-square error (MMSE) estimation can effectively suppress residual musical noise, but at low signal-to-noise ratios its discrimination between speech and non-speech frames is highly error-prone, severely distorting the denoised speech.

Subspace-based noise reduction decomposes the whole space into a pure-noise subspace and a pure-speech subspace. By designing an estimator that constrains the residual signal spectrum while minimizing speech distortion, the noise subspace is removed and the speech-signal eigenvalues are estimated, achieving noise reduction. One of the most common variants is built on an optimally constrained estimator, but its complexity is so high that it is difficult to implement on embedded platforms. The wavelet transform is a newer transform-analysis method capable of localized frequency analysis in time or space: dilation and translation refine the signal scale by scale, giving multiresolution analysis that adapts to the requirements of signal analysis, and it is now widely used in audio and image processing. Because the wavelet transform effectively decorrelates the data, the energy of the clean speech concentrates in the larger wavelet coefficients in the wavelet domain while the noise energy concentrates in the smaller ones. This is essentially wavelet-domain filtering, and choosing an appropriate threshold is the key to system performance; the threshold is hard to obtain, the algorithmic complexity keeps growing, and the approach is difficult to use in real-time communication.

Deep neural networks (DNNs) have become increasingly popular for speech noise reduction. A DNN-based algorithm stacks autoencoders into a deep network whose input is the log-power spectrum of the noisy speech and whose output is the log-power spectrum of the corresponding clean speech. Although such networks denoise better than traditional single-channel algorithms, they are difficult to train and perform poorly at low signal-to-noise ratios.

Summary of the invention

The purpose of the present invention is to reduce the interference of noise with the speech signal and to improve its signal-to-noise ratio, by proposing a speech enhancement method based on a denoising autoencoder that realizes enhancement of the speech signal.

The object of the present invention is achieved by the following technical solution: a speech enhancement method based on a denoising autoencoder, characterized in that it comprises constructing a denoising autoencoder training model, obtaining the time-domain difference with a multi-microphone array, and reconstructing the original-speech prediction model for denoising:

1) Constructing the denoising autoencoder training model

The denoising autoencoder training model is designed as a three-layer network: the first layer is the input layer, the middle layer is a hidden layer with 1024 nodes, and the third layer is the output layer. The output layer is compared against the original undamaged data and the loss is minimized:

L(x, g(f(x̃)))    (1)

where x̃ is the corrupted sample obtained by passing the sample x through the corruption process C(x̃ | x). The distribution p_decoder is usually a factorial distribution whose mean parameters are given by a feedforward network. The negative log-likelihood -log p_decoder(x | h) is approximately minimized by gradient descent, where h = f(x̃) is the code computed from the corrupted sample x̃. This makes the denoising autoencoder a deterministic feedforward network that can be trained in exactly the same way as any other feedforward network, so the whole autoencoder can be viewed as performing gradient descent on the following expectation:

-E_{x∼p̂_data(x)} E_{x̃∼C(x̃|x)} log p_decoder(x | h = f(x̃))    (2)

where p̂_data(x) is the distribution of the training data, E_{x∼p̂_data(x)} denotes the expectation over that distribution, and E_{x̃∼C(x̃|x)} denotes the expectation over corrupted samples x̃ drawn from C(x̃ | x), taken over the full set of x.

2) Obtaining the time-domain difference with a multi-microphone array

The advantage of microphone-array speech enhancement is that it exploits the position information of the sound source and can perform spatial filtering, so it suppresses directional noise particularly well. The microphone-array technique is therefore applied here to suppress interfering speech; concretely, the speech signal arriving from the desired direction is preserved.

First, because the microphones sit at different positions, the speech signals they receive are necessarily offset in time. Tapped delay lines (TDLs) are therefore used to beamform the wideband speech signal. The fixed beamforming algorithm of the TDL structure generates components at different frequencies through multi-tap delays and then constrains the input signal of each microphone through the filter coefficients, so that the signal in the desired direction is preserved while nulls are formed in undesired directions, achieving beamforming toward a fixed source direction. The fixed TDL beamformer can suppress signals from a fixed noise-source direction and suppresses both coherent and incoherent noise effectively; it is expressed as Eq. (3):

F = WD    (3)

where D is the direction matrix, used to align speech signals from different incidence angles in the frequency domain, W corresponds to the speech signals at the different incidence angles, and ω_0, …, ω_{J-1} represent the different frequency components. F is the target-response matrix; each of its components is the target response for a signal at a given incidence angle, so by setting F one decides which directions the fixed beamforming structure preserves and which it suppresses. W is the weight-coefficient matrix, the part of the TDL structure that must be designed: solving Eq. (3) gives the matrix coefficients ω_{i,j}, which are the filter coefficients finally required;
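A minimal narrowband illustration of solving Eq. (3) for the weights: the array geometry, analysis frequency, and incidence angles below are arbitrary assumptions, and the broadband TDL structure is collapsed to a single frequency so that the direction matrix D reduces to a set of steering vectors.

```python
import numpy as np

c = 343.0                  # speed of sound (m/s), assumed
f = 1000.0                 # analysis frequency (Hz), assumed
M = 8                      # number of microphones, assumed
d_mic = 0.04               # element spacing (m), assumed
pos = np.arange(M) * d_mic

def steering(theta_deg):
    """Phase delays across a uniform linear array for one incidence angle."""
    tau = pos * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * f * tau)

# Direction matrix D: one column per incidence angle (look + interferers).
angles = [0.0, 40.0, -50.0]
D = np.stack([steering(a) for a in angles], axis=1)   # M x K

# Target response F: preserve 0 degrees, null the other two directions.
F = np.array([1.0, 0.0, 0.0])

# Solve the directional constraints for the weights (minimum-norm solution).
w, *_ = np.linalg.lstsq(D.conj().T, F, rcond=None)

# Realized response of the beamformer at the three angles.
response = np.abs(D.conj().T @ w)
```

With more microphones than constraints, the constraints can be met exactly, so the realized response is 1 toward the look direction and 0 toward the two nulled directions.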

The signal output is then used to adaptively adjust the weight coefficients ω_{i,j} of the TDL-like structure, so that the beamformer remains somewhat robust to changes in the acoustic environment. The adaptive beamforming algorithm uses the LCMV structure, which adjusts Eq. (3) into Eq. (4):

argmin_W W^H R_yy W,  subject to F = WD    (4)

where R_yy is the expectation of the autocorrelation matrix of the input signal Y, estimated as R_yy ≈ YY^H, and argmin_W W^H R_yy W adaptively adjusts the weight coefficients W by minimizing the output power, so that signals from interfering directions are suppressed. Solving Eqs. (3) and (4) gives the value of the coefficient matrix W:

W = R_yy^{-1} D (D^H R_yy^{-1} D)^{-1} F    (5)

From the value of the coefficient matrix W solved above, the difference in the time domain is computed;
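The LCMV step can be sketched as follows. This is a hedged illustration, not the patent's code: the direction matrix is a random stand-in rather than real steering vectors, R_yy is estimated from snapshots as R_yy ≈ YY^H with diagonal loading, and the closed form W = R_yy^{-1} D (D^H R_yy^{-1} D)^{-1} F is the standard LCMV solution consistent with Eqs. (3) and (4).

```python
import numpy as np

rng = np.random.default_rng(1)

M, K, N = 8, 2, 400                      # mics, constraints, snapshots (assumed)

# Stand-in direction matrix (real steering vectors would go here).
D = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
F = np.array([1.0, 0.0])                 # unit gain on direction 0, null on 1

# Snapshot estimate R_yy ~ Y Y^H / N, with diagonal loading for stability.
Y = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
Ryy = Y @ Y.conj().T / N + 1e-6 * np.eye(M)

# Closed-form LCMV weights: W = Ryy^{-1} D (D^H Ryy^{-1} D)^{-1} F.
Ri_D = np.linalg.solve(Ryy, D)
w = Ri_D @ np.linalg.solve(D.conj().T @ Ri_D, F)

constraint = D.conj().T @ w              # should reproduce the target response F
out_power = (w.conj() @ Ryy @ w).real    # minimized output power
```

The directional constraints are satisfied exactly by construction, while the remaining degrees of freedom minimize the output power, i.e. suppress energy from interfering directions.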

3) Reconstructing the original-speech prediction model for denoising

After the time-domain difference is computed, the resulting speech signal is distorted: using the multi-microphone array structure alone, speech components at the same frequency partially cancel each other, and for speech signals in different domains wind noise is removed incompletely, producing "musical noise". The model at this stage is therefore not robust, and the distorted speech must be re-predicted. Before the distorted speech is fed as the input layer into the autoencoder model of step 1, one further filtering and denoising step is required:

Ŝ(t,d) = ξ̂(t,d) / (1 + ξ̂(t,d)) · Y(t,d)    (6)

where ξ̂(t,d) is the estimated a priori SNR, so the whole solution process revolves around estimating this a priori SNR. Before that, the a posteriori SNR and the speech presence probability must be estimated; the a posteriori SNR is defined as:

γ(t,d) = |Y(t,d)|^2 / λ(t,d)    (7)

where λ(t,d) is the noise power spectrum, obtained with the OMLSA method proposed by Cohen. γ(t,d) is compared with a preset threshold Tr: if it exceeds the threshold, the speech presence index I(d) is set to 1, otherwise to 0. This is somewhat similar to ideal binary masking, where a speech-dominated unit is set to 1 and otherwise to 0. The speech presence probability can then be estimated as:

p(t,d) = 0.95 p(t-1,d) + 0.05 I(d)    (8)

The speech presence probability is thus an iterative average of the speech presence probability at the previous time step and the speech presence index of the current frequency band. Finally, the a priori SNR can be estimated as:

ξ̂(t,d) = α ξ̂(t-1,d) + β |Ŝ_DNN(t,d)|^2 / λ(t,d) + (1 - α - β) max(γ(t,d) - 1, 0)    (9)

The a priori SNR consists of three parts: the first is the a priori SNR at the previous time step; the second is the a priori SNR computed from the speech estimated by the DNN and the noise spectrum estimated by the OMLSA method; and the last is the maximum-likelihood estimate of the a priori SNR from the a posteriori SNR. The result is then fed back into the autoencoder model of step 1, and the output is the final noise-reduced speech.
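Step 3 can be sketched for a single frequency band d. Everything numeric below is an assumption (the threshold Tr, the smoothing constants, the synthetic signal), and the Wiener-type gain ξ/(1+ξ) and the decision-directed update are standard forms consistent with, but not guaranteed identical to, the patent's Eqs. (6) and (9); only the recursion of Eq. (8) is taken verbatim.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 200
noise_power = 1.0                       # lambda(t, d), e.g. from OMLSA (assumed known)
speech = np.zeros(T)
speech[80:] = 9.0                       # speech energy present from frame 80 on
Y2 = noise_power * rng.exponential(size=T) + speech   # |Y(t, d)|^2

Tr = 4.0                                # threshold on the a posteriori SNR (assumed)
alpha = 0.98                            # decision-directed smoothing (assumed)
p = 0.0                                 # speech presence probability
xi = 1.0                                # a priori SNR estimate
p_hist, gain_hist = [], []
for t in range(T):
    gamma = Y2[t] / noise_power         # a posteriori SNR, Eq. (7)
    I = 1.0 if gamma > Tr else 0.0      # speech presence index
    p = 0.95 * p + 0.05 * I             # Eq. (8), verbatim
    # Simplified decision-directed a priori SNR: previous estimate + ML term
    # (the DNN-based middle term of Eq. (9) is omitted in this sketch).
    xi = alpha * xi + (1 - alpha) * max(gamma - 1.0, 0.0)
    G = xi / (1.0 + xi)                 # Wiener-type gain, assumed form of Eq. (6)
    p_hist.append(p)
    gain_hist.append(G)
```

On noise-only frames the presence probability stays near zero and the gain attenuates the band; once speech persists, γ(t,d) stays above Tr, p(t,d) climbs toward 1, and the gain approaches unity, which is the behavior the text describes before the output is fed back into the step-1 autoencoder.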

The speech enhancement method based on a denoising autoencoder of the present invention comprises the steps of constructing a denoising autoencoder training model, obtaining the time-domain difference with a multi-microphone array, and reconstructing the original-speech prediction model for denoising. It effectively reduces the interference of noise with the speech signal and improves the signal-to-noise ratio, and it has the advantages of being scientific and rational, simple in structure, effective at denoising, and widely applicable.

Brief description of the drawings

FIG. 1 is a flow chart of the speech enhancement method based on a denoising autoencoder.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing and a specific embodiment.

Referring to FIG. 1, the speech enhancement method based on a denoising autoencoder of the present invention comprises the steps described above: constructing the denoising autoencoder training model, obtaining the time-domain difference with the multi-microphone array, and reconstructing the original-speech prediction model for denoising.

1)构建去噪自编码器训练模型1) Build a denoising autoencoder training model

去噪自编码器训练模型设计为三层网络模型,第一层为输入层,中间层为隐藏层,设计节点个数为1024个,第三层为输出层,将输出层与原始无损数据进行比对,最小化损失值:The denoising autoencoder training model is designed as a three-layer network model. The first layer is the input layer, the middle layer is the hidden layer, the number of design nodes is 1024, and the third layer is the output layer. The output layer is compared with the original lossless data. Alignment, minimize the loss value:

Figure BDA0002734310050000042
Figure BDA0002734310050000042

式中,

Figure BDA0002734310050000043
是样本x经过损坏过程
Figure BDA0002734310050000044
后得到的损坏样本,通常分布pdecoder是因子的分布,平局参数由前馈网络给出,这里对负对数释然
Figure BDA0002734310050000045
进行基于梯度下降法的近似最小化,
Figure BDA0002734310050000046
即是样本
Figure BDA0002734310050000047
的概率分布,这样构成了确定的自编码器,也就是一个前馈的网络,并且能够使用与其他前馈网络完全相同的方式进行训练,因此整个自动编码器就可类比为下一个期望的梯度下降:In the formula,
Figure BDA0002734310050000043
is the sample x that has undergone the damage process
Figure BDA0002734310050000044
The damaged samples obtained later, usually the distribution p decoder is the distribution of factors, and the draw parameters are given by the feedforward network, where the negative logarithm is relieved
Figure BDA0002734310050000045
perform an approximate minimization based on gradient descent,
Figure BDA0002734310050000046
the sample
Figure BDA0002734310050000047
The probability distribution of , which constitutes a deterministic auto-encoder, which is a feed-forward network, and can be trained in exactly the same way as other feed-forward networks, so the entire auto-encoder can be analogized to the next expected gradient decline:

Figure BDA0002734310050000048
Figure BDA0002734310050000048

其中,

Figure BDA0002734310050000051
是训练数据的分布,
Figure BDA0002734310050000052
表示对
Figure BDA0002734310050000053
分布的期望值,
Figure BDA0002734310050000054
表示对
Figure BDA0002734310050000055
样本
Figure BDA0002734310050000056
在全量x上的下一个期望值。in,
Figure BDA0002734310050000051
is the distribution of training data,
Figure BDA0002734310050000052
express right
Figure BDA0002734310050000053
the expected value of the distribution,
Figure BDA0002734310050000054
express right
Figure BDA0002734310050000055
sample
Figure BDA0002734310050000056
The next expected value on full x.

2)多麦克风阵列获取时域差值2) Multi-microphone array to obtain time domain difference

麦克风阵列的语音增强方法的优势在于考虑了声源的位置信息,能够实现空间滤波,所以对具有方向性的噪声具有优良的抑制效果,因此,将麦克风阵列的技术应用在抑制干扰语音中,具体实现是对期望方向的语音信号进行保留;The advantage of the voice enhancement method of the microphone array is that it considers the position information of the sound source and can realize spatial filtering, so it has an excellent suppression effect on directional noise. Therefore, the technology of the microphone array is applied to suppress the interfering voice. The realization is to reserve the speech signal in the desired direction;

首先,不同的麦克风由于位置不同,所以接收的语音信号必定存在着时间偏差,因此利用抽头延迟线结构(Tapped Delay-lines,TDLs)来实现对宽带语音信号的波束形成,TDLs结构的固定波束形成算法,通过多抽头的延迟来产生不同频率的分量,然后通过滤波系数描述来约束各麦克风的输入信号,使得期望方向上的信号得到保留,并在非期望方向上形成零陷,从而实现对固定声源方向的波束形成,TDLs结构的固定波束形成算法能够对固定噪声源方向的信号进行抑制,并且对相干和非相干噪声都能实现有效地抑制,其表达式为式(3):First of all, due to the different positions of different microphones, there must be time deviations in the received voice signals. Therefore, tapped delay-lines (TDLs) are used to realize beamforming of broadband voice signals, and fixed beamforming of TDLs structure is used. The algorithm generates components of different frequencies through multi-tap delay, and then constrains the input signal of each microphone through the filter coefficient description, so that the signal in the desired direction is retained, and a null is formed in the undesired direction, so as to realize the fixed For beamforming in the direction of the sound source, the fixed beamforming algorithm of the TDLs structure can suppress the signal in the direction of the fixed noise source, and can effectively suppress both coherent and incoherent noise. Its expression is Equation (3):

F=WD (3)F=WD (3)

式中,矩阵D为方向矩阵,用来对不同角度的语音信号进行频域对齐,W为不同入射角度的语音信号,ω0,…,ωJ-1,分别代表了不同的频率分量,矩阵F是目标响应矩阵,同样地,每一个分量对应着不同入射角度信号的目标响应,通过设置目标响应矩阵F,就能够决定固定波束形成结构对哪些方向的语音信号进行保留,又对哪些方向的语音信号进行抑制,矩阵W是权重系数矩阵,也是TDLs结构需要设计的部分,通过求解式(3),得到的矩阵系数解ωi,j,便是最终需要的设计的滤波器系数;In the formula, the matrix D is the direction matrix, which is used to align the speech signals of different angles in the frequency domain, W is the speech signals of different incident angles, ω 0 , ..., ω J-1 , respectively represent different frequency components, the matrix F is the target response matrix. Similarly, each component corresponds to the target response of signals with different incident angles. By setting the target response matrix F, it can be determined which directions of speech signals are reserved by the fixed beamforming structure, and which directions are reserved. To suppress the speech signal, the matrix W is the weight coefficient matrix, which is also the part that needs to be designed in the TDLs structure. By solving the formula (3), the obtained matrix coefficient solution ω i,j is the final designed filter coefficient;

然后利用信号的输出来自适应地调整类似TDLs结构中的权重系数ωi,j,来达到对声学环境的变化具有一定鲁棒性的目的,在自适应的波束形成算法中,使用LCMV结构进行调整,LCMV结构是在式(3)的基础上进行调整,调整为式(4):Then, the output of the signal is used to adaptively adjust the weight coefficient ω i,j in the similar TDLs structure to achieve a certain robustness to changes in the acoustic environment. In the adaptive beamforming algorithm, the LCMV structure is used to adjust , the LCMV structure is adjusted on the basis of formula (3), and adjusted to formula (4):

Figure BDA0002734310050000057
Figure BDA0002734310050000057

其中,Ryy为输入信号Y的自相关矩阵的期望,用Ryy≈YYH来进行估算,argminWWHRyyW表示通过最小化输出功率来自适应地调整权重系数W,从而使干扰目标方向的信号得到抑制,求解式(3)与式(4),便得到系数矩阵W的值:Among them, R yy is the expectation of the autocorrelation matrix of the input signal Y, which is estimated by R yy ≈ YY H , and argmin W W H R yy W represents the adaptive adjustment of the weight coefficient W by minimizing the output power, so as to make the interference target The signal in the direction is suppressed, and equations (3) and (4) are solved to obtain the value of the coefficient matrix W:

Figure BDA0002734310050000058
Figure BDA0002734310050000058

根据上述解系数矩阵W的值,计算出时域上的差值。According to the value of the above solution coefficient matrix W, the difference value in the time domain is calculated.

3)重构原声预测模型进行去噪处理3) Reconstruct the original sound prediction model for denoising processing

在计算出时域差值后,得出的语音信号为失真的语音信号,因为单独使用多麦克风阵列算法的结构,将存在同频语音相减低消的情况,同时对于不同域的语音信号,存在风噪声消除不彻底,导致“音乐噪声”的问题,处理到此处的模型并不具有良好的鲁棒性,因此需要对失真的语音信号进行重新预测,将失真语音作为输入层传乳第一步的自编码器模型之前,还需要进行一步滤波去噪处理:After calculating the difference in the time domain, the obtained speech signal is a distorted speech signal, because the structure of the multi-microphone array algorithm is used alone, there will be a situation where the same frequency speech is reduced and canceled. At the same time, for speech signals in different domains, there are The wind noise is not completely eliminated, which leads to the problem of "music noise". The model processed here does not have good robustness. Therefore, it is necessary to re-predict the distorted speech signal, and the distorted speech is used as the input layer to transmit milk first. Before the autoencoder model of the first step, a further step of filtering and denoising is required:

$$\hat{S}(t,d) = \frac{\hat{\xi}(t,d)}{1 + \hat{\xi}(t,d)}\, Y(t,d) \tag{6}$$

where $\hat{\xi}(t,d)$ is the estimated a priori SNR, so the entire solution process revolves around how to obtain this prior SNR. Before that, the a posteriori SNR and the speech presence probability must first be estimated. The posterior SNR is defined as follows:

$$\gamma(t,d) = \frac{|Y(t,d)|^{2}}{\hat{\lambda}(t,d)} \tag{7}$$

where $\hat{\lambda}(t,d)$ is the noise power spectrum, obtained by the OMLSA method proposed by Cohen (Cohen, 2003). γ(t,d) is compared with a preset threshold Tr: if it exceeds the threshold, the speech presence index I(d) is set to 1, otherwise to 0. This resembles the concept of an ideal binary mask, in which a time-frequency unit is set to 1 if it is speech-dominated and to 0 otherwise. The speech presence probability can then be estimated as follows:

p(t,d) = 0.95 p(t-1,d) + 0.05 I(d) (8)

It can be seen that the speech presence probability is an iterative average of the speech presence probability of the previous frame and the speech presence index of the current frequency band. Finally, the prior SNR can be estimated as follows:

$$\hat{\xi}(t,d) = \alpha\,\hat{\xi}(t-1,d) + \beta\,\frac{|\hat{S}_{\mathrm{DNN}}(t,d)|^{2}}{\hat{\lambda}(t,d)} + (1-\alpha-\beta)\max\left\{\gamma(t,d)-1,\; 0\right\} \tag{9}$$

The prior SNR here consists of three parts: the first is the prior SNR of the previous frame; the second is the prior SNR computed from the speech estimated by the DNN and the noise spectrum estimated by the OMLSA method; the last is the maximum-likelihood estimate of the prior SNR obtained from the posterior SNR. Once the result is obtained, it is fed back into the autoencoder model of the first step, whose output is the final denoised speech.
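The per-frame estimation chain of equations (6)–(9) can be sketched as follows. This is a hedged illustration: the smoothing weights `alpha` and `beta`, the threshold `Tr`, the DNN speech estimate, and the constant noise spectrum are placeholder assumptions, since the patent fixes none of these values; a real system would take the noise PSD from OMLSA and the speech estimate from the trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D_bins = 50, 129                                 # frames, frequency bins
Y = rng.standard_normal((T, D_bins)) * 0.5 + 1.0    # noisy magnitude spectrum (stand-in)
S_dnn = np.abs(Y) * 0.8                             # hypothetical DNN speech estimate
lam = np.full(D_bins, 0.25)                         # noise PSD (stand-in for the OMLSA estimate)
Tr = 4.0                                            # speech-presence threshold (illustrative)
alpha, beta = 0.7, 0.2                              # illustrative smoothing weights

p = np.zeros(D_bins)                                # speech presence probability
xi = np.ones(D_bins)                                # prior SNR
S_hat = np.zeros_like(Y)                            # enhanced spectrum

for t in range(T):
    gamma = np.abs(Y[t]) ** 2 / lam                 # posterior SNR, eq. (7)
    I = (gamma > Tr).astype(float)                  # speech presence index
    p = 0.95 * p + 0.05 * I                         # iterative average, eq. (8)
    # eq. (9): previous prior SNR + DNN/OMLSA term + ML term from the posterior SNR
    xi = (alpha * xi
          + beta * S_dnn[t] ** 2 / lam
          + (1 - alpha - beta) * np.maximum(gamma - 1.0, 0.0))
    S_hat[t] = xi / (1.0 + xi) * Y[t]               # Wiener-type gain, eq. (6)

print(p.min() >= 0.0 and p.max() <= 1.0)            # True: probabilities stay in [0, 1]
```

Since each update of p is a convex combination of the previous probability and a binary index, the estimate remains a valid probability, and the gain ξ/(1+ξ) stays in (0, 1), attenuating bins with low estimated prior SNR.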

The software program of the present invention is written using automation, networking and computer processing techniques familiar to those skilled in the art.

The embodiments of the present invention are given only to further illustrate the invention; they are not exhaustive and do not limit the protection scope of the claims. Other substantially equivalent substitutions that those skilled in the art can derive from these embodiments without creative effort all fall within the protection scope of the present invention.

Claims (1)

1. A speech enhancement method based on a denoising autoencoder, characterized in that it comprises: building a denoising autoencoder training model, obtaining the time-domain difference with a multi-microphone array, and reconstructing the original-speech prediction model for denoising;

1) Building the denoising autoencoder training model

The denoising autoencoder training model is designed as a three-layer network: the first layer is the input layer, the middle layer is a hidden layer with 1024 nodes, and the third layer is the output layer; the output layer is compared against the original lossless data so as to minimize the loss:

$$-\log p_{\mathrm{decoder}}\left(x \mid h = f(\tilde{x})\right) \tag{1}$$

where $\tilde{x}$ is the corrupted sample obtained by passing the sample $x$ through the corruption process $C(\tilde{x} \mid x)$; the distribution $p_{\mathrm{decoder}}$ is usually a factorial distribution whose mean parameters are given by a feed-forward network; the negative log-likelihood $-\log p_{\mathrm{decoder}}(x \mid h = f(\tilde{x}))$ is approximately minimized by gradient descent, $C(\tilde{x} \mid x)$ being the conditional distribution of the corrupted sample $\tilde{x}$ given $x$; this constitutes a deterministic autoencoder, i.e. a feed-forward network, which can be trained in exactly the same way as any other feed-forward network, so the whole autoencoder amounts to performing gradient descent on the following expectation:

$$-\mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}(x)}\,\mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)} \log p_{\mathrm{decoder}}\left(x \mid h = f(\tilde{x})\right) \tag{2}$$

where $\hat{p}_{\mathrm{data}}(x)$ is the distribution of the training data, $\mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}(x)}$ denotes the expectation over that distribution, and $\mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)}$ denotes the expectation over corrupted samples $\tilde{x}$ drawn given $x$;

2) Obtaining the time-domain difference with a multi-microphone array

The advantage of the microphone-array speech enhancement method is that it takes the position of the sound source into account and can perform spatial filtering, so it suppresses directional noise very effectively; the microphone-array technique is therefore applied to suppressing interfering speech, implemented by preserving the speech signal from the desired direction;

First, since the microphones are at different positions, the received speech signals necessarily exhibit time offsets, so tapped delay lines (TDLs) are used to realize beamforming of the wideband speech signal; the fixed beamforming algorithm of the TDL structure generates components at different frequencies through multi-tap delays and then constrains the input signal of each microphone through the filter coefficients, so that the signal in the desired direction is preserved while nulls are formed in the undesired directions, thereby realizing beamforming toward a fixed source direction; the fixed beamforming algorithm of the TDL structure can suppress signals from a fixed noise-source direction and effectively suppresses both coherent and incoherent noise; its expression is equation (3):

$$F = WD \tag{3}$$

where D is the direction matrix, used to align the speech signals arriving from different incidence angles in the frequency domain, with $\omega_0, \ldots, \omega_{J-1}$ representing the different frequency components; F is the target-response matrix, each component of which corresponds to the target response for a signal at a particular incidence angle; by setting the target-response matrix F, it is decided which directions of speech the fixed beamforming structure preserves and which it suppresses; the matrix W is the weight-coefficient matrix, the part of the TDL structure that must be designed, and solving equation (3) for the matrix coefficients $\omega_{i,j}$ yields the filter coefficients that are finally required;

Then the signal output is used to adaptively adjust the weight coefficients $\omega_{i,j}$ of the TDL-like structure, so as to obtain a degree of robustness to changes in the acoustic environment; in the adaptive beamforming algorithm, the LCMV structure is used for this adjustment; it extends equation (3) into equation (4):

$$W = \arg\min_{W} W^{H} R_{yy} W \quad \text{subject to} \quad WD = F \tag{4}$$

where $R_{yy}$ is the expected autocorrelation matrix of the input signal Y, estimated by $R_{yy} \approx YY^{H}$, and $\arg\min_{W} W^{H} R_{yy} W$ denotes adaptively adjusting the weight coefficients W by minimizing the output power, so that signals from interfering directions are suppressed; solving equations (3) and (4) together yields the coefficient matrix W:

$$W = F\left(D^{H} R_{yy}^{-1} D\right)^{-1} D^{H} R_{yy}^{-1} \tag{5}$$

From the value of the coefficient matrix W obtained above, the difference in the time domain is calculated;

3) Reconstructing the original-speech prediction model for denoising

After the time-domain difference has been calculated, the resulting speech signal is distorted: with the multi-microphone array structure alone, speech components at the same frequency may cancel each other, and for speech signals in different domains the wind noise is not completely removed, leading to the "musical noise" problem; the model up to this point is therefore not sufficiently robust, so the distorted speech must be re-predicted; before the distorted speech is passed as the input layer into the autoencoder model of the first step, one further filtering and denoising step is required:

$$\hat{S}(t,d) = \frac{\hat{\xi}(t,d)}{1 + \hat{\xi}(t,d)}\, Y(t,d) \tag{6}$$

where $\hat{\xi}(t,d)$ is the estimated a priori SNR, so the entire solution process revolves around obtaining this prior SNR; before that, the a posteriori SNR and the speech presence probability must first be estimated; the posterior SNR is defined as:

$$\gamma(t,d) = \frac{|Y(t,d)|^{2}}{\hat{\lambda}(t,d)} \tag{7}$$

where $\hat{\lambda}(t,d)$ is the noise power spectrum, obtained by the OMLSA method proposed by Cohen; γ(t,d) is compared with a preset threshold Tr: if it exceeds the threshold, the speech presence index I(d) is set to 1, otherwise to 0; this resembles the concept of an ideal binary mask, in which a time-frequency unit is set to 1 if it is speech-dominated and to 0 otherwise; the speech presence probability can then be estimated as:

p(t,d) = 0.95 p(t-1,d) + 0.05 I(d) (8)

It can be seen that the speech presence probability is an iterative average of the speech presence probability of the previous frame and the speech presence index of the current frequency band; finally, the prior SNR is estimated as:

$$\hat{\xi}(t,d) = \alpha\,\hat{\xi}(t-1,d) + \beta\,\frac{|\hat{S}_{\mathrm{DNN}}(t,d)|^{2}}{\hat{\lambda}(t,d)} + (1-\alpha-\beta)\max\left\{\gamma(t,d)-1,\; 0\right\} \tag{9}$$

The prior SNR consists of three parts: the first is the prior SNR of the previous frame; the second is the prior SNR computed from the speech estimated by the DNN and the noise spectrum estimated by the OMLSA method; the last is the maximum-likelihood estimate of the prior SNR obtained from the posterior SNR; once the result is obtained, it is fed back into the autoencoder model of the first step, whose output is the final denoised speech.
CN202011128458.7A 2020-10-20 2020-10-20 Speech enhancement method based on denoising autoencoder Pending CN112530451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011128458.7A CN112530451A (en) 2020-10-20 2020-10-20 Speech enhancement method based on denoising autoencoder


Publications (1)

Publication Number Publication Date
CN112530451A true CN112530451A (en) 2021-03-19

Family

ID=74979054


Country Status (1)

Country Link
CN (1) CN112530451A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345469A (en) * 2021-05-24 2021-09-03 北京小米移动软件有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN114400023A (en) * 2022-01-22 2022-04-26 天津中科听芯科技有限公司 Method and equipment for detecting voice quality of hearing aid
CN114723663A (en) * 2022-03-03 2022-07-08 中国人民解放军战略支援部队信息工程大学 A Preprocessing Defense Method Against Target Detection Adversarial Attacks
CN115662444A (en) * 2022-12-14 2023-01-31 北京惠朗时代科技有限公司 Electronic seal voice interactive application method and system based on artificial intelligence
CN116774149A (en) * 2023-08-10 2023-09-19 海底鹰深海科技股份有限公司 Underwater acoustic communication and positioning integrated system
CN117037827A (en) * 2023-08-10 2023-11-10 长沙东玛克信息科技有限公司 Multi-channel microphone array voice modulation method
CN117349603A (en) * 2023-12-06 2024-01-05 小舟科技有限公司 Adaptive noise reduction method and device for electroencephalogram signals, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1633121A1 (en) * 2004-09-03 2006-03-08 Harman Becker Automotive Systems GmbH Speech signal processing with combined adaptive noise reduction and adaptive echo compensation
US9813808B1 (en) * 2013-03-14 2017-11-07 Amazon Technologies, Inc. Adaptive directional audio enhancement and selection
CN107396158A (en) * 2017-08-21 2017-11-24 深圳创维-Rgb电子有限公司 A kind of acoustic control interactive device, acoustic control exchange method and television set
CN108922554A (en) * 2018-06-04 2018-11-30 南京信息工程大学 The constant Wave beam forming voice enhancement algorithm of LCMV frequency based on logarithm Power estimation
EP3462452A1 (en) * 2012-08-24 2019-04-03 Oticon A/s Noise estimation for use with noise reduction and echo cancellation in personal communication
CN109994120A (en) * 2017-12-29 2019-07-09 福州瑞芯微电子股份有限公司 Sound enhancement method, system, speaker and storage medium based on diamylose
CN111755013A (en) * 2020-07-07 2020-10-09 苏州思必驰信息科技有限公司 Denoising automatic encoder training method and speaker recognition system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ROHITH MARS: "A frequency-invariant fixed beamformer for speech enhancement", Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific *
杨蕾: "Research on speech enhancement methods based on microphone arrays", China Master's Theses Full-text Database *
陈鑫源: "Research on adaptive dual-data-stream speech enhancement methods", China Master's Theses Full-text Database *


Similar Documents

Publication Publication Date Title
CN112530451A (en) Speech enhancement method based on denoising autoencoder
CN110148420A (en) A kind of audio recognition method suitable under noise circumstance
US11373667B2 (en) Real-time single-channel speech enhancement in noisy and time-varying environments
CN108172231B (en) A Kalman Filter-Based Reverberation Method and System
CN112735460B (en) Beam forming method and system based on time-frequency masking value estimation
CN108154885A (en) It is a kind of to use QR-RLS algorithms to multicenter voice signal dereverberation method
CN112581973A (en) Voice enhancement method and system
CN113362846B (en) A Speech Enhancement Method Based on Generalized Sidelobe Cancellation Structure
CN111081267A (en) Multi-channel far-field speech enhancement method
CN106653043B (en) Adaptive Beamforming Method for Reducing Speech Distortion
CN111814515A (en) Active noise cancellation method based on improved variable-step LMS adaptation
Yang et al. A noise reduction method based on LMS adaptive filter of audio signals
CN110534127A (en) Applied to the microphone array voice enhancement method and device in indoor environment
CN112992173B (en) Signal separation and denoising method based on improved BCA blind source separation
Kothapally et al. Monaural speech dereverberation using deformable convolutional networks
CN110970044B (en) A speech enhancement method for speech recognition
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN113658605B (en) Speech enhancement method based on deep learning assisted RLS filtering processing
CN114242095A (en) Neural network noise reduction system and method based on OMLSA framework adopting harmonic structure
CN113066483A (en) A Generative Adversarial Network Speech Enhancement Method Based on Sparse Continuity Constraints
CN111933169B (en) Voice noise reduction method for secondarily utilizing voice existence probability
CN114038475A (en) A Single-Channel Speech Enhancement System Based on Spectral Compensation
CN107393547A (en) Subband spectrum subtracts the double microarray sound enhancement methods offset with generalized sidelobe
CN113851141A (en) Novel method and device for noise suppression by microphone array
Sasaoka et al. Speech enhancement based on adaptive filter with variable step size for wideband and periodic noise

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210319