CN106504763A - Microphone-array multi-target speech enhancement method based on blind source separation and spectral subtraction - Google Patents
- Publication number: CN106504763A (application CN201611191478.2A)
- Authority
- CN
- China
- Legal status: Pending (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention discloses a microphone-array multi-target speech enhancement method based on blind source separation and spectral subtraction, comprising the following steps: collecting multi-channel multi-target signals through a microphone array; band-pass filtering each collected single-channel signal to suppress non-speech noise and interference, followed by pre-emphasis; windowing and framing the speech to obtain frame signals, converting each frame to the frequency domain with the short-time Fourier transform, and extracting the magnitude spectrum and phase spectrum of each frame; detecting the start and end endpoints of the speech signal and estimating the noise power spectrum; reducing the background noise of the speech frames by spectral subtraction; combining the spectral-subtraction output with the phase spectrum and applying the inverse short-time Fourier transform to obtain the time-domain speech signal; and finally performing blind source separation to obtain each target signal. The method is simple to implement, has low resource requirements and low computational complexity, and achieves multi-target signal enhancement.
Description
Technical Field
The invention belongs to the fields of signal processing and computer speech signal processing, and in particular relates to a speech enhancement method based on a microphone array.
Background Art
The goal of speech enhancement is to extract the original speech, as pure as possible, from a noisy speech signal: to suppress background noise, improve speech quality, and improve the listener's comfort so that the listener does not tire. It plays an increasingly important role in mitigating noise pollution, improving speech quality, and improving speech intelligibility. Speech enhancement is a problem that urgently needs solving now that speech signal processing has reached the practical stage. Resistance to noise interference is an important factor in raising recognition rates in speech recognition. As speech recognition applications expand and enter practical use, more effective speech enhancement techniques are urgently needed to strengthen recognition features and make speech easy to recognize. A speech signal is a complex nonlinear signal; separating the desired speech from various mixtures, especially from co-channel speech interference, is a hard digital signal processing problem. No algorithm can filter out the noise completely, and it is difficult to maintain high subjective and objective evaluation performance in the presence of every kind of noise.
A typical workflow of microphone-array speech enhancement is shown in Figure 1 and mainly comprises the following steps:
1) Design a microphone array structure that meets the requirements.
2) Use a multi-channel speech acquisition system to collect multi-channel speech signals.
3) Apply preprocessing operations to the collected multi-channel speech signals, such as general preprocessing, voice activity detection, channel delay estimation, and target direction estimation.
4) Apply an array speech enhancement algorithm to obtain a relatively clean speech signal.
In step 1), designing an appropriate microphone array structure is very important.
Microphone array topologies can be divided into one-dimensional linear arrays (including equispaced, nested, and non-equispaced arrays), two-dimensional planar arrays (including uniform and non-uniform circular arrays and square arrays), and three-dimensional arrays. In practice, uniform linear arrays, nested linear arrays, and uniform planar arrays are the most common. Research shows that the array topology strongly influences a microphone-array speech system, and its design is closely tied to the choice of multi-channel signal model.
According to the distance between the sound source and the array, sound signal models divide into far-field and near-field models. The difference is: the far-field model uses a plane-wave model, ignoring the amplitude differences among the channels' received signals; the source has a single incidence angle relative to the array, and the delays between array elements are linearly related. The near-field model uses a spherical wavefront: it accounts for amplitude differences between received signals, each array element has its own incidence angle, and the inter-element delays have no simple relationship. There is no absolute standard dividing near field from far field; it is generally accepted that a source is in the far field when its distance to the array center is much greater than the signal wavelength, and in the near field otherwise.
Generally, a microphone array can be regarded as a spatial sampling device. As with sampling in time, the array's sampling frequency must be high enough to avoid spatial ambiguity and spatial aliasing. For an equispaced linear array, the spatial sampling rate is defined as U_s = 1/d; that is, the spatial sampling frequency U_s is determined by the microphone spacing d. Considering that adjacent samples of the same signal differ by a phase shift, the normalized spatial frequency is defined as U = (d/λ)·sinΦ, where λ is the wavelength and Φ the incidence angle. To avoid spatial aliasing, the normalized frequency must satisfy |U| ≤ 1/2. Since the incidence angle ranges over −90° ≤ Φ ≤ 90°, the spacing between adjacent microphones must satisfy d ≤ λ_min/2.
The spatial sampling theorem above relates the microphone spacing, the signal frequency, and the direction of arrival (incidence angle Φ). If it is not satisfied, spatial aliasing occurs.
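The spacing bound can be checked numerically. A minimal sketch (the function name and the speech-band cutoff are illustrative, not from the patent): for speech band-limited to 3400 Hz, the half-wavelength rule d ≤ λ_min/2 gives about 5 cm.

```python
def max_mic_spacing(f_max_hz: float, c: float = 343.0) -> float:
    """Largest alias-free spacing d <= lambda_min / 2 for an equispaced line array (metres)."""
    wavelength_min = c / f_max_hz   # shortest wavelength in the band
    return wavelength_min / 2.0

# Speech band-limited to 3.4 kHz (the pre-filter's upper cutoff):
d = max_mic_spacing(3400.0)   # roughly 0.05 m
```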
For a uniform linear microphone array, let r_m be the straight-line distance from the sound source to the m-th microphone. The discrete signal output by the m-th microphone can then be written x_m[n] = s[n − Δn_m] + η_m[n], where s[n] is the source signal, Δn_m is the delay in samples between the m-th microphone's received signal and the source signal, and η_m[n] is the noise received by the m-th microphone. Let Δτ_m be the corresponding time delay; then Δn_m = f_s·Δτ_m with Δτ_m = r_m/c, where f_s is the sampling frequency and c the propagation speed of sound in space. From this, the array signal matrix output by the microphone array can be established:
x_1[n] = s[n − Δn_1] + η_1[n]
x_2[n] = s[n − Δn_2] + η_2[n]
…
x_N[n] = s[n − Δn_N] + η_N[n]
where N is the number of array elements.
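The array signal matrix above can be simulated directly. The helper below is an illustrative sketch with names of our own choosing: it builds delayed, noisy copies of a source for integer sample delays Δn_m.

```python
import numpy as np

def array_signals(s, sample_delays, noise_std=0.0, seed=0):
    """Rows are x_m[n] = s[n - dn_m] + eta_m[n] (integer delays, zero-padded at the start)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((len(sample_delays), len(s)))
    for m, dn in enumerate(sample_delays):
        X[m, dn:] = s[: len(s) - dn]                      # delayed copy of the source
        X[m] += noise_std * rng.standard_normal(len(s))   # additive sensor noise eta_m
    return X

s = np.arange(6.0)
X = array_signals(s, sample_delays=[0, 2])   # second mic lags by 2 samples
```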
In step 3), individual preprocessing operations may be added or omitted depending on the enhancement method.
In preprocessing, pre-emphasis and pre-filtering are dictated by the characteristics of the speech signal. Pre-filtering has two purposes: (1) to suppress all frequency components of the input above f_s/2 and so prevent aliasing; (2) to suppress 50 Hz power-line interference. The pre-filter must therefore be a band-pass filter; with upper and lower cutoff frequencies f_H and f_L, typical values are f_H = 3400 Hz and f_L = 60–100 Hz at a sampling frequency f_s = 16000 Hz.
Because the average power spectrum of speech is shaped by glottal excitation and lip and nostril radiation, it rolls off at about 6 dB per octave above roughly 800 Hz; when computing the spectrum, the higher-frequency components are therefore smaller and harder to resolve than the low-frequency ones, so pre-emphasis is applied during preprocessing. Its purpose is to boost the high-frequency part and flatten the signal spectrum so that the spectrum can be computed with the same signal-to-noise ratio over the whole band, which aids spectral analysis or vocal-tract parameter analysis. Pre-emphasis is realized by a digital filter that boosts high frequencies, generally a first-order filter; based on its operation, the corresponding emphasis is s′(n) = s(n) − α·s(n+1). To restore the original signal, the pre-emphasized signal spectrum must be de-emphasized, i.e. s″(n) = s′(n) + β·s′(n+1), where s(n) is the source signal, s′(n) the emphasized signal, and s″(n) the de-emphasized signal; α and β are the emphasis factors, generally taken around 0.8–0.95.
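A minimal sketch of the pre-emphasis/de-emphasis pair. Note the hedge: this uses the common convention that references the previous sample, y[n] = x[n] − α·x[n−1], with its exact recursive inverse; the indexing convention and the value of α here are our own choices, not prescribed by the patent.

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1]: first-order high-frequency boost."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= alpha * x[:-1]
    return y

def de_emphasis(y, alpha=0.95):
    """Exact inverse of pre_emphasis: z[n] = y[n] + alpha * z[n-1]."""
    z = np.empty(len(y))
    acc = 0.0
    for n, v in enumerate(y):
        acc = v + alpha * acc
        z[n] = acc
    return z

x = np.sin(np.linspace(0.0, 3.0, 64))
restored = de_emphasis(pre_emphasis(x))   # round trip recovers the input
```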
Because speech is a non-stationary, time-varying signal whose production is closely tied to the motion of the articulators, and the articulators change state much more slowly than the sound vibrates, speech can be considered short-time stationary. Studies find that over spans of 5–50 ms the speech spectrum and some physical parameters remain essentially unchanged. Methods and theory for stationary processes can therefore be brought into short-time speech processing by dividing the signal into many short segments, each called an analysis frame; processing one frame is then equivalent to processing a sustained signal with fixed characteristics. Frames may be contiguous or overlapping, with frame lengths typically 10–30 ms. The overlap between the previous and next frame is called the frame shift, and the ratio of frame shift to frame length is usually taken as 0 to 1/2. Each extracted frame is windowed, i.e. multiplied by a window function w(n), forming the windowed speech. Windowing mainly reduces the spectral leakage introduced by framing: framing truncates the speech abruptly, which is equivalent to a periodic convolution of the speech spectrum with the spectrum of a rectangular window. Because the rectangular window's sidelobes are high, the signal spectrum "smears", i.e. leaks. A Hamming window can be used instead: its lower sidelobes effectively suppress leakage, it has smoother low-pass characteristics, and it yields a smoother spectrum.
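The framing and Hamming windowing described above can be sketched as follows (frame length and hop are in samples; the function name and the 25 ms / 10 ms choice at 16 kHz are illustrative):

```python
import numpy as np

def frame_and_window(x, frame_len=400, hop=160):
    """Split x into overlapping frames (hop = frame shift) and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # taper each frame to curb spectral leakage

# 25 ms frames with a 10 ms shift at 16 kHz:
F = frame_and_window(np.ones(16000), frame_len=400, hop=160)
```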
Estimating the inter-element time delays plays a very important role in the whole microphone-array speech enhancement algorithm: together with the signal frequency it determines the beam's directivity, and it is used to estimate the direction of the sound source. Delay-estimation accuracy directly affects the performance of the speech processing system. Because the array samples the speech signal in space, each microphone's received signal is delayed relative to the reference microphone. To steer the maximum of the beamformer output toward the target source, keeping the desired speech signals received by the microphones synchronized is the key means of solving this problem. Typical delay-estimation methods include generalized cross-correlation (GCC), adaptive-filtering-based estimation, adaptive eigendecomposition, and higher-order cumulant methods, of which GCC is the most widely used. Suppose a pair of microphones receives x1(t) = s(t) + η1 and x2(t) = s(t − D) + η2, where s(t) is the source signal, x1(t) and x2(t) are the two microphones' signals, D is the sound propagation delay between the microphones, and η1, η2 are additive background noise. Assuming s(t), η1, and η2 are mutually uncorrelated, and ignoring signal amplitude attenuation, the generalized cross-correlation function R12(τ) between x1(t) and x2(t) is:
R12(τ) = (1/2π) ∫ ψ12(ω) X1(ω) X2*(ω) e^{jωτ} dω, where X1(ω) and X2(ω) are the Fourier transforms of x1(t) and x2(t), respectively, and ψ12(ω) is a generalized cross-correlation weighting function. Choosing a weighting function suited to the situation gives R12(τ) a sharp peak, whose location is the delay between the two microphones.
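A runnable sketch of GCC using the PHAT weighting ψ12(ω) = 1/|X1(ω)X2*(ω)|, one common choice; the function and its handling of lags are illustrative, not the patent's prescription:

```python
import numpy as np

def gcc_phat(x1, x2):
    """Integer-sample delay of x2 relative to x1 via generalized cross-correlation (PHAT)."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting whitens the cross-spectrum
    r = np.fft.irfft(cross, n=n)
    lags = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))  # reorder so lag 0 is centred
    return -(int(np.argmax(np.abs(lags))) - n // 2)

rng = np.random.default_rng(1)
s = rng.standard_normal(256)
delayed = np.concatenate((np.zeros(5), s))    # x2 = x1 delayed by 5 samples
```

Here `gcc_phat(s, delayed)` recovers the 5-sample delay.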
Voice activity detection (VAD), also called speech detection or speech endpoint detection, accurately determines the start and end points of the input speech to ensure good performance of the speech processing system. Speech and noise are processed differently, so if a frame cannot be judged as noisy speech or pure noise, it cannot be processed appropriately. In a speech enhancement system, to gather more information about the background noise, endpoint detection focuses on accurately detecting the silent segments; learning speech characteristics and accumulating noise-source estimates both depend on accurate endpoint detection. VAD is usually performed frame by frame, with frame lengths from 10 to 30 ms. The general method is: extract one or more contrast feature parameters from the input signal and compare them against one or more thresholds. Exceeding the threshold indicates a voiced segment; otherwise the segment is silent.
Speech detection generally has two steps:
Step 1: based on features of the speech signal. Parameters such as energy, zero-crossing rate, entropy, and pitch, along with their derivatives, are used to classify speech/non-speech segments in the signal stream.
Step 2: once a speech signal is detected in the stream, decide whether the point is the start or the end of the speech. In speech systems, varying backgrounds and natural conversational patterns make mid-sentence pauses (non-speech) likely, especially the silent gap before a plosive initial, so this start/end decision is particularly important.
Current speech endpoint detection methods fall roughly into two categories:
The first is HMM-based endpoint detection of speech in noisy environments, which requires the background noise to be stationary and the signal-to-noise ratio to be high.
The second is detection based on the signal's short-time energy: an energy threshold is derived from statistics of the background-noise energy and used to locate the start of the speech signal.
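The energy-threshold idea in the second method can be sketched as follows; the assumption that the leading frames are noise-only and the factor of 3 are illustrative choices, not the patent's:

```python
import numpy as np

def energy_vad(frames, n_noise_frames=5, factor=3.0):
    """True where a frame's short-time energy exceeds a noise-derived threshold."""
    energy = np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)
    threshold = factor * energy[:n_noise_frames].mean()   # noise statistics -> threshold
    return energy > threshold

frames = np.vstack([np.full((5, 160), 0.01),   # quiet, noise-only frames
                    np.ones((3, 160))])        # louder speech frames
flags = energy_vad(frames)
```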
In step 4), a speech enhancement algorithm is used to obtain a relatively clean speech signal.
Speech enhancement techniques divide mainly into single-channel methods and multi-channel (microphone-array) methods. Single-channel methods are numerous, mostly pairing noise-cancellation methods with the characteristics of the speech signal to build targeted algorithms; the most theoretically mature, simplest, and most effective is spectral subtraction (SS). A single pickup sensor is constrained by venue, distance, and application, so pickup quality suffers greatly and subsequent enhancement becomes difficult.
The basic principle of spectral subtraction is: in the frequency domain, subtract the noise power spectrum from the noisy-speech power spectrum to estimate the speech power spectrum; its square root gives the speech magnitude estimate, whose phase is restored before an inverse Fourier transform recovers the time-domain signal. Because the human ear is insensitive to phase, the phase of the noisy speech is used for restoration. Since speech is short-time stationary, it is treated as a stationary random signal in short-time spectral amplitude estimation.
Let s(n), η(n), and x(n) denote speech, noise, and noisy speech, with short-time spectra S(ω), Γ(ω), and X(ω), respectively. Assume s(n) and η(n) are uncorrelated and the noise is additive, giving the additive signal model x(n) = s(n) + η(n). Denoting the windowed signals by x_w(n), s_w(n), η_w(n), we have x_w(n) = s_w(n) + η_w(n); its Fourier transform gives X_w(ω) = S_w(ω) + Γ_w(ω), so the power spectrum satisfies |X_w(ω)|² = |S_w(ω)|² + |Γ_w(ω)|² + S_w(ω)Γ_w*(ω) + S_w*(ω)Γ_w(ω). Only |X_w(ω)|² is estimated from the observed data; the remaining terms must be approximated by their statistical means. Since s(n) and η(n) are independent, the cross-power terms have zero mean, so the estimate of the original speech is |Ŝ_w(ω)|² = |X_w(ω)|² − E[|Γ_w(ω)|²]. The estimate |Ŝ_w(ω)|² cannot be guaranteed non-negative, because the noise estimate carries error: when the estimated average noise power exceeds the noisy-speech power of some frame, that frame's estimate becomes negative. Such negative values can be made positive by flipping their sign, or simply set to zero. Restoring the noisy phase to |Ŝ_w(ω)| and applying the inverse short-time Fourier transform (ISTFT) yields the time-domain estimate of the speech signal.
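A single-frame sketch of this power-spectral subtraction, reusing the noisy phase and clamping negative power estimates to zero (the function name is our own; a full system would also estimate the noise PSD from noise frames):

```python
import numpy as np

def spectral_subtract_frame(frame, noise_psd):
    """Subtract a noise power spectrum from one frame, restore the noisy phase, invert."""
    spec = np.fft.rfft(frame)
    clean_power = np.maximum(np.abs(spec) ** 2 - noise_psd, 0.0)  # clamp negatives to zero
    magnitude = np.sqrt(clean_power)
    return np.fft.irfft(magnitude * np.exp(1j * np.angle(spec)), n=len(frame))

frame = np.cos(2.0 * np.pi * np.arange(64) / 8.0)
identity = spectral_subtract_frame(frame, noise_psd=0.0)   # zero noise leaves frame intact
```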
Current microphone-array speech enhancement algorithms mainly include beamforming, subspace decomposition, and blind source separation. Blind source separation (BSS) is the process of recovering source signals from the observed signals alone, without knowledge of the sources or of the mixing process. BSS thus does not depend on prior conditions of the current event and can perform speech enhancement with fewer microphones; the central problem the algorithm solves is separating each speaker's speech when multiple speakers interfere and overlap, thereby enhancing each target's speech.
Independent component analysis (ICA) is one of the effective methods for blind signal separation and belongs to linear instantaneous-mixture blind signal processing. It does not rely on detailed knowledge of the source signal types or on precise identification of the transmission system's characteristics; it is an effective redundancy-removal technique. Different cost functions yield different ICA algorithms, such as information maximization (Infomax), FastICA, maximum entropy (ME) and minimum mutual information (MMI), and maximum likelihood (ML). The basic principle: the observed signals are regarded as the target signals mixed by a linear transformation; to recover the targets, an inverse linear transformation must be found that unmixes the observations, achieving source separation.
In the noise-free case, let X = [x1(t) x2(t) … xN(t)]′ denote a set of observations received by the microphone array, where t is the time or sample index and N the number of microphones. Assume X is a linear mixture of independent components S, with A an unknown full-rank matrix; the vector form of the signal model is then X = AS.
In the noisy case, assuming additive noise, the signal model becomes X = AS + Γ, where Γ = [η1 η2 … ηN]′ is the noise vector. Writing Γ = AΓ0 (i.e. Γ0 = A⁻¹Γ) transforms this to X = A(S + Γ0), so the noisy model is still the basic ICA model, only with the independent components changed from S to S + Γ0. Under the basic ICA signal model, letting W denote the separation matrix to be found and Y the separated signal matrix, we have Y = WX = WAS. The ultimate goal of ICA is to find an optimal, or near-optimal, separation matrix W such that the signals in Y are mutually independent and approximate the source signals as closely as possible.
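As an illustrative sketch of the ICA idea, a compact symmetric FastICA with a tanh nonlinearity; this specific algorithm, its whitening step, and its parameters are our own choices rather than the patent's prescription:

```python
import numpy as np

def fast_ica(X, n_iter=200, seed=0):
    """Recover Y ~ S (up to permutation and scaling) from X = A S, A square, full rank."""
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))            # whiten: decorrelate to unit variance
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    n, T = Z.shape
    W = np.random.default_rng(seed).standard_normal((n, n))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)                      # nonlinearity g(u) = tanh(u)
        W = G @ Z.T / T - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W)             # symmetric decorrelation (W W^T)^(-1/2) W
        W = U @ Vt
    return W @ Z

t = np.arange(2000)
S = np.vstack((np.sin(0.05 * t), np.sign(np.sin(0.11 * t))))   # two independent sources
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S                     # unknown mixing A
Y = fast_ica(X)
```

Each row of Y then matches one source up to sign and scale.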
At present, microphone-array speech enhancement targets a single source, which limits the effective pickup of an array device; such traditional single-target enhancement cannot meet the needs of practical applications.
Summary of the Invention
To solve the current technical problem of multi-target enhancement of array speech signals, the present invention proposes a microphone-array multi-target speech enhancement method based on blind source separation and spectral subtraction.
The microphone-array multi-target speech enhancement method based on blind source separation and spectral subtraction of the present invention comprises the following steps:
Step 1: collect the noisy speech signals through a two-dimensional planar microphone array with at least 4 microphones, obtaining the acquired signal of each channel.
Step 2: for each channel's acquired signal, perform steps 201–205:
Step 201: band-pass filter the acquired signal to suppress non-speech noise and interference; then apply pre-emphasis, framing, and windowing to the band-pass-filtered signal to obtain frame signals;
then convert each frame to the frequency domain, i.e. apply the short-time Fourier transform to each frame signal, and compute each frame's power spectrum; at the same time, compute and retain each frame's phase spectrum for phase restoration during spectral subtraction;
Step 203: perform speech detection on each frame, judge whether the current frame is a speech frame or a noise frame, and estimate the noise power spectrum from the noise frames;
Step 204: remove the noise power spectrum from each speech frame's power spectrum by spectral subtraction, obtaining each frame's speech power spectrum estimate;
步骤205:对语音功率谱估计开方,并基于对应帧的相位谱进行相位恢复后,再进行短时傅立叶反变换,得到语音帧的时域估计信号;Step 205: Estimate the square root of the speech power spectrum, perform phase recovery based on the phase spectrum of the corresponding frame, and then perform inverse short-time Fourier transform to obtain the time domain estimation signal of the speech frame;
In step 2, signal preprocessing is applied to the captured signal of each single channel, dividing it into many short (noisy) speech segments, i.e., frame signals; spectral subtraction is then applied to each frame signal to reduce the background noise of the speech frames.
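For illustration, the framing and short-time Fourier transform of steps 201 to 205 can be sketched in Python as follows. This is a minimal sketch: the frame length of 256 samples, 50% overlap, pre-emphasis coefficient 0.97, and the function name are assumed example values, not fixed by the invention.

```python
import numpy as np

def preprocess(x, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasise, frame with overlap, Hamming-window, and return
    per-frame power and phase spectra (illustrative sketch)."""
    # pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    win = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)   # short-time Fourier transform
    power = np.abs(spec) ** 2            # power spectrum of each frame
    phase = np.angle(spec)               # phase spectrum kept for later recovery
    return power, phase

x = np.random.randn(4000)                # a stand-in for one channel's signal
power, phase = preprocess(x)
```

The retained `phase` array is what step 205 uses to rebuild the complex spectrum before the inverse transform.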
Step 3: Apply blind source separation to the time-domain estimates of the speech frames of all channels to separate the sources and obtain the target signals of the different sources;
Step 4: Apply de-emphasis, de-windowing, and frame reassembly to the target signals of each source to obtain the target speech signals of the different sources.
In summary, by adopting the above technical solution, the beneficial effects of the present invention are: (1) the environmental background noise is handled with a traditional single-channel speech enhancement method, so the algorithm is simple and undemanding in resources; (2) spatial filtering by array signal processing is no longer relied upon and no wideband beamforming algorithm needs to be considered, which reduces the structural complexity of the algorithm; (3) the blind source separation algorithm enhances all target signals at once, instead of enhancing a single target or taking turns enhancing one target at a time.
Description of drawings
FIG. 1 is a schematic diagram of a traditional speech enhancement system.
FIG. 2 is a schematic diagram of a system implementing a specific embodiment of the present invention.
FIG. 3 is a flowchart of the speech detection.
FIG. 4 is a flowchart of the spectral-subtraction single-channel speech enhancement method.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Referring to FIG. 2, the multi-target speech enhancement method of the present invention first preprocesses each single-channel signal (speech signal) captured by the two-dimensional planar microphone array, dividing each single-channel speech signal into many short speech segments to obtain frame signals for the subsequent voice activity detection and spectral subtraction. The signal preprocessing comprises band-pass filtering, pre-emphasis, overlapped framing, and Hamming windowing.
Voice activity detection and spectral subtraction are performed on the frame signals of each channel separately; blind source separation is then applied to all channels of the same speech frame to obtain the target signals of the different sources. Finally, the inverse of the signal preprocessing is applied: the target signals of each source are de-emphasized and de-windowed (Hamming), and the frames are reassembled into the target speech signals, achieving enhancement of multiple target speech signals.
According to the characteristics of the sound sources and the noise field in an indoor environment, the multi-channel noisy speech signal of the actual environment is modeled with a diffuse noise field model and a near-field source model. The speech signals in the space are captured by an 8×8 planar array of 64 microphones.
Let X = [x1(t) x2(t) … xj(t) … xN(t)]′ denote the noisy speech signals output by the channels, where j is the microphone channel index.
The array signal (frame signal) obtained after signal preprocessing of the noisy speech signal of each channel is Xpw = [x1pw(n) x2pw(n) … xjpw(n) … xNpw(n)]′, where n = 1, 2, …, L, L is the frame length, and w is the frame number.
A short-time Fourier transform of the frame signal Xpw yields the magnitude spectrum |Xpw(ω)| and the phase spectrum Φpw(ω), where ω is the frequency sampling point, obtained by uniformly sampling the angular frequency from 0 to 2π at N equally spaced points. Thus:
|Xpw| = [|X1pw(ω)| |X2pw(ω)| … |Xjpw(ω)| … |XNpw(ω)|]′
Using |Xpw| = [|X1pw(ω)| |X2pw(ω)| … |Xjpw(ω)| … |XNpw(ω)|]′, the speech start and end endpoints are detected according to the flowchart shown in FIG. 3, i.e., the current frame is judged to be either a noise frame or a speech frame, and the result is used for spectral-subtraction denoising. The detection of the speech start endpoint (start frame) and end endpoint (end frame) proceeds as follows:
The speech energy of each frame is computed as Mw = Σω |Xpw(ω)|², where N is the frame length, w is the frame number, 1 ≤ w ≤ L, L is the number of frames, and ω runs over the points of each frame;
The threshold T is initialized: its initial value is set from statistics of the background noise energy.
Each frame is then classified against the threshold T to judge whether the current frame is a noise frame or a speech frame, and T is updated from the most recent k noise frames:
a. Compute the speech energy Mw of the current frame; if Mw is greater than T, the current frame is judged to be a speech frame, otherwise a noise frame;
b. If the current frame is a noise frame, update the threshold T from the most recent k noise frames (k is an empirical value, usually greater than or equal to 10):
b1: Compute the average speech energy EMN and the maximum and minimum energies EMAX and EMIN of the most recent k noise frames;
b2: Obtain the updated threshold T = min[a × (EMAX − EMIN) + EMN, b × EMN], where 0 < a < 1 and 1 < b < 10;
c. If the current frame is a speech frame, check whether all frames have been processed; if so, endpoint detection is complete; otherwise, repeat steps a to c for the next frame.
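Steps a to c can be sketched as follows. This is an illustrative sketch: the assumption that the first k frames are noise, the rule T = b × EMN for the initial threshold, and the parameter values a = 0.5, b = 2 are example choices, not values mandated by the invention.

```python
import numpy as np

def vad_labels(frame_powers, k=10, a=0.5, b=2.0):
    """Energy-based speech/noise decision with adaptive threshold T.
    frame_powers: (frames, bins) power spectra; returns 1=speech, 0=noise."""
    energies = frame_powers.sum(axis=1)      # M_w: energy of each frame
    noise_hist = list(energies[:k])          # assume the first k frames are noise
    T = b * float(np.mean(noise_hist))       # initial T from background-noise statistics
    labels = []
    for M in energies:
        if M > T:
            labels.append(1)                 # step a: speech frame
        else:
            labels.append(0)                 # noise frame
            noise_hist.append(M)             # step b: refresh the noise history
            recent = noise_hist[-k:]
            EMN = float(np.mean(recent))
            EMAX, EMIN = max(recent), min(recent)
            T = min(a * (EMAX - EMIN) + EMN, b * EMN)  # step b2 update rule
    return np.array(labels)

# low-energy "noise" frames followed by high-energy "speech" frames
powers = np.concatenate([np.ones((12, 4)), 25 * np.ones((5, 4))])
labels = vad_labels(powers)
```

The threshold only moves while noise frames arrive, so a long speech burst cannot drag T upward.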
Furthermore, the short-time zero-crossing rate can be used to verify the speech-frame/noise-frame decisions, so as to prevent misjudgment.
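The zero-crossing-rate check can be sketched like this (an illustrative definition; the invention does not fix a particular formula for it):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ; voiced speech
    tends to have a low rate, broadband noise a higher one."""
    signs = np.sign(frame)
    signs[signs == 0] = 1    # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

t = np.arange(256)
fast = zero_crossing_rate(np.sin(2 * np.pi * t / 8))    # high-frequency tone
slow = zero_crossing_rate(np.sin(2 * np.pi * t / 64))   # low-frequency tone
```

Comparing the rate of a candidate frame against typical noise-frame rates gives the secondary check described above.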
Referring to FIG. 4, the noise power spectrum is estimated from all detected noise frames, and the estimated noise is then removed from each speech frame by spectral subtraction: the currently estimated noise power spectrum is subtracted from the power spectrum of the speech frame to obtain the estimated speech power spectrum; the square root of the estimate is taken, the phase is recovered from the phase spectrum of the speech frame, and an inverse short-time Fourier transform yields the time-domain estimate of the speech frame, i.e., the enhanced single-channel speech signal.
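One frame of this spectral-subtraction step can be sketched as follows. The spectral floor `floor`, used to keep the subtracted power non-negative, is a common safeguard assumed here rather than something specified in the text.

```python
import numpy as np

def subtract_noise(frame, noise_power, floor=1e-3):
    """Spectral subtraction of one windowed time-domain frame: power spectrum
    minus estimated noise power, square root, phase recovery, inverse STFT."""
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    phase = np.angle(spec)                         # phase kept for recovery
    clean_power = np.maximum(power - noise_power, floor * power)
    clean_mag = np.sqrt(clean_power)               # magnitude = sqrt of power estimate
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))

frame = np.random.randn(256)
out = subtract_noise(frame, noise_power=np.zeros(129))  # zero noise: identity
```

With a zero noise estimate the frame is reconstructed exactly, which is a convenient sanity check of the sqrt/phase/inverse-transform chain.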
Once the above is complete, blind source separation is carried out with natural-gradient ICA. The specific procedure is:
(1) If the mean of the current observed signal X (the set of single-channel enhanced speech sequences) is not zero, its mean is first subtracted from X;
(2) A matrix B is chosen such that the covariance matrix E{VVᵀ} equals the identity matrix I, where V = BX; the components of the vector V are then uncorrelated and have unit variance;
(3) Whitening based on singular value decomposition: first the covariance Rx = E{XXᵀ} of X is estimated; Rx is a real Hermitian matrix. Then the singular value decomposition Rx = UΣUᵀ is computed, where the columns of U = [u1, u2, …, un] are the left singular vectors of Rx. The purpose of whitening is to weaken the correlation between the mixed speech signals;
(4) Since σ1 ≥ σ2 ≥ … ≥ σm > 0 and σm+1 = … = σn = 0 (m ≤ n), the number of source signals is estimated to be m;
(5) Finally, an orthogonal transform is performed with U = [u1, u2, …, un] and Um = [u1, u2, …, um], giving V = BX.
The recovered signal Y is obtained from Y = WX = WAS = PS, where P = WA may be called either the performance matrix or the convergence matrix, W is the separation matrix of independent component analysis (ICA), A is the ICA mixing matrix, and S is the source signal.
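Steps (1) to (5) plus the separation itself can be sketched as follows. This is an illustrative, square, noise-free sketch: the nonlinearity f(y) = tanh(y), the step size, the iteration count, and the assumption of as many channels as sources are all example choices, not values fixed by the invention.

```python
import numpy as np

def natural_gradient_ica(X, n_iter=500, mu=0.1):
    """Blind source separation sketch: zero-mean, SVD whitening, then the
    natural-gradient ICA update W <- W + mu*(I - f(Y)Y^T/T)W with f = tanh."""
    X = X - X.mean(axis=1, keepdims=True)    # step (1): subtract the mean
    Rx = np.cov(X)                           # R_x = E{X X^T}
    U, s, _ = np.linalg.svd(Rx)              # step (3): SVD of the covariance
    B = np.diag(1.0 / np.sqrt(s)) @ U.T      # whitening matrix, step (2)
    V = B @ X                                # whitened: E{V V^T} = I
    m, T = V.shape
    W = np.eye(m)                            # separation matrix
    for _ in range(n_iter):
        Y = W @ V
        G = np.eye(m) - np.tanh(Y) @ Y.T / T  # natural-gradient direction
        W = W + mu * G @ W
    return W @ V                             # recovered Y (up to scale and order)

np.random.seed(0)
S = np.random.laplace(size=(2, 5000))        # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.5, 1.0]])       # mixing matrix
Y = natural_gradient_ica(A @ S)
```

Because ICA leaves scale and ordering ambiguous, a recovered component is matched to a source by the largest absolute correlation rather than by index.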
The above are only specific embodiments of the present invention. Any feature disclosed in this specification may, unless otherwise stated, be replaced by an alternative feature that is equivalent or serves a similar purpose; all of the disclosed features, or all of the steps of any method or process, may be combined in any way, except for mutually exclusive features and/or steps.
Claims (3)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510967234 | 2015-12-22 | ||
CN2015109672348 | 2015-12-22 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106504763A true CN106504763A (en) | 2017-03-15 |
Family
ID=58333455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611191478.2A Pending CN106504763A (en) | 2015-12-22 | 2016-12-21 | Multi-target Speech Enhancement Method Based on Microphone Array Based on Blind Source Separation and Spectral Subtraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106504763A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102750956A (en) * | 2012-06-18 | 2012-10-24 | 歌尔声学股份有限公司 | Method and device for removing reverberation of single channel voice |
CN202749088U (en) * | 2012-08-08 | 2013-02-20 | 滨州学院 | Voice reinforcing system using blind source separation algorithm |
CN103854660A (en) * | 2014-02-24 | 2014-06-11 | 中国电子科技集团公司第二十八研究所 | Four-microphone voice enhancement method based on independent component analysis |
US20150078571A1 (en) * | 2013-09-17 | 2015-03-19 | Lukasz Kurylo | Adaptive phase difference based noise reduction for automatic speech recognition (asr) |
CN104935546A (en) * | 2015-06-18 | 2015-09-23 | 河海大学 | A Blind Separation Method of MIMO-OFDM Signals to Improve the Convergence Speed of Natural Gradient Algorithm |
2016-12-21 CN CN201611191478.2A patent/CN106504763A/en active Pending
Non-Patent Citations (4)
Title |
---|
LI Yunhua: "Single-channel speech signal enhancement based on blind source separation", Computer Simulation *
YANG Zhen et al.: "A real-time speech recognition simulation system based on the SB card", Journal of Nanjing Institute of Posts and Telecommunications *
ZHI Zhenhua: "Research on blind speech separation algorithms", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology *
CHEN Weiguo: "Theory and application of real-time speech signal processing systems", China Doctoral Dissertations Full-text Database (Electronic Journal), Information Science and Technology *
Cited By (70)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107102296A (en) * | 2017-04-27 | 2017-08-29 | 大连理工大学 | A Sound Source Localization System Based on Distributed Microphone Array |
CN107102296B (en) * | 2017-04-27 | 2020-04-14 | 大连理工大学 | A sound source localization system based on distributed microphone array |
CN107293305A (en) * | 2017-06-21 | 2017-10-24 | 惠州Tcl移动通信有限公司 | It is a kind of to improve the method and its device of recording quality based on blind source separation algorithm |
US11308974B2 (en) | 2017-10-23 | 2022-04-19 | Iflytek Co., Ltd. | Target voice detection method and apparatus |
CN107785029A (en) * | 2017-10-23 | 2018-03-09 | 科大讯飞股份有限公司 | Target voice detection method and device |
CN107785029B (en) * | 2017-10-23 | 2021-01-29 | 科大讯飞股份有限公司 | Target voice detection method and device |
US11869481B2 (en) | 2017-11-30 | 2024-01-09 | Alibaba Group Holding Limited | Speech signal recognition method and device |
CN109859749A (en) * | 2017-11-30 | 2019-06-07 | 阿里巴巴集团控股有限公司 | A kind of voice signal recognition methods and device |
US11482237B2 (en) | 2017-12-01 | 2022-10-25 | Tencent Technology (Shenzhen) Company Limited | Method and terminal for reconstructing speech signal, and computer storage medium |
CN109887494A (en) * | 2017-12-01 | 2019-06-14 | 腾讯科技(深圳)有限公司 | The method and apparatus of reconstructed speech signal |
WO2019105238A1 (en) * | 2017-12-01 | 2019-06-06 | 腾讯科技(深圳)有限公司 | Method and terminal for speech signal reconstruction and computer storage medium |
CN108831500A (en) * | 2018-05-29 | 2018-11-16 | 平安科技(深圳)有限公司 | Sound enhancement method, device, computer equipment and storage medium |
CN108899052B (en) * | 2018-07-10 | 2020-12-01 | 南京邮电大学 | A Parkinson's Speech Enhancement Method Based on Multiband Spectral Subtraction |
CN108899052A (en) * | 2018-07-10 | 2018-11-27 | 南京邮电大学 | A kind of Parkinson's sound enhancement method based on mostly with spectrum-subtraction |
CN109671439A (en) * | 2018-12-19 | 2019-04-23 | 成都大学 | A kind of intelligence fruit-bearing forest bird pest prevention and treatment equipment and its birds localization method |
CN109671439B (en) * | 2018-12-19 | 2024-01-19 | 成都大学 | Intelligent fruit forest bird pest control equipment and bird positioning method thereof |
CN109884591A (en) * | 2019-02-25 | 2019-06-14 | 南京理工大学 | A sound signal enhancement method for multi-rotor UAV based on microphone array |
US11664042B2 (en) | 2019-03-06 | 2023-05-30 | Plantronics, Inc. | Voice signal enhancement for head-worn audio devices |
US11049509B2 (en) | 2019-03-06 | 2021-06-29 | Plantronics, Inc. | Voice signal enhancement for head-worn audio devices |
CN110060704A (en) * | 2019-03-26 | 2019-07-26 | 天津大学 | A kind of sound enhancement method of improved multiple target criterion study |
CN110111806A (en) * | 2019-03-26 | 2019-08-09 | 广东工业大学 | A kind of blind separating method of moving source signal aliasing |
CN110491410A (en) * | 2019-04-12 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Speech separating method, audio recognition method and relevant device |
CN110223708A (en) * | 2019-05-07 | 2019-09-10 | 平安科技(深圳)有限公司 | Sound enhancement method and relevant device based on speech processes |
CN110223708B (en) * | 2019-05-07 | 2023-05-30 | 平安科技(深圳)有限公司 | Speech enhancement method based on speech processing and related equipment |
CN111986692A (en) * | 2019-05-24 | 2020-11-24 | 腾讯科技(深圳)有限公司 | Sound source tracking and pickup method and device based on microphone array |
CN112289335A (en) * | 2019-07-24 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Voice signal processing method and device and pickup equipment |
CN112289335B (en) * | 2019-07-24 | 2024-11-12 | 阿里巴巴集团控股有限公司 | Voice signal processing method, device and sound pickup device |
CN110459236B (en) * | 2019-08-15 | 2021-11-30 | 北京小米移动软件有限公司 | Noise estimation method, apparatus and storage medium for audio signal |
CN110459234A (en) * | 2019-08-15 | 2019-11-15 | 苏州思必驰信息科技有限公司 | For vehicle-mounted audio recognition method and system |
CN110459236A (en) * | 2019-08-15 | 2019-11-15 | 北京小米移动软件有限公司 | Noise estimation method, device and the storage medium of audio signal |
CN110459234B (en) * | 2019-08-15 | 2022-03-22 | 思必驰科技股份有限公司 | Vehicle-mounted voice recognition method and system |
CN111128217A (en) * | 2019-12-31 | 2020-05-08 | 杭州爱莱达科技有限公司 | Distributed multi-channel voice coherent laser radar interception method and device |
CN111239680A (en) * | 2020-01-19 | 2020-06-05 | 西北工业大学太仓长三角研究院 | A DOA Estimation Method Based on Differential Array |
CN111239680B (en) * | 2020-01-19 | 2022-09-16 | 西北工业大学太仓长三角研究院 | A DOA Estimation Method Based on Differential Array |
CN113314137A (en) * | 2020-02-27 | 2021-08-27 | 东北大学秦皇岛分校 | Mixed signal separation method based on dynamic evolution particle swarm shielding EMD |
CN113314137B (en) * | 2020-02-27 | 2022-07-26 | 东北大学秦皇岛分校 | Mixed signal separation method based on dynamic evolution particle swarm shielding EMD |
CN111402917A (en) * | 2020-03-13 | 2020-07-10 | 北京松果电子有限公司 | Audio signal processing method and device and storage medium |
CN111627456A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Noise elimination method, device, equipment and readable storage medium |
CN113763982A (en) * | 2020-06-05 | 2021-12-07 | 阿里巴巴集团控股有限公司 | Audio processing method and device, electronic equipment and readable storage medium |
CN112309414A (en) * | 2020-07-21 | 2021-02-02 | 东莞市逸音电子科技有限公司 | Active noise reduction method based on audio coding and decoding, earphone and electronic equipment |
CN112309414B (en) * | 2020-07-21 | 2024-01-12 | 东莞市逸音电子科技有限公司 | Active noise reduction method based on audio encoding and decoding, earphone and electronic equipment |
CN112151036A (en) * | 2020-09-16 | 2020-12-29 | 科大讯飞(苏州)科技有限公司 | Anti-sound-crosstalk method, device and equipment based on multi-pickup scene |
CN112151036B (en) * | 2020-09-16 | 2021-07-30 | 科大讯飞(苏州)科技有限公司 | Anti-sound-crosstalk method, device and equipment based on multi-pickup scene |
CN112735464A (en) * | 2020-12-21 | 2021-04-30 | 招商局重庆交通科研设计院有限公司 | Tunnel emergency broadcast sound effect information detection method |
CN112666522B (en) * | 2020-12-24 | 2024-10-22 | 北京地平线信息技术有限公司 | Awakening word sound source positioning method and device |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
CN113077808B (en) * | 2021-03-22 | 2024-04-26 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
WO2022198820A1 (en) * | 2021-03-22 | 2022-09-29 | 北京搜狗科技发展有限公司 | Speech processing method and apparatus, and apparatus for speech processing |
CN113077808A (en) * | 2021-03-22 | 2021-07-06 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN113160845A (en) * | 2021-03-29 | 2021-07-23 | 南京理工大学 | Speech enhancement algorithm based on speech existence probability and auditory masking effect |
CN113329288A (en) * | 2021-04-29 | 2021-08-31 | 开放智能技术(南京)有限公司 | Bluetooth headset noise reduction method based on notch technology |
CN113053406A (en) * | 2021-05-08 | 2021-06-29 | 北京小米移动软件有限公司 | Sound signal identification method and device |
CN113314135A (en) * | 2021-05-25 | 2021-08-27 | 北京小米移动软件有限公司 | Sound signal identification method and device |
CN113314135B (en) * | 2021-05-25 | 2024-04-26 | 北京小米移动软件有限公司 | Voice signal identification method and device |
CN113362847A (en) * | 2021-05-26 | 2021-09-07 | 北京小米移动软件有限公司 | Audio signal processing method and device and storage medium |
CN114171052B (en) * | 2021-11-30 | 2025-01-28 | 深圳云知声信息技术有限公司 | A method, device, electronic device and storage medium for separating two-person voices |
CN114171052A (en) * | 2021-11-30 | 2022-03-11 | 深圳云知声信息技术有限公司 | Double voice separation method and device, electronic equipment and storage medium |
CN114639398B (en) * | 2022-03-10 | 2023-05-26 | 电子科技大学 | A Wideband DOA Estimation Method Based on Microphone Array |
CN114639398A (en) * | 2022-03-10 | 2022-06-17 | 电子科技大学 | Broadband DOA estimation method based on microphone array |
CN114822572A (en) * | 2022-04-18 | 2022-07-29 | 西北工业大学 | A filter bank-based speech enhancement method under low signal-to-noise ratio |
CN114974279A (en) * | 2022-05-10 | 2022-08-30 | 中移(杭州)信息技术有限公司 | Sound quality control method, device, equipment and storage medium |
CN114974279B (en) * | 2022-05-10 | 2024-10-25 | 中移(杭州)信息技术有限公司 | Sound quality control method, device, equipment and storage medium |
CN117409799A (en) * | 2023-09-25 | 2024-01-16 | 深圳市极客空间科技有限公司 | Audio signal processing system and method |
CN117409799B (en) * | 2023-09-25 | 2024-07-09 | 杭州来疯科技有限公司 | Audio signal processing system and method |
CN117238278B (en) * | 2023-11-14 | 2024-02-09 | 三一智造(深圳)有限公司 | Speech recognition error correction method and system based on artificial intelligence |
CN117238278A (en) * | 2023-11-14 | 2023-12-15 | 三一智造(深圳)有限公司 | Speech recognition error correction method and system based on artificial intelligence |
CN118553261A (en) * | 2024-07-25 | 2024-08-27 | 深圳市计通智能技术有限公司 | Directional sound source noise reduction method and medium of head-mounted AR equipment |
CN118553261B (en) * | 2024-07-25 | 2024-10-22 | 深圳市计通智能技术有限公司 | Directional sound source noise reduction method and medium of head-mounted AR equipment |
CN119252277A (en) * | 2024-12-05 | 2025-01-03 | 电子科技大学 | A method and device for processing audio signals based on machine learning algorithm catboost |
CN119252277B (en) * | 2024-12-05 | 2025-02-25 | 电子科技大学 | Audio signal processing method and device based on machine learning algorithm catboost |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106504763A (en) | Multi-target Speech Enhancement Method Based on Microphone Array Based on Blind Source Separation and Spectral Subtraction | |
JP6074263B2 (en) | Noise suppression device and control method thereof | |
WO2015196729A1 (en) | Microphone array speech enhancement method and device | |
CN107479030B (en) | Frequency division and improved generalized cross-correlation based binaural time delay estimation method | |
US8654990B2 (en) | Multiple microphone based directional sound filter | |
CN106226739A (en) | Merge the double sound source localization method of Substrip analysis | |
CN102157156B (en) | Single-channel voice enhancement method and system | |
CN101778322B (en) | Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic | |
CN107316648A (en) | A kind of sound enhancement method based on coloured noise | |
CN105225672B (en) | Merge the system and method for the dual microphone orientation noise suppression of fundamental frequency information | |
CN102411138A (en) | A method for robot sound source localization | |
CN108198568B (en) | Method and system for localizing multiple sound sources | |
CN102456351A (en) | Voice enhancement system | |
WO2015196760A1 (en) | Microphone array speech detection method and device | |
JP6225245B2 (en) | Signal processing apparatus, method and program | |
Velasco et al. | Novel GCC-PHAT model in diffuse sound field for microphone array pairwise distance based calibration | |
CN110310650A (en) | A Speech Enhancement Algorithm Based on Second-Order Differential Microphone Array | |
Hosseini et al. | Time difference of arrival estimation of sound source using cross correlation and modified maximum likelihood weighting function | |
Zhang et al. | A speech separation algorithm based on the comb-filter effect | |
Bavkar et al. | PCA based single channel speech enhancement method for highly noisy environment | |
Zhu et al. | Modified complementary joint sparse representations: a novel post-filtering to MVDR beamforming | |
CN111009259A (en) | Audio processing method and device | |
Firoozabadi et al. | Combination of nested microphone array and subband processing for multiple simultaneous speaker localization | |
JP2017181761A (en) | Signal processing device and program, and gain processing device and program | |
Wang | Speech enhancement using fiber acoustic sensor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170315 |