CN108597531B - A method to improve dual-channel blind signal separation by multi-source activity detection - Google Patents
A method to improve dual-channel blind signal separation by multi-source activity detection
- Publication number
- CN108597531B, CN201810265485.5A, CN201810265485A
- Authority
- CN
- China
- Prior art keywords
- channel
- signal
- sound source
- power
- output signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0272—Voice signal separating
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention discloses a method for improving dual-channel blind signal separation through multi-sound-source activity detection. The method performs blind signal separation based on the dual-channel TRINICON algorithm and compares the signal power before and after a preliminary separation pass. If the signal power of one output channel is significantly lower than that of the other output channel, the target sound source to be suppressed in that channel can be judged to be active; in this way, the activity of each target sound source can be determined for every segment of the data. The result of this multi-sound-source activity detection is then used to modify the TRINICON algorithm so that the filter coefficients are updated only with data in which the corresponding target sound source is active, thereby removing interference and improving speech-separation performance. The method effectively improves the separation performance of the TRINICON method in scenarios with intermittently interleaved and sparsely mixed sources.
Description
Technical Field
The invention relates to the technical field of blind signal separation, and in particular to a method, based on a dual-channel system, for improving blind-signal-separation performance through multi-sound-source activity detection, in which a multi-source activity detection algorithm is added to the frequency-domain blind signal separation process based on the TRINICON (Triple-N ICA for convolutive mixtures) structure.
Background
The problem of separating convolutive mixtures of unknown time series has important applications in many fields. An important example is the so-called cocktail-party problem, in which a single speech signal is extracted from a mixture of multiple speakers in a reverberant acoustic environment. Because of reverberation, the source signals of this separation problem are filtered by a linear multiple-input multiple-output (MIMO) system before being picked up by the microphone array. Blind signal separation (BSS) uses a microphone array to separate the signals of multiple sound sources without prior information, relying only on the basic assumption that the different source signals are mutually statistically independent (S. Makino, H. Sawada, and T. W. Lee, Blind Speech Separation, Springer Netherlands, 2007, pp. 169-192).
Blind signal separation based on independent component analysis (ICA) is an effective way to extract a desired speech signal from interfering speech. However, frequency-domain ICA suffers from a permutation ambiguity across frequency bins, which must be corrected by an appropriate repair mechanism based on inter-frequency correlation information. At present, this problem can be addressed by extending ICA to the multivariate case with the independent vector analysis (IVA) algorithm, or by using the TRINICON scheme, which is based on a broadband criterion. The TRINICON method can be implemented efficiently in the frequency domain using second-order statistics (SOS).
The offline frequency-domain TRINICON method generally performs best when the sound sources are mixed continuously. In practice, however, speech signals are sometimes intermittently interleaved or only sparsely mixed. The offline algorithm does not take the activity state of the speech into account and weights every segment equally in the computation, which degrades performance, especially when the sources all lie on one side of the midline of the microphone array and non-causal filters have to be generated.
Summary of the Invention
The purpose of the invention is to improve the performance of the frequency-domain TRINICON algorithm in scenarios with intermittently interleaved and sparsely mixed sources, by providing a method for improving dual-channel blind signal separation through multi-sound-source activity detection.
The technical solution adopted by the invention to solve the above technical problem is as follows:
A method for improving dual-channel blind signal separation through multi-sound-source activity detection, comprising the following steps:
(1) Initialize the filter matrix parameters of the frequency-domain TRINICON algorithm;
(2) Feed the sound-source signals received by the microphones into the dual-channel system, partition the input signal of each channel into blocks, and transform each block into the frequency domain with the short-time Fourier transform (STFT);
(3) Compute the output signal of each block as Y^(k) = X^(k) W^(k), where Y^(k), X^(k) and W^(k) are the frequency-domain output signal, the frequency-domain input signal and the frequency-domain filter coefficient matrix, respectively, and the superscript (k) is the index of the frequency bin of the short-time discrete Fourier transform; then compute the power spectral density matrix Φ_yy of the output signals;
(4) Update the filter coefficients by the natural gradient descent method, transform the filter coefficient matrix back to the time domain with the inverse short-time Fourier transform, and set to zero the time-domain filter coefficients whose index exceeds the filter length minus 1;
(5) Using the filter coefficients obtained in step (4), repeat steps (3) and (4) until the maximum number of iterations is reached;
(6) Obtain the preliminary converged filter coefficients and convolve them with the input signal using the overlap-save method to obtain the preliminary output signals of the first and second channels;
(7) First compute the band-limited output and input powers of each block in the two channels: E_x(m) denotes the input power and E_yp(m) the power of the preliminary output signal, where m is the block index, k_u and k_l are the upper and lower limits of the frequency bins included in the calculation, the subscript p is the index of the output channel, and the subscripts 1 and 2 denote the first and second channels, respectively;
Then judge the activity state of each block of the signal and obtain the corrected weight function;
(8) Reinitialize the weight function as in step (1) and correct the gradient expression so that block m contributes to output channel p only through a correction weight ε_p(m), where N_sig is the number of blocks in the whole signal; ε_p(m) equals 1 when the p-th sound source is active in the m-th block, and 0 otherwise;
Using the corrected gradient expression, repeat steps (3) and (4), i.e., run the iterations again; in this pass the filter coefficients of the first channel are updated only with data in which the target sound source to be suppressed by the first channel is active, and the filter coefficients of the second channel are updated only with data in which the target sound source to be suppressed by the second channel is active, until the maximum number of iterations is reached;
(9) Obtain the final converged filter coefficients and convolve them with the input signal using the overlap-save method to obtain the final output signals of the two channels. A high-level sketch of this two-pass procedure is given below.
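The following Python/NumPy sketch only outlines the two-pass structure of steps (1)-(9); it is not the patent's reference implementation. The helper functions `trinicon_pass`, `source_activity` and `overlap_save` are hypothetical stand-ins for the operations detailed later in the description.

```python
import numpy as np

def separate_two_pass(x, fs, L=1024, n_iter=200, mu=0.02):
    """Two-pass dual-channel separation: plain TRINICON, activity detection,
    then a second TRINICON pass restricted to active blocks.
    x: (n_samples, 2) array of microphone signals."""
    # Pass 1: ordinary frequency-domain TRINICON (steps 1-6).
    W1 = trinicon_pass(x, L=L, n_iter=n_iter, mu=mu, block_weights=None)
    y_prelim = overlap_save(x, W1)            # preliminary outputs y10, y20

    # Step 7: block powers and per-block activity decision for each source.
    eps = source_activity(x, y_prelim, L=L)   # shape (n_blocks, 2), entries 0/1

    # Steps 8-9: second pass, updating the filter of output channel p
    # only with blocks where the source suppressed by that channel is active.
    W2 = trinicon_pass(x, L=L, n_iter=n_iter, mu=mu, block_weights=eps)
    return overlap_save(x, W2)                # final outputs y1, y2
```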
The detection method of the invention can approximately identify the activity of each source, extract the effective active frames of each sound source, and adjust the filter-update process according to the result of the activity detection, which effectively improves the separation performance of the TRINICON method in scenarios with intermittently interleaved and sparsely mixed sources.
Description of the Drawings
Fig. 1 shows the signal model used in an embodiment of the invention: (a) only causal filters are required; (b) both causal and non-causal filters are required.
Fig. 2 is a flowchart of the multi-sound-source activity detection in an embodiment of the invention.
Fig. 3 is a flowchart of the frequency-domain blind signal separation algorithm of the invention with multi-sound-source activity detection added.
Fig. 4 shows the spatial responses of the final optimized filters in an embodiment of the invention: (a) output channel 1 of ordinary frequency-domain TRINICON; (b) output channel 2 of ordinary frequency-domain TRINICON; (c) output channel 1 of frequency-domain TRINICON with multi-sound-source activity detection; (d) output channel 2 of frequency-domain TRINICON with multi-sound-source activity detection.
Detailed Description
The invention is further described below with reference to the drawings and embodiments.
1. Dual-channel frequency-domain TRINICON algorithm
The invention is applicable to a dual-channel microphone array system with one or two sound sources; the signal model of the algorithm is shown in Fig. 1 (when there is only one source, only one of the two sources in the model emits sound).
The cost function of the TRINICON algorithm is built from the estimated D-dimensional multivariate probability density function of the q-th output channel and the estimated joint probability density of all output channels (here Q = 2), where m is the block index, j = 0, ..., N-1 indexes the time shift within a block of length N, and β(i, m) is a suitably normalized weight function.
Using SOS, the TRINICON cost function can be expressed as in (H. Buchner, R. Aichner, and W. Kellermann, "A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics," IEEE Transactions on Speech & Audio Processing, vol. 13, no. 1, pp. 120-134, 2004), where the bdiag operation sets the off-diagonal submatrices of a block matrix to zero.
The filter coefficients W are updated by the natural gradient descent method, where j_iter is the iteration index and μ is the step size.
In the frequency-domain method, the filter matrix and the signal are represented by their block Fourier transforms, where L is the filter length and F_{4L×4L} is the Fourier-transform matrix; underlined variables denote frequency-domain quantities.
The power spectral density matrix Φ_yy of the output signal and the power spectral density matrix Φ_xx of the input signal are computed from these frequency-domain quantities. The constraint matrices G of different dimensions serve to prevent decoupling between the individual frequency bins, thereby avoiding the permutation ambiguity and the circular-convolution effect of frequency-domain blind signal separation. When the computation is carried out in the frequency domain, approximating G by the identity matrix allows the iteration to be performed independently at each frequency bin (the block index m is omitted in the expressions).
The superscript (k) denotes the frequency-bin index in the STFT. After each iteration step, the demixing filter matrix must be transformed back to the time domain, and the values of w_{qq,l} for l > L-1 set to zero to avoid circular-convolution effects.
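A short sketch of one update iteration per frequency bin under the G ≈ I approximation, using the assumed natural-gradient form noted above, followed by the time-domain constraint of the preceding paragraph. The helper names, the step-size handling and the small regularization term are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def update_bins(W_f, Phi_yy, mu=0.02, reg=1e-9):
    """One assumed natural-gradient step per frequency bin (G approximated by I).
    W_f, Phi_yy: complex arrays of shape (n_bins, 2, 2)."""
    for k in range(W_f.shape[0]):
        P = Phi_yy[k]
        bdiag = np.diag(np.diag(P))          # keep only the auto-PSD terms
        off = P - bdiag                      # cross-PSDs to be driven to zero
        grad = W_f[k] @ off @ np.linalg.inv(bdiag + reg * np.eye(2))
        W_f[k] = W_f[k] - mu * grad
    return W_f

def enforce_time_constraint(W_f, L):
    """Transform back to the time domain, zero coefficients with index > L-1
    to avoid circular convolution, and return to the frequency domain."""
    w_t = np.fft.ifft(W_f, axis=0)
    w_t[L:] = 0.0
    return np.fft.fft(w_t, axis=0)
```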
2. Dual-channel frequency-domain TRINICON with multi-sound-source activity detection
In the offline TRINICON method, when the signals are not continuously mixed, and especially when there are long stretches with no speech or with only one active source, the estimate of the power spectral density matrix Φ becomes biased, steering the separation system toward the wrong convergence direction and yielding poor speech separation. The situation is even more pronounced in the case of Fig. 1(b), where non-causal filters are required.
The invention runs the offline TRINICON twice, uses the preliminary result of the first converged pass to perform a rough multi-sound-source activity estimate, and adjusts the second pass of iterations according to that estimate, thereby obtaining improved speech separation. For the case where non-causal filters are required, the conventional algorithm is initialized with a unit impulse response shifted by L/2, but in the frequency-domain algorithm this initialization converges poorly. In this embodiment, the filter is instead initialized with a unit impulse response whose shift, in samples, is slightly larger than the maximum acoustic path difference between the two microphones.
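A minimal sketch of this initialization: the diagonal filters are shifted unit impulses whose delay slightly exceeds the maximum inter-microphone delay, and the cross filters start at zero. The margin of a few samples and the speed of sound are illustrative assumptions.

```python
import numpy as np

def init_filters(L, mic_spacing_m, fs, c=343.0, margin=3):
    """Initialize a 2x2 FIR demixing filter set of length L.
    The shift is a few samples more than the maximum acoustic path
    difference between the two microphones (mic_spacing_m / c seconds)."""
    max_delay = int(np.ceil(mic_spacing_m / c * fs))   # in samples
    shift = max_delay + margin
    w = np.zeros((2, 2, L))
    w[0, 0, shift] = 1.0   # w11: shifted unit impulse
    w[1, 1, shift] = 1.0   # w22: shifted unit impulse
    # w12 and w21 remain zero vectors
    return w
```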
First, the band-limited input and output powers of each block are computed, where k_u and k_l are the upper and lower limits of the frequency bins included in the calculation (the subscripts u and l stand for upper bound and lower bound). These limits can be adjusted according to the spectral characteristics of the signal, both to estimate the signal power better and to save computation. Note that blind signal separation generally performs poorly in the low-frequency band, so the low-frequency band is usually excluded from the calculation.
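A sketch of the band-limited block-power computation, assuming the block STFT spectra are available as arrays; the exact normalization and whether the input power is summed over both microphones are not specified in the text and are assumptions here.

```python
import numpy as np

def block_powers(X_blocks, Y_blocks, k_l, k_u):
    """Band-limited power of each block.
    X_blocks, Y_blocks: complex STFT arrays of shape (n_blocks, n_bins, 2)
    (two input channels / two preliminary output channels).
    Only bins k_l..k_u are included, excluding the unreliable low band."""
    band = slice(k_l, k_u + 1)
    E_x = np.sum(np.abs(X_blocks[:, band, :]) ** 2, axis=(1, 2))  # input power per block
    E_y = np.sum(np.abs(Y_blocks[:, band, :]) ** 2, axis=1)       # (n_blocks, 2) output power
    return E_x, E_y
```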
After the input and output powers of each channel have been computed, a rough judgement of the activity of the two sources in each block can be made from the power comparison before and after the first pass. The specific decision steps are shown in Fig. 2. Assume that source s1 is suppressed in output channel y1. First, if the power of the input signal is below a preset minimum power threshold E_min (which can be determined from the stationary noise power estimated by methods such as MCRA), both sources are judged to be silent, and ε_1(m) = 0, ε_2(m) = 0 are set. If the power exceeds the threshold, at least one source is judged to be active, and the next decision follows: if the before/after power ratio of channel y1 is greater than the before/after power ratio of channel y2 multiplied by the coefficient λ, source s1 is considered inactive and source s2 active, and ε_1(m) = 0, ε_2(m) = 1 are set; if the before/after power ratio of channel y2 is greater than that of channel y1 multiplied by λ, source s2 is considered inactive and source s1 active, and ε_1(m) = 1, ε_2(m) = 0 are set; if none of these conditions holds, both sources are considered active and ε_1(m) = 1, ε_2(m) = 1 are set. The coefficient λ can be adjusted according to the specific sources and usage scenario and is generally a real number greater than 1.
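The decision rule of Fig. 2 translated into a short sketch. It assumes, as in the text, that source s1 is suppressed in output y1, and that the "before/after power ratio" is taken as output power over input power; the default value of lam (the coefficient λ) is an assumption.

```python
import numpy as np

def activity_flags(E_x, E_y, E_min, lam=2.0):
    """Per-block activity flags eps[m] = (eps1, eps2) for sources s1, s2.
    E_x: (n_blocks,) input power; E_y: (n_blocks, 2) preliminary output power."""
    eps = np.zeros((len(E_x), 2), dtype=int)
    for m in range(len(E_x)):
        if E_x[m] < E_min:               # neither source emits sound
            continue                     # eps1 = eps2 = 0
        r1 = E_y[m, 0] / E_x[m]          # power ratio of channel y1
        r2 = E_y[m, 1] / E_x[m]          # power ratio of channel y2
        if r1 > lam * r2:                # s1 (suppressed in y1) inactive, s2 active
            eps[m] = (0, 1)
        elif r2 > lam * r1:              # s2 (suppressed in y2) inactive, s1 active
            eps[m] = (1, 0)
        else:                            # both sources active
            eps[m] = (1, 1)
    return eps
```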
After the activity state of the sources has been judged, the gradient of equation (9) is corrected with block-dependent weights, where N_sig is the number of blocks in the whole signal and ε_p(m) is the correction weight: ε_p(m) equals 1 when the p-th sound source is active in the m-th block, and 0 otherwise.
The coefficients are then updated with the corrected gradient expression (12): only data in which source s1 is active are used to update the filter of output channel y1, and only data in which source s2 is active are used to update the filter of output channel y2. In this way, effective frames are selected separately for updating the two filters, which improves convergence and effectively improves speech separation. Note that the multi-source activity detection described above is fairly accurate when the initial signal-to-interference ratio is below 10 dB, but it is not exact; in this embodiment it is only necessary to determine roughly the time periods during which each source is active.
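A sketch of how the activity flags enter the second pass: the per-block weight for output channel p keeps only the blocks with ε_p(m) = 1, so inactive blocks contribute nothing to that channel's gradient. Renormalizing over the active blocks (rather than keeping the uniform 1/N_sig weight) is an assumption.

```python
import numpy as np

def corrected_weights(eps):
    """Per-channel block weights for the second TRINICON pass.
    eps: (n_blocks, 2) 0/1 activity flags; returns B of the same shape,
    with each column normalized over the blocks that are actually used."""
    B = eps.astype(float)
    for p in range(2):
        n_active = B[:, p].sum()
        if n_active > 0:
            B[:, p] /= n_active   # replaces the uniform 1/N_sig weighting
    return B
```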
The flow of the complete adjusted algorithm is shown in Fig. 3. Using the input signals x1 and x2 received by the microphones, a first speech-separation pass is performed with the ordinary frequency-domain TRINICON method; the separation result of this first pass is then used, with the multi-source activity detection described above, to determine the active time periods of the two sources; finally, the two sets of filter coefficients are updated using the respective active periods of the two sources, and the adjusted second frequency-domain TRINICON pass yields the final output signals y1 and y2.
Fig. 4 shows the spatial responses of the filters obtained, on one set of simulated data, by the ordinary frequency-domain TRINICON method and by the frequency-domain TRINICON method with multi-source activity detection added. The source positions are -70 degrees and -15 degrees, the room reverberation time is 0.2 s, and the microphone array element spacing is 10 cm.
It can be seen that with multi-source activity detection added, the nulls obtained by the filter optimization are more accurate and deeper, and the suppression of the sources is more pronounced.
Table 1 gives simulation results for several data sets, comparing the improvement in separation achieved by the invention under different scenarios. The simulated input signals are obtained by convolving clean speech signals with room impulse responses generated by the Image Model tool. The metrics are the signal-to-interference ratio (SIR) and the signal-to-distortion ratio (SDR), evaluated with the BSSEVAL toolbox. The scores given in the table are the averages of the two channels over 4 experiments. A higher SIR means the interfering sound is better suppressed and the target speech is purer; a higher SDR means the output speech is less distorted relative to the clean speech.
In most cases the improvement in separation achieved by the invention is very clear: the SIR improvement can reach 5 dB to more than 10 dB, while the sound quality remains without serious distortion.
Table 1. SIR and SDR scores of the ordinary method and of the invention under different scenarios
Example
1. Test conditions:
The room size is set to 6 m × 6 m × 4 m, the reverberation time to 250 ms, and the spacing of the microphone array elements to 10 cm; the array is placed near the centre of the room. The sources are 1.5 m from the array centre, in the -70-degree and -15-degree directions, respectively. The clean signals of the two sources are taken from the TIMIT database; each source is active for 60% of the total duration of the recorded signal, and for about one third of the total duration the two sources are active simultaneously. The signal-to-interference ratio (SIR) of the two source signals is set to 0 dB, i.e. their powers are essentially equal. In addition, white noise 30 dB below the source signals is added to the whole signal. The sampling frequency is 16000 Hz. The Image Model is used to generate room impulse responses under these conditions, which are convolved with the clean speech to obtain the input signals.
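A rough sketch of the input generation under these test conditions. The exact array position, height and angle convention are not fully specified in the text and are assumptions; the crude exponentially decaying random impulse response is only a runnable stand-in for the Image Model tool used in the patent.

```python
import numpy as np

fs = 16000
room = (6.0, 6.0, 4.0)                       # m, reverberation time 250 ms
mic_center = np.array([3.0, 3.0, 1.5])       # assumed position near the room centre
d = 0.10                                     # microphone spacing (m)
mics = [mic_center + (-d / 2, 0, 0), mic_center + (d / 2, 0, 0)]

def src_pos(angle_deg, r=1.5):
    a = np.deg2rad(angle_deg)
    return mic_center + r * np.array([np.cos(a), np.sin(a), 0.0])

sources = [src_pos(-70.0), src_pos(-15.0)]   # two TIMIT talkers, SIR 0 dB

def crude_rir(src, mic, rt60=0.25, c=343.0, n=4096):
    """Stand-in RIR: direct-path delay plus exponentially decaying noise tail."""
    delay = int(np.linalg.norm(src - mic) / c * fs)
    h = np.random.randn(n) * np.exp(-6.9 * np.arange(n) / (rt60 * fs)) * 0.05
    h[delay] += 1.0
    return h

def mix(speech):
    """speech: list of two equal-length clean signals; returns (n, 2) mic signals."""
    x = np.zeros((len(speech[0]), 2))
    for s, pos in zip(speech, sources):
        for i, mic in enumerate(mics):
            x[:, i] += np.convolve(s, crude_rir(pos, mic))[: len(s)]
    # white noise roughly 30 dB below the mixed source signals
    x += np.std(x) * 10 ** (-30 / 20) * np.random.randn(*x.shape)
    return x
```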
2. Algorithm flow:
1) Algorithm initialization: w_11 and w_22 in equation (5) are set to shifted unit impulse responses with a shift of 10 samples (i.e. w_{11,10} = 1 and all other values 0), and w_12 and w_21 are set to zero vectors; the step size μ is set to 0.02, the maximum number of iterations to 200, and the filter length L to 1024. The frame shift is set to half of L. The weight function is set to 1/N_sig.
2) The signals x_1 and x_2 received by the microphones are input; following equation (6), the input signals are partitioned into blocks of length 4L with a frame shift of L/2 and transformed into the frequency domain with the short-time Fourier transform.
3) The output signal of each block is computed as Y^(k) = X^(k) W^(k), and the power spectral density matrix Φ_yy of the output signal is computed according to equation (7).
4) According to equations (3) and (9), the filter coefficients W are updated by the natural gradient descent method, transformed back to the time domain with the inverse short-time Fourier transform, and the part of w beyond index L-1 is set to zero.
5) Using the updated filter coefficients W, steps 3) and 4) are repeated until the maximum number of iterations is reached.
6) The preliminary converged filter coefficients W are obtained, and the input signals are convolved with them using the overlap-save method to obtain the preliminary output signals y_10 and y_20 of the two channels.
7) The output and input powers of each block of the two channels are computed according to equations (10) and (11); following the procedure of Fig. 2, the activity state of each block is judged and the corrected weight function B(m) is obtained.
8) The weight function is reinitialized as in step 1), equation (9) in step 4) is replaced by equation (12), and steps 3) and 4) are repeated, i.e. the iterations are run again; in this pass only data in which source s1 is active are used to update the filter coefficients of channel y1, and only data in which source s2 is active are used to update the filter coefficients of channel y2, until the maximum number of iterations is reached.
9) The final converged filter coefficients W are obtained, and the input signals are convolved with them using the overlap-save method to obtain the final output signals y_1 and y_2 of the two channels. The main parameter values used in this example are collected in the configuration sketch below.
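The concrete parameter values of this example, gathered in one place as a configuration sketch. The names are illustrative; k_l and k_u are not given numerically in the text and are left as placeholders to be chosen according to the spectral-band discussion above.

```python
CONFIG = dict(
    fs=16000,            # sampling rate (Hz)
    L=1024,              # filter length
    block_len=4 * 1024,  # analysis block length 4L
    frame_shift=512,     # L / 2
    mu=0.02,             # step size
    n_iter=200,          # maximum iterations per pass
    init_shift=10,       # samples: shifted unit impulses w11,10 = w22,10 = 1
    # k_l, k_u: lower/upper frequency-bin limits for the block-power
    # computation; choose them to exclude the unreliable low band.
)
```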
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810265485.5A CN108597531B (en) | 2018-03-28 | 2018-03-28 | A method to improve dual-channel blind signal separation by multi-source activity detection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810265485.5A CN108597531B (en) | 2018-03-28 | 2018-03-28 | A method to improve dual-channel blind signal separation by multi-source activity detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108597531A CN108597531A (en) | 2018-09-28 |
CN108597531B true CN108597531B (en) | 2021-05-28 |
Family
ID=63623822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810265485.5A Active CN108597531B (en) | 2018-03-28 | 2018-03-28 | A method to improve dual-channel blind signal separation by multi-source activity detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108597531B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111178398B (en) * | 2019-12-13 | 2023-08-22 | 天翼商业保理有限公司 | Method, system, storage medium and device for detecting tampering of identity card image information |
CN111696573B (en) * | 2020-05-20 | 2023-04-28 | 北京地平线机器人技术研发有限公司 | Sound source signal processing method and device, electronic equipment and storage medium |
CN113823316B (en) * | 2021-09-26 | 2023-09-12 | 南京大学 | A speech signal separation method for positions close to the sound source |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1017253B1 (en) * | 1998-12-30 | 2012-10-31 | Siemens Corporation | Blind source separation for hearing aids |
US7222961B2 (en) * | 2002-08-05 | 2007-05-29 | Kestrel Corporation | Method for detecting a functional signal in retinal images |
CN100392723C (en) * | 2002-12-11 | 2008-06-04 | 索夫塔马克斯公司 | Speech processing system and method using independent component analysis under stability constraints |
US7464029B2 (en) * | 2005-07-22 | 2008-12-09 | Qualcomm Incorporated | Robust separation of speech signals in a noisy environment |
DE102006027673A1 (en) * | 2006-06-14 | 2007-12-20 | Friedrich-Alexander-Universität Erlangen-Nürnberg | Signal isolator, method for determining output signals based on microphone signals and computer program |
CN106023984A (en) * | 2016-04-28 | 2016-10-12 | 成都之达科技有限公司 | Speech recognition method based on car networking |
CN106887238B (en) * | 2017-03-01 | 2020-05-15 | 中国科学院上海微系统与信息技术研究所 | Sound signal blind separation method based on improved independent vector analysis algorithm |
-
2018
- 2018-03-28 CN CN201810265485.5A patent/CN108597531B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN108597531A (en) | 2018-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Aichner et al. | A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments | |
CN110010148B (en) | Low-complexity frequency domain blind separation method and system | |
CN108141691B (en) | Adaptive Reverberation Cancellation System | |
US20080228470A1 (en) | Signal separating device, signal separating method, and computer program | |
Lim et al. | Robust multichannel dereverberation using relaxed multichannel least squares | |
Wang et al. | A region-growing permutation alignment approach in frequency-domain blind source separation of speech mixtures | |
CN108597531B (en) | A method to improve dual-channel blind signal separation by multi-source activity detection | |
Crocco et al. | Room impulse response estimation by iterative weighted l1-norm | |
US10679642B2 (en) | Signal processing apparatus and method | |
CN114220453A (en) | Multi-channel non-negative matrix decomposition method and system based on frequency domain convolution transfer function | |
Bourgeois et al. | Time-domain beamforming and blind source separation: speech input in the car environment | |
Crocco et al. | Estimation of TDOA for room reflections by iterative weighted l1 constraint | |
CN112201276A (en) | Microphone array speech separation method based on TC-ResNet network | |
CN106024001A (en) | Method used for improving speech enhancement performance of microphone array | |
Málek et al. | Sparse target cancellation filters with application to semi-blind noise extraction | |
Dam et al. | Blind signal separation using steepest descent method | |
Mazur et al. | Robust room equalization using sparse sound-field reconstruction | |
CN115132162B (en) | An active noise control method based on sound source separation | |
Bertrand et al. | Distributed LCMV beamforming in wireless sensor networks with node-specific desired signals | |
Zeng et al. | Distributed delay and sum beamformer in regular networks based on synchronous randomized gossip | |
Wang et al. | Multichannel Linear Prediction-Based Speech Dereverberation Considering Sparse and Low-Rank Priors | |
Bertrand et al. | Distributed distortionless signal estimation in wireless acoustic sensor networks | |
Osterwise et al. | A comparison of BSS algorithms in harsh environments | |
Kinoshita et al. | Blind source separation using spatially distributed microphones based on microphone-location dependent source activities. | |
TWI356398B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||