1. Introduction
Sound source separation is a long-standing research field in audio signal processing, with studies dating back to the middle of the twentieth century. It plays an important role in many areas, such as automatic speech recognition (ASR) [1,2] and speech coding [3].
In the early years, researchers mainly focused on separation methods based on statistical models. The hidden Markov model (HMM) can describe the short-time stationary characteristics of a speech signal and is therefore widely used in speech separation [4]. The relationship between the mixed signal and the source signals is modeled by multiple independent HMMs, and the model parameters are then estimated to recover the separated signals [5]. Subsequently, a separation method based on computational auditory scene analysis (CASA) was proposed in [6]. First, auditory peripheral analysis is performed on the mixed signal to obtain acoustic features such as the amplitude spectrum; these features are segmented into intermediate representations, and the source signals are then separated by acoustic recombination. However, it is difficult for an auditory scene model to fully describe a real acoustic scene, so CASA-based separation is limited in complex environments. Later, separation methods based on beamforming, divided into fixed beamforming [7] and adaptive beamforming [8], became popular. The main idea of these methods is to obtain a separation matrix. Recently, an improved beamforming method has been proposed in which "single source zone" detection is introduced into linearly constrained minimum variance beamforming to achieve multi-source separation [9]. However, its performance is poor in highly reverberant environments. Researchers have also paid attention to separation based on independent component analysis (ICA), which assumes that the source signals are statistically independent [10]. The observed mixed signal is decomposed into a linear sum of statistically independent components through a series of linear transformations, yielding the separated signals. ICA-based separation is suitable for low-reverberation conditions. Nevertheless, when the number of sound sources exceeds the number of microphones, it is difficult to obtain separated signals of high quality.
In the last few years, approaches based on the sparsity of speech signals have been proposed. The W-disjoint orthogonality (W-DO) hypothesis was first proposed in [11]; it states that when two or more sound sources are active, at most one source dominates any given time-frequency (TF) point. The core of W-DO-based source separation is to find the TF points dominated by only one source, which can be achieved in different ways. For example, the direction of arrival (DOA) of a sound source provides guidance for finding these TF points [12]. Compared with earlier methods, separation based on the W-DO hypothesis has lower computational complexity [13]. However, as the number of sound sources increases, the W-DO property weakens and the performance of W-DO-based approaches decreases. Improved methods have therefore been proposed that handle both the TF components dominated by one source (i.e., sparse components) and the remaining TF components (i.e., non-sparse components). The separation method proposed in [14] uses a clustering algorithm with a dynamic threshold for sparse component separation; the non-sparse components are then recovered using the "local-zone stationarity" assumption.
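The W-DO idea can be made concrete with a toy sketch: given oracle magnitude spectrograms of two sources, each TF point is assigned to the dominant source with a binary mask. This is illustrative Python with hypothetical names; in a real system the masks must be estimated (e.g., from DOA cues), since oracle magnitudes are unavailable.

```python
def wdo_masks(mag1, mag2):
    """Binary W-DO masks: assign each TF point to whichever source's
    magnitude dominates there (at most one source per TF point)."""
    mask1 = [[1 if a >= b else 0 for a, b in zip(r1, r2)]
             for r1, r2 in zip(mag1, mag2)]
    mask2 = [[1 - m for m in row] for row in mask1]
    return mask1, mask2

def apply_mask(mix_mag, mask):
    """Keep only the TF points assigned to one source."""
    return [[m * v for m, v in zip(mr, vr)]
            for mr, vr in zip(mask, mix_mag)]
```

Note that at a kept TF point the mixture still contains the weaker source's contribution; W-DO only guarantees that this residual interference is small when the sources rarely overlap, which is exactly the property that weakens as the source count grows.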
In recent years, deep neural networks (DNNs) have been introduced into sound source separation [15,16] and have dramatically improved separation performance [17]. The TF mask, which represents the energy ratio of each sound source at a TF point, is generally taken as the training target of the network. In a DNN-based separation method, an expectation maximization (EM) algorithm and a multichannel Wiener filter can also be combined to model the source spectrum and spatial information [18], which greatly improves the quality of the separated speech. In addition, methods combining more than one DNN have been investigated. A DNN-based method combining a dereverberation module and a separation module for reverberant single-channel speaker separation was provided in [19]. A two-stage network combining an RNN and a DNN was proposed in [20] to separate multiple sound sources in reverberant environments. Compared with previous DNN-based methods, these separation modules achieve better performance in reverberant environments.
Nevertheless, sound source separation in reverberant environments remains a difficult problem that needs further investigation. In a reverberant environment, the recorded signal contains the direct component of the sound source and a reverberant component produced by reflections from the room and obstacles. In the TF domain, each TF point can be considered to be composed of a direct component and a reverberant component. The reverberant component changes the spectral structure of the sound source signal and reduces the quality of the speech, which in turn lowers the quality of the separated source signals.
In the proposed method, the dereverberation module proposed in [21] is introduced as a pre-processing step. The dereverberation module aims to extract the clean mixed signal from the reverberant mixed signal recorded by a sound field microphone. The multi-source separation framework for the dereverberated mixed signal is divided into two parts: sparse component point recovery and non-sparse component point recovery. In previous work, we proposed a multi-source localization algorithm which provides the information required for separation, such as the number of sound sources and their DOA estimates [22]. We first determine the sparse and non-sparse component points by comparing the DOA estimate of each TF point with those of the real sound sources. The sparse component points are dominated solely by the direct component of one sound source. In contrast, the non-sparse component points are composed of the components of multiple sound sources and/or the component of room reflection. We then adopt different separation methods for the two kinds of points. Specifically, we first propose a sparse component point recovery method based on a DOA cue, where the DOA cue refers to the DOA estimates of the sound sources. Subsequently, the TF mask of each non-sparse component point is estimated through regression training of a DNN, from which the separated components of the non-sparse component points are obtained. Finally, post-processing, including separated signal matching and smoothing, is applied to obtain the separated signal of each sound source. The block diagram of the proposed multi-source separation algorithm is shown in Figure 1, where "Format Transformation" means that the A-format signal recorded by a sound field microphone is transformed into the B-format signal [23], and "W-channel" means the W-channel signal in B-format. In [24], we discussed a preliminary idea for multi-source separation. Different from [24], a DOA cue is used here to realize sparse component point recovery, and we focus on the effect of the dereverberation module on separation performance. In addition, extensive experiments have been performed to verify the effectiveness of the proposed method.
The remainder of the paper is organized as follows: The dereverberation model in the proposed multi-source separation method is introduced in Section 2. Section 3 presents the multiple-sound-source separation method. The performance evaluation for the proposed method is shown in Section 4, and the conclusion is drawn in Section 5.
2. Mixed Signal Dereverberation
A reverberant component is always present in a speech signal recorded in a real environment. This component degrades signal quality, makes it difficult to separate the original source signals from the reverberant mixture, and thus leads to poor performance of separation algorithms. Therefore, it is very important to suppress the effect of reverberation before sound source separation.
In this paper, a multi-source separation method for reverberant environments is proposed, which consists of the dereverberation module proposed in [21] and a separation module. The dereverberation module removes the reverberant component from the recorded signal while maintaining the direct component of each source. It is implemented in the TF domain and consists of two stages: the first stage learns the spectrum structure and obtains the dictionary matrix; in the second stage, a non-negative matrix factorization (NMF) model is used to represent the signal spectrum and obtain the optimal estimate of the sound source signal.
2.1. Recorded Signal Model
In the time domain, the signal recorded by the microphone can be modeled as the source signal convolved with the room impulse response (RIR). However, for speech signals, we often work in the TF domain rather than in the time domain [25]. The Short-Time Fourier Transform (STFT) is used to obtain the time-frequency representation of the recorded signal. Here, $S(t,f)$, $X(t,f)$ and $H(t,f)$ are the TF-domain coefficients of the clean signal $s(n)$, the reverberant signal $x(n)$, and the RIR $h(n)$, respectively. So, $X(t,f)$ can be written as [21]:

$$X(t,f) = \sum_{d=0}^{D} H(d,f)\, S(t-d,f),$$

where $D$ is the reverberation order, which reflects the reverberation level. The expected value of $|X(t,f)|^2$ is denoted by $\bar{X}(t,f)$, which can be expressed as follows:

$$\bar{X}(t,f) = \sum_{d=0}^{D} |H(d,f)|^2\, |S(t-d,f)|^2,$$

where $|S(t,f)|^2$ and $|H(d,f)|^2$ are the squared magnitudes of $S(t,f)$ and $H(d,f)$, respectively; $d = 0, 1, \ldots, D$; and $t$ and $f$ are the time index and the frequency index, respectively.
Subsequently, we build a spectrogram model of the sound source signal using the NMF method in order to recover the source signal from the recorded signal. We assume that the non-negative spectrogram of the sound source signal, $\mathbf{S} = [\,|S(t,f)|^2\,] \in \mathbb{R}_{+}^{F \times T}$, can be represented by two non-negative matrices $\mathbf{W} \in \mathbb{R}_{+}^{F \times K}$ and $\mathbf{A} \in \mathbb{R}_{+}^{K \times T}$, where $K \le \min(F,T)$. Then $\mathbf{S}$ can be written as follows by an NMF model:

$$\mathbf{S} \approx \mathbf{W}\mathbf{A},$$

where "$\approx$" represents approximate equality, and $\mathbf{W}$ and $\mathbf{A}$ are the dictionary matrix and the activation matrix, respectively. For convenience, we use $G(d,f)$ to denote $|H(d,f)|^2$ in the following. Thereafter, the expected-value model above can be rewritten as the following model of the recorded signal:

$$\bar{X}(t,f) = \sum_{d=0}^{D} G(d,f)\, [\mathbf{W}\mathbf{A}]_{f,\,t-d}.$$
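The two model ingredients above can be sketched numerically. This is an illustrative Python toy (hypothetical variable names and dimensions, not the paper's implementation): the NMF product $\mathbf{S} \approx \mathbf{W}\mathbf{A}$ and the band-wise convolutive combination with the squared RIR magnitudes $G(d,f)$.

```python
def nmf_product(W, A):
    """S = W A : an (F x K) dictionary times a (K x T) activation matrix
    gives a nonnegative (F x T) spectrogram."""
    F, K, T = len(W), len(A), len(A[0])
    return [[sum(W[f][k] * A[k][t] for k in range(K)) for t in range(T)]
            for f in range(F)]

def recorded_model(G, S):
    """Band-wise convolutive model: Xbar(t,f) = sum_d G(f,d) * S(f, t-d),
    i.e., each frequency band of the source spectrogram is smeared over
    time by the squared RIR magnitudes of that band."""
    F, T = len(S), len(S[0])
    D = len(G[0]) - 1  # reverberation order
    X = [[0.0] * T for _ in range(F)]
    for f in range(F):
        for t in range(T):
            for d in range(D + 1):
                if t - d >= 0:
                    X[f][t] += G[f][d] * S[f][t - d]
    return X
```

The nested structure makes the modeling assumption explicit: reverberation couples a TF point only to earlier frames in the *same* frequency band, which is what makes the per-band NMF treatment tractable.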
2.2. Cost Function
We obtained the model $\bar{X}(t,f)$ of the recorded signal in Section 2.1; in this subsection, $Y(t,f)$ denotes the spectrogram of the recorded signal with reverberation. Since the NMF model represents an approximate equality between $\mathbf{S}$ and $\mathbf{W}\mathbf{A}$, the model $\bar{X}(t,f)$ is also an approximate representation of the recorded signal $Y(t,f)$. Different measures can be used to quantify the accuracy of this approximation, such as the Euclidean distance. Here, we employ the generalized Kullback-Leibler (KL) divergence to obtain the optimal matrices $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$, so that $\bar{X}(t,f)$ provides an accurate approximation of $Y(t,f)$ [25]. The spectrogram of the sound source signal can then be obtained as the product of the optimal matrices $\mathbf{W}$ and $\mathbf{A}$.

The generalized KL divergence between $Y$ and $\bar{X}$ is defined as:

$$D\big(Y \,\|\, \bar{X}\big) = \sum_{f,t} \left( Y(t,f)\, \log \frac{Y(t,f)}{\bar{X}(t,f)} - Y(t,f) + \bar{X}(t,f) \right),$$

where $Y(t,f)$ is the spectrogram of the recorded signal at the $f$-th frequency point and the $t$-th frame, and $D(Y \,\|\, \bar{X})$ is the value of the divergence.
However, many combinations of the matrices $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$ can achieve a similar divergence. Penalty functions for $\mathbf{A}$ and $\mathbf{G}$ are therefore introduced to narrow the range of admissible solutions and obtain the optimal $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$ more efficiently [21]. Hence, the cost function combining the generalized KL divergence and the penalty functions is defined as follows:

$$C(\mathbf{W}, \mathbf{A}, \mathbf{G}) = D\big(Y \,\|\, \bar{X}\big) + \Phi_{\mathbf{A}}(\mathbf{A}) + \Phi_{\mathbf{G}}(\mathbf{G}),$$

where $\Phi_{\mathbf{A}}(\mathbf{A})$ and $\Phi_{\mathbf{G}}(\mathbf{G})$ are the penalty functions for $\mathbf{A}$ and $\mathbf{G}$, respectively; $\Phi_{\mathbf{A}}$ is determined based on the TF spectrum structure of the recorded signal, whereas $\Phi_{\mathbf{G}}$ is determined based on the logarithmic spectrum structure of the RIR.
First, we determine the penalty function $\Phi_{\mathbf{A}}$ by analyzing the characteristics of the TF spectrum of the recorded signal with reverberation. An example is shown in Figure 2, where Figure 2a is the TF spectrum of a clean sound source signal sampled at 16 kHz. The recorded signal with reverberation was simulated with the Roomsim software [26]; a sound field microphone is located at the center of the simulated room, and the reverberation time ($T_{60}$) is 300 ms. Figure 2b shows the TF spectrum of the recorded signal (i.e., the W-channel signal in B-format) with $T_{60} = 300$ ms. It can be clearly seen from Figure 2 that the harmonic structure in the TF spectrum of the clean signal is distinct, while that of the reverberant recording is blurred and the formant peaks are not clear.
Therefore, a penalty function is imposed on the activation matrix $\mathbf{A}$ to highlight the harmonic structure in the TF spectrum and enhance the sparsity of the representation. Since the $\ell_1$ norm can be used to represent the sparsity of a matrix, we use it to construct the penalty function for $\mathbf{A}$; for a non-negative matrix, the $\ell_1$ norm is simply the sum of all elements. By weighting the elements of $\mathbf{A}$ with a non-negative parameter and summing the weighted elements, the penalty function can be expressed as:

$$\Phi_{\mathbf{A}}(\mathbf{A}) = \lambda \sum_{k,t} a_{k,t},$$

where $\lambda$ is the non-negative parameter for $\Phi_{\mathbf{A}}$ and $a_{k,t}$ is the $(k,t)$-th element of $\mathbf{A}$. The penalty function for $\mathbf{A}$ speeds up the convergence of the cost function and makes the estimated spectrogram closer to the spectrogram of the sound source signal.
Next, the penalty function for $\mathbf{G}$ is determined according to the characteristics of the spectrum of the RIR. The logarithmic spectrum of a simulated RIR presents a smooth attenuation structure, as exhibited in Figure 3, where the energy decays over time. The penalty function for $\mathbf{G}$ is added to promote this attenuation structure [27] and can be expressed as:

$$\Phi_{\mathbf{G}}(\mathbf{G}) = \mu \sum_{f} \big\| \boldsymbol{\Delta}\, \mathbf{g}_f^{\mathsf{T}} \big\|_2^2,$$

where $\mu$ is a non-negative parameter for $\Phi_{\mathbf{G}}$; $\mathbf{g}_f$ is the $f$-th row of $\mathbf{G}$, and $\boldsymbol{\Delta}$ represents a finite difference matrix. In addition, "$(\cdot)^{\mathsf{T}}$" means the transpose of a matrix, and "$\|\cdot\|_2^2$" is the square of the $\ell_2$ norm.
Therefore, the cost function defined above can be rewritten as:

$$C(\mathbf{W}, \mathbf{A}, \mathbf{G}) = D\big(Y \,\|\, \bar{X}\big) + \lambda \sum_{k,t} a_{k,t} + \mu \sum_{f} \big\| \boldsymbol{\Delta}\, \mathbf{g}_f^{\mathsf{T}} \big\|_2^2.$$

In Section 2.3, the optimization of $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$ is introduced to minimize the cost function $C$ and obtain the optimal estimate of the sound source spectrogram.
2.3. Optimization
The optimization process is divided into two parts. The first part initializes the cost function so that it depends only on $\mathbf{W}$, and then obtains the optimal $\mathbf{W}$ through iteration. The second part substitutes the optimal $\mathbf{W}$ into the cost function and obtains the optimal $\mathbf{A}$ and $\mathbf{G}$ by minimizing it.

First, auxiliary functions are defined for $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$, respectively [21]. To initialize the cost function, $\mathbf{A}$ and $\mathbf{G}$ are fixed at their initial values, so that $C$ depends only on $\mathbf{W}$. According to the definition of an auxiliary function [28], for an auxiliary function $C^{+}$ satisfying $C^{+}(\mathbf{W}, \tilde{\mathbf{W}}) \ge C(\mathbf{W})$ and $C^{+}(\mathbf{W}, \mathbf{W}) = C(\mathbf{W})$, iteratively minimizing $C^{+}$ does not increase $C$. The auxiliary function of $C$ is constructed by decomposing $C$ into the sum of an upper convex part and a concave part and bounding each part separately, where $\tilde{\mathbf{W}}$ denotes the current estimate of $\mathbf{W}$.
Iterative optimization of the auxiliary function can thus be used to optimize $C$. Specifically, the update equation for $\mathbf{W}$ is obtained by setting the gradient of the auxiliary function to zero:

$$w_{f,k}^{(i+1)} = \max\!\left( w_{f,k}^{(i)}\, \rho_{f,k},\ \varepsilon \right),$$

where "$(i)$" represents the $i$-th iteration, $\rho_{f,k}$ is a multiplicative factor determined by the auxiliary function, $\max(\cdot,\cdot)$ takes the maximum value, and $\varepsilon$ is a small constant that prevents the elements of $\mathbf{W}$ from becoming zero. The same process is conducted for $\mathbf{A}$:

$$a_{k,t}^{(i+1)} = \max\!\left( a_{k,t}^{(i)}\, \rho_{k,t},\ \varepsilon \right).$$
Subsequently, we define a diagonal matrix $\mathbf{Q}$ and a vector $\mathbf{p}$, both built from the current estimates and the smoothness penalty (see [21] for their explicit forms), to obtain the update for $\mathbf{G}$: similarly to the updates above, each row $\mathbf{g}_f$ is solved for from $\mathbf{Q}$ and $\mathbf{p}$, and the rows are combined to obtain the optimal $\mathbf{G}$.

Therefore, we obtain the optimal $\mathbf{W}^{*}$, $\mathbf{A}^{*}$ and $\mathbf{G}^{*}$. A time-varying gain coefficient $g(t,f)$ is then utilized to avoid the error caused by the simple definition $\hat{S} = \mathbf{W}^{*}\mathbf{A}^{*}$, where $\hat{S}(t,f)$ is the estimated spectrogram of the sound source signal. Eventually, the estimated spectrogram of the sound source signal is obtained by:

$$\hat{S}(t,f) = g(t,f)\, Y(t,f).$$
To obtain the estimated spectrum of the dereverberated signal, the phase of the recorded signal is reused as the phase of the dereverberated signal. Thereafter, the spectrum of a dereverberated signal, shown in Figure 4, is taken as an example to show the effect of the dereverberation module; the spectra of the corresponding source signal and recorded signal are shown in Figure 2a,b, respectively. It can be observed that, compared with the recorded signal, the harmonic structure in the TF spectrum of the dereverberated signal is clearer and the energy between adjacent harmonics is obviously reduced. Moreover, the spectrum of the dereverberated signal is more similar to that of the sound source signal.
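The full optimization in [21] interleaves penalized auxiliary-function updates for $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$. As a simplified sketch of the core machinery only, the following implements the classical multiplicative updates for unpenalized KL-NMF ($\mathbf{V} \approx \mathbf{W}\mathbf{A}$); it omits the RIR coupling and both penalty terms, so it is an illustration rather than the paper's algorithm.

```python
import random

def kl_nmf(V, K, iters=300, eps=1e-9):
    """Plain multiplicative-update NMF under the generalized KL
    divergence (Lee-Seung rules) for a nonnegative F x T matrix V."""
    random.seed(1)
    F, T = len(V), len(V[0])
    W = [[random.uniform(0.1, 1.0) for _ in range(K)] for _ in range(F)]
    A = [[random.uniform(0.1, 1.0) for _ in range(T)] for _ in range(K)]

    def ratio():
        WA = [[sum(W[f][k] * A[k][t] for k in range(K)) + eps
               for t in range(T)] for f in range(F)]
        return [[V[f][t] / WA[f][t] for t in range(T)] for f in range(F)]

    for _ in range(iters):
        R = ratio()
        # W update: W_fk <- W_fk * sum_t(R_ft A_kt) / sum_t(A_kt)
        for f in range(F):
            for k in range(K):
                num = sum(R[f][t] * A[k][t] for t in range(T))
                den = sum(A[k][t] for t in range(T)) + eps
                W[f][k] = max(W[f][k] * num / den, eps)
        R = ratio()
        # A update: A_kt <- A_kt * sum_f(W_fk R_ft) / sum_f(W_fk)
        for k in range(K):
            for t in range(T):
                num = sum(W[f][k] * R[f][t] for f in range(F))
                den = sum(W[f][k] for f in range(F)) + eps
                A[k][t] = max(A[k][t] * num / den, eps)
    return W, A
```

The `max(..., eps)` floor mirrors the role of the small constant $\varepsilon$ in the updates of Section 2.3, keeping all factor entries strictly positive so the multiplicative rules remain well defined.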
3. Multiple Sound Source Separation Framework
The multiple-sound-source separation framework is based on the sparsity of the speech signal. Previous studies have shown the existence of sparse and non-sparse component points of a speech signal in the TF domain [29,30]. Specifically, the sparse component points are the TF points dominated solely by the direct component of one sound source. In contrast, the non-sparse component points are the remaining TF points, which are composed of the components of multiple sound sources and/or the component of room reflection.
Figure 5 is the schematic diagram of the sparse and non-sparse component points, where each rectangular box represents a TF point and each color represents a different sound source. As shown in Figure 5, the TF points $(t_1,f_1)$, $(t_2,f_2)$ and $(t_3,f_3)$ are dominated by the direct components of sound sources $s_1$, $s_2$ and $s_3$, respectively. In contrast, the TF points $(t_4,f_4)$, $(t_5,f_5)$ and $(t_6,f_6)$ are non-sparse component points, where no single component is dominant.
Subsequently, different separation methods for sparse and non-sparse component points are designed, and a multi-source separation method combining sparse and non-sparse component point recovery is proposed in this paper. In view of the relationship between the DOA estimates of the sparse component points and the real DOAs of the sound sources, a separation method based on the distribution of DOA estimates is also proposed. Considering that more than one sound source is active simultaneously at a non-sparse component point, a deep neural network is adopted to estimate the ideal ratio mask (IRM), which is used for non-sparse component point recovery.
3.1. Sparse Component Point Recovery
In this paper, the DOAs of the sound sources are used to determine the sparse component points and to obtain the separated components from the mixed signals. To obtain the DOA estimates of the real sound sources, we explore the distribution of the DOA estimates of all TF points. For the mixed signals recorded by the sound field microphone, the DOA estimates of all TF points can be directly calculated from the B-format recordings [31,32]. The DOA estimation method proposed in [22] is employed to obtain the DOA estimates of the real sound sources. Specifically, the single-source TF points in the mixed signal are detected by a diffuseness measure, and the active sound intensity obtained from the sound field microphone recordings is used to estimate the DOA of each single-source TF point. The number of sound sources and the multi-source DOA estimates are then obtained by peak searching on the statistical histogram formed by the DOA estimates of the single-source TF points. Theoretically, if a TF point is dominated by the direct component of the $n$-th sound source, the DOA estimate of this TF point should be close to that of the $n$-th source; the TF point whose DOA estimate is closest to the DOA of a real sound source can therefore be considered a sparse component point. In practice, to achieve a satisfactory tolerance, the restriction on the DOA estimates of TF points is relaxed: a TF point whose DOA estimate lies within a tolerance $\Delta\theta$ of the DOA estimate of source $n$ is a sparse component point, and the remaining TF points are detected as non-sparse component points. The sets of sparse and non-sparse component points are therefore defined as follows:

$$\Omega_{\mathrm{S}} = \Big\{ (t,f) : \min_{n} \big|\theta(t,f) - \theta_n\big| \le \Delta\theta \Big\}, \qquad \Omega_{\mathrm{NS}} = \Big\{ (t,f) : \min_{n} \big|\theta(t,f) - \theta_n\big| > \Delta\theta \Big\},$$

where $\theta_n$ and $\theta(t,f)$ denote the DOA estimates of the $n$-th real sound source and of the TF point $(t,f)$, respectively, and $\Omega_{\mathrm{S}}$ and $\Omega_{\mathrm{NS}}$ represent the sets of sparse and non-sparse component points. For convenience, the indices $t$ and $f$ are omitted in $\theta(t,f)$ in the following.
For example, Roomsim was used to obtain the recorded signals: a sound field microphone recorded a mixed signal with three sound sources in a reverberant room. Figure 6a is the normalized statistical histogram of the DOA estimates of all TF points obtained by the DOA estimation algorithm proposed in [22]; the horizontal axis is the DOA estimate, and the vertical axis is the normalized statistical amplitude. There are three peaks, delineated by the black stems, which correspond to the DOAs of the three real sound sources. From Figure 6a, it is clear that the distribution of DOA estimates is concentrated around the real DOAs of the sources. If the DOA estimate of a TF point is closest to the real DOA of a sound source, the TF point can be considered to be dominated by this source and is determined to be a sparse component point. Figure 6b shows the distribution of the DOA estimates of the sparse component points obtained with the tolerance $\Delta\theta$.
According to the above analysis, the separated component at a sparse component point belongs to its dominant sound source. Hence, sparse component point recovery can be achieved by determining the dominant sound source among the real sound sources. The difference $\Delta\theta_n$ between the DOA estimate of a sparse component point and $\theta_n$ is calculated to determine the dominant sound source at that point:

$$\Delta\theta_n = |\theta - \theta_n|,$$

where $\theta$ is the DOA estimate of the sparse component point. The dominant sound source of the sparse component point is the one with the minimum difference:

$$n^{*} = \arg\min_{n} \Delta\theta_n.$$

Hence, the separated component at the sparse component point is assigned to the dominant sound source.
3.2. Non-Sparse Component Point Recovery
In Section 3.1, we obtained the separated components from the sparse component points. However, if we only perform sparse component point recovery and ignore the non-sparse component points, the separated signals lose many TF components. In particular, the number of non-sparse component points in the recorded signal grows as the number of sound sources increases; with sparse component point recovery alone, it is difficult to achieve good perceptual quality in the separated signals. Hence, non-sparse component point recovery is very important for improving the quality of the separated speech, and this subsection introduces the recovery method for the non-sparse component points.

The non-sparse component points consist of superposed components of multiple sound sources and/or reflection components. Since the reflection components in the recorded signal have been mostly removed by the method introduced in Section 2, the non-sparse component points are considered here to be mainly composed of the components of multiple sound sources. Therefore, the non-sparse component point recovery problem reduces to calculating the proportion of each sound source at each TF point, which can be expressed by an ideal ratio mask (IRM).
In recent years, deep neural networks (DNNs) have achieved excellent results in many fields and are well suited to regression problems, and calculating the IRM for each TF point can be regarded as a regression problem. Therefore, a DNN is employed to predict the IRM and thus achieve non-sparse component point recovery. The magnitudes of the TF-domain coefficients of one frame of the mixed signal are input to the DNN. The training target of the network is the IRM of each sound source, which can be formulated as follows:

$$\mathrm{IRM}_n(t,f) = \frac{|X_n(t,f)|}{\sum_{j=1}^{N} |X_j(t,f)|},$$

where $X_n(t,f)$ is the TF-domain coefficient of the $n$-th sound source, "$|\cdot|$" represents the magnitude of a complex number, and $N$ is the number of sound sources.
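Under the magnitude-ratio form given above, the IRM training targets can be computed from oracle per-source magnitudes. This is a Python sketch with illustrative names; the uniform tie value assigned at silent TF points is an assumption, not taken from the paper.

```python
def irm_masks(mags):
    """Ideal ratio masks from per-source TF magnitudes: mags is a list of
    N spectrograms (F x T nested lists), one per source; each mask gives
    that source's share of the total magnitude at every TF point."""
    N = len(mags)
    F, T = len(mags[0]), len(mags[0][0])
    masks = []
    for n in range(N):
        mask = [[0.0] * T for _ in range(F)]
        for f in range(F):
            for t in range(T):
                total = sum(mags[j][f][t] for j in range(N))
                # assumption: split silent points evenly among sources
                mask[f][t] = mags[n][f][t] / total if total > 0 else 1.0 / N
        masks.append(mask)
    return masks
```

By construction the masks of all sources sum to one at every TF point, which is the property that lets the predicted mask be interpreted as the proportion of each source there.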
The neural network structure is shown in Figure 7. The network consists of two fully connected layers and two convolutional layers. The first fully connected layer and the two convolutional layers are each followed by a ReLU activation function, a batch-normalization layer, and a dropout layer; the second fully connected layer is followed by a sigmoid layer. To train the network effectively, the training parameters of the DNN must be set appropriately; the parameter configuration is given in Table 1.
In addition, we employ the Adam optimizer and train the network under the minimum root mean square error (RMSE) criterion. The RMSE cost function $C_{\mathrm{RMSE}}$ can be expressed as:

$$C_{\mathrm{RMSE}} = \sqrt{ \frac{1}{F} \sum_{f=1}^{F} \Big( \mathcal{F}\big(\mathbf{y}; \Lambda\big)_f - \mathrm{IRM}_n(t,f) \Big)^2 },$$

where $F$ is the number of frequency channels, $\Lambda$ denotes the parameters of the DNN, $\mathcal{F}(\cdot)$ represents the prediction operation of the DNN, and $\mathbf{y}$ is the input feature vector (i.e., the magnitudes of the TF representation of all TF points in one frame). In the test phase, the trained network minimizing the RMSE is used to predict the IRM, namely, the proportion of the energy of each sound source at each TF point.
3.3. Separated Signal Matching and Post-Processing
In Section 3.1 and Section 3.2, the separated components from the sparse and non-sparse component points are obtained, respectively. However, we need to determine whether separated components from the two kinds of points come from the same sound source. For sparse component point recovery, the DOAs of the sound sources indicate which source each separated sparse component belongs to. For non-sparse component point recovery, one of the $N$ sound sources is selected in advance as the target source; then, according to the IRM definition in Section 3.2, the DNN predicts the IRM for the target source, yielding its separated components from the non-sparse component points. In the case of multiple (at least three) sound sources, we cannot predict the IRMs of all the sources simultaneously; hence, we repeat the above process $N$ times and obtain $N$ DNN models that estimate the IRMs of the $N$ sources, respectively. The separated components of all sound sources from the non-sparse component points are then obtained. However, the target sound source is set randomly, not according to the DOA of the source. Therefore, the separated components of each sound source from the sparse component points and from the non-sparse component points need to be matched, so that components coming from the same sound source are combined.
Considering that temporal correlation exists between speech signals from the same sound source, the correlation coefficient can be used to determine whether the separated components from the sparse and non-sparse component points match. The sparse component separated signal of one sound source consists of all separated components of that source from the sparse component points; likewise, the non-sparse component separated signal of one sound source consists of all separated components of that source from the non-sparse component points. The correlation coefficient between the different separated signals is defined by the following formula:

$$\rho_{n,m} = \frac{ \mathrm{cov}\big( \hat{s}^{\mathrm{sp}}_n,\ \hat{s}^{\mathrm{ns}}_m \big) }{ \sqrt{ \mathrm{var}\big( \hat{s}^{\mathrm{sp}}_n \big)\ \mathrm{var}\big( \hat{s}^{\mathrm{ns}}_m \big) } },$$

where $\hat{s}^{\mathrm{sp}}_n$ represents the sparse component separated signal of the $n$-th source, and $\hat{s}^{\mathrm{ns}}_m$ represents the non-sparse component separated signal of the $m$-th source. Here, $\rho_{n,m}$ is the correlation coefficient between the two, and $\mathrm{cov}(\cdot)$ and $\mathrm{var}(\cdot)$ are the covariance and variance, respectively.

Therefore, if a sparse component separated signal comes from the $n$-th sound source, the matching non-sparse component separated signal is the one with the maximum correlation:

$$m^{*} = \arg\max_{m} \rho_{n,m}.$$
Eventually, we obtain the $N$ separated signals $\hat{s}_1, \ldots, \hat{s}_N$.
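The correlation-based matching can be sketched directly. This is illustrative Python with hypothetical names: each sparse-component signal is paired with the non-sparse-component signal that maximizes the correlation coefficient.

```python
def pearson(a, b):
    """Correlation coefficient between two equal-length signals."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def match_components(sparse_sigs, non_sparse_sigs):
    """For each sparse-component separated signal, pick the index of the
    non-sparse-component signal with the highest correlation
    (assumed to come from the same source)."""
    matches = []
    for s in sparse_sigs:
        rhos = [pearson(s, ns) for ns in non_sparse_sigs]
        matches.append(max(range(len(rhos)), key=rhos.__getitem__))
    return matches
```

The matched pairs are then merged per source, after which the smoothing post-processing mentioned above is applied to the combined signals.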
4. Performance Evaluation
To evaluate the performance of the proposed separation method, a set of objective and subjective experiments is presented.
4.1. Experimental Conditions
First, the two-source scenario is taken as an example to describe the preparation of the data set for training the DNN. From the Nippon Telegraph and Telephone (NTT) database, 200 speech segments sampled at 16 kHz, including 100 male and 100 female speaker segments, were selected as source signals. Two speech segments were randomly selected from these 200 segments as the two sound source signals $s_1(n)$ and $s_2(n)$. The Roomsim software was then used to simulate the two reverberant recordings [26], and the reverberant mixed signal $x(n)$ was obtained by linear addition of the two recorded signals in the time domain. Thereafter, the mixed signal was dereverberated using the method described in Section 2. Hence, we obtained one group of training data (i.e., $x(n)$, $s_1(n)$ and $s_2(n)$). The same procedure was used to generate 400 groups of data for the training set; another 50 and 40 groups constituted the validation set and the test set, respectively. The code for the proposed methods was written in MATLAB (R2019b). The experiments were implemented on a desktop with an Intel i7 CPU at 2.9 GHz and 16 GB of memory without parallel processing; an NVIDIA GeForce RTX 2060 GPU was used in the training and testing phases.
Subsequently, the mixed speech signals for multiple-sound-source separation were generated as follows. Speech segments from the NTT database, sampled at 16 kHz and covering eight female and eight male speakers, were selected as source signals. The Roomsim software was used to simulate the acoustic rooms, with a sound speed of 340 m/s. The sound field microphone was located at the center of each simulated room, and the distance between the microphone and each sound source was 1 m.
Performance evaluation consisted of objective and subjective evaluations. For the objective evaluation, the perceptual evaluation of speech quality (PESQ) [33] and the short-time objective intelligibility (STOI) [34] were used to measure the perceptual quality and intelligibility of the separated signals, respectively. The PESQ score lies between -0.5 and 4.5; a higher PESQ score means higher perceptual quality, and a higher STOI score means higher intelligibility. The signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) were used to measure the distortion and interference levels of the separated signals; higher SDR and SIR scores mean lower distortion and interference, respectively. For the subjective evaluation, the multiple stimuli with hidden reference and anchor (MUSHRA) method was employed to further evaluate the quality of the separated speech [35]; a higher MUSHRA score means better listening quality. The reference methods were the separation algorithm based on independent component analysis (ICA) [10], the separation algorithm based on expectation maximization (EM) [36], the joint sparse and non-sparse component separation method (JSNCS) [14], and the separation method based on linearly constrained minimum variance beamforming (LCMV) [9].
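For orientation, the simplest SDR-style measure can be sketched as follows (Python, illustrative). The full BSS-eval SDR/SIR decompose the estimate into target, interference, and artifact terms via projections; that decomposition is omitted here, so this is only the basic distortion-ratio variant.

```python
import math

def sdr_db(reference, estimate):
    """Simple signal-to-distortion ratio in dB: energy of the reference
    over the energy of the residual (estimate minus reference)."""
    sig = sum(r * r for r in reference)
    err = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(sig / err) if err > 0 else float("inf")
```

The logarithmic scale explains the reading rule used above: each 10 dB of SDR corresponds to a tenfold drop in residual energy relative to the reference.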
4.2. Objective Evaluation
Five objective evaluation tests were used to analyze the performance of the proposed separation method. The first test was performed in an ideal acoustic room, and dereverberation was not conducted. First, only the DOA estimates of the sound sources were used to obtain the separated signal from the sparse component points, ignoring the non-sparse component points; the DNN alone was used to obtain the separated signal from the non-sparse component points; and the multiple-sound-source separation method combining sparse and non-sparse component point recovery was used to obtain the separated signal from all TF points. We then compared the average PESQ of these separated signals, presented with 95% confidence intervals. Conditions "Sparse" and "Non-sparse" denote the signals separated from only the sparse and only the non-sparse component points, respectively; condition "Joint-proposed" denotes the signals separated from all TF points (sparse and non-sparse component points). Tests were performed with three and four simultaneous sources. Moreover, the original sound source signal serves as the reference signal.
The average PESQ results in
Figure 8 show that the quality of the signal separated by the proposed method combining sparse and non-sparse component point recovery is higher than that of the separated signal from only sparse or non-sparse component points, which confirms the performance advantage of our proposed method.
The second test was conducted in a reverberant room with
. The aim of this test was to evaluate the performance of the proposed method with or without dereverberation. The average PESQ scores of the signals separated by the above two methods are shown in
Figure 9 with 95% confidence intervals. When the source number is two and three, source separation is
and
, respectively. Condition “With Dereverberation” is the proposed method. Condition “Without Dereverberation” means that we directly separated the reverberant mixed signal without dereverberation.
In theory, the dereverberation algorithm can maintain the components which benefit source separation (i.e., the direct components of the sound sources) and reduce the damage of reverberation to separation performance. As illustrated in
Figure 9, the proposed method with dereverberation achieves better performance than the method without it. In the case of N = 3, the improvement in average PESQ is more significant, reaching 0.14. This demonstrates that the NMF-based dereverberation method helps improve the perceived quality of the separated signals.
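As a generic reference point for the factorization at the heart of such dereverberation (not the paper's exact spectral model), the classic Euclidean multiplicative updates of NMF can be sketched as:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=300, eps=1e-9):
    # Factorize a nonnegative matrix V (F x T) as V ~= W H using the
    # standard Euclidean multiplicative updates (Lee-Seung style).
    random.seed(0)
    F, T = len(V), len(V[0])
    W = [[random.random() + eps for _ in range(rank)] for _ in range(F)]
    H = [[random.random() + eps for _ in range(T)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H
```

In NMF-based dereverberation, factors of this kind are typically used to model the reverberant magnitude spectrogram so that the direct-path component can be retained and the reverberant tail suppressed.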
In general, the above tests verify the validity and rationality of the proposed separation method. Next, we present a series of experiments to compare the performance of our proposed method with the reference methods.
The third test was performed in two reverberant rooms (i.e., Room1 with
and Room2 with
) to analyze the intelligibility of the separated signals from different separation methods. The test results are shown in
Figure 10. Condition “Proposed” is the proposed separation method; condition “JSNCS” is the reference method proposed in [
14]. Condition “LCMV” represents the separation method proposed in [
9].
It can be found that the intelligibility of the separated signals obtained by the proposed method is higher than that of the reference methods, especially in highly reverberant environments. This indicates that our proposed method achieves a satisfying quality of the separated signal. Compared to the reference methods, our proposed method utilizes a dereverberation model and a DNN for non-sparse component point recovery, which benefits the separation performance in reverberant conditions.
The fourth test was conducted in Room1 with
and Room2 with
. Source separation was
and
for the case of
and
, respectively. Statistical results with 95% confidence intervals are shown in
Figure 11. Condition “ICA” and “EM” mean the two reference methods proposed in [
10] and [
35], respectively. Condition “LCMV” is the reference method in [
9].
It is obvious in
Figure 11 that the proposed method achieves the best PESQ scores in all test conditions, which shows that it performs better than the reference methods.
To compare the average SDR and SIR of the signals separated by the proposed method and the reference methods, the fifth test was conducted in an anechoic room, Room1, and Room2. The sound source number was three, and the source separation was
. The BSS EVAL toolbox was employed in this test to measure the SDR and the SIR of the separated signals obtained by different methods [
26]. Statistical results of average SDR and SIR are shown in
Figure 12. A high SDR or SIR score indicates a high quality of the separated signal. Condition “Mix” refers to the mixed signal (i.e., the W-channel signal recorded by the sound-field microphone in this test).
It can be observed from
Figure 12 that both the average SDR and the average SIR of the signals separated by the proposed approach are the highest, indicating the lowest distortion and interference levels in the separated signals. Consequently, the proposed separation approach maintains satisfying performance and is robust in different environments.
4.3. Subjective Evaluation
The MUSHRA listening test was conducted to evaluate the perceptual quality of the separated signals [
35]. A separated signal with higher perceptual quality earns a higher score. There are two MUSHRA listening tests in this subsection. In both listening tests, the sound source number was set as
, and the source separation was
. There were 15 listeners participating in the listening tests. The first test was conducted under the ideal acoustic case to compare the proposed separation approach with the reference methods. Condition “Proposed” represents the proposed approach. Conditions “JSNCS” and “EM” denote the two reference methods, respectively. The sound source signal serves as the hidden reference, noted by condition “Hidden Ref”. Condition “Anchor” refers to the original mixed signal (i.e., the W-channel signal recorded by the sound-field microphone). Results of the first MUSHRA test are presented in
Figure 13 with 95% confidence intervals.
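The reported 95% confidence intervals can be obtained from the per-listener scores as follows. This sketch uses the normal approximation (z = 1.96); with only 15 listeners a t-based interval would be slightly wider, so treat it as an illustration of the computation rather than the paper's exact procedure, and the example scores are hypothetical:

```python
import math
import statistics

def mean_ci95(scores):
    # Mean and 95% confidence half-width via the normal approximation.
    n = len(scores)
    m = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return m, half

# e.g., hypothetical MUSHRA scores from five listeners for one condition:
m, half = mean_ci95([80, 75, 85, 78, 82])
```

Each bar in such a figure would then be drawn at the mean with error bars of plus or minus the half-width.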
It can be observed from
Figure 13 that the proposed method achieves a higher score than the reference approaches in the ideal acoustic environment. Specifically, the MUSHRA scores of the proposed approach reach or approach 80, indicating “Excellent” or “Good” subjective perceptual quality, which shows that satisfying perceptual quality of the separated signals can be obtained by our proposed method. Moreover, the MUSHRA scores decrease as the sound source number increases.
The second test was conducted under the reverberant case where
is 300 ms. MUSHRA test results are shown in
Figure 14 with 95% confidence intervals.
It can be seen from
Figure 14 that our proposed multi-source separation method greatly improves the quality of the separated signal. Compared with the other reference methods, the proposed method attains a higher MUSHRA test score in the reverberant environment. Moreover, the gaps in the test scores in the case of
are relatively small. Yet, the complexity of the EM method is high. In the case of
, the advantage of the proposed algorithm is more evident.
From the above objective and subjective evaluations, it is demonstrated that the proposed approach maintains a preferable quality of separated signals and can adapt to various acoustic environments.