1. Introduction
Sound source separation is a long-standing research field in audio signal processing, with studies dating back to the middle of the twentieth century. It plays an important role in many areas, such as automatic speech recognition (ASR) [1,2] and speech coding [3].
In the early years, researchers mainly focused on separation methods based on statistical models. The hidden Markov model (HMM) can describe the short-time stationary characteristics of a speech signal and is therefore widely used in speech separation [4]. The relationship between the mixed signal and the source signals is modeled by multiple independent HMMs, and the model parameters are then estimated to recover the separated signals [5]. Subsequently, a separation method based on computational auditory scene analysis (CASA) was proposed in [6]. First, auditory peripheral analysis is performed on the mixed signal to obtain acoustic features such as the amplitude spectrum; these features are segmented into intermediate representations, and the source signals are then separated by acoustic recombination. However, it is difficult for an auditory scene model to fully describe a real acoustic scene, so CASA-based separation is limited in complex environments. Later, separation methods based on beamforming, divided into fixed beamforming [7] and adaptive beamforming [8], became popular. The main idea of these methods is to obtain a separation matrix. Recently, an improved beamforming method has been proposed in which "single source zone" detection is introduced into linearly constrained minimum variance beamforming to achieve multi-source separation [9]. However, its performance is poor in highly reverberant environments. Researchers have also paid attention to separation based on independent component analysis (ICA), which assumes that the source signals are statistically independent [10]. The observed mixed signal is decomposed into a linear sum of statistically independent components through a series of linear transformations, yielding the separated signals. ICA-based separation is suitable for low-reverberation conditions. Nevertheless, when the number of sound sources exceeds the number of microphones, it is difficult to obtain separated signals of high quality.
In the last few years, approaches based on the sparsity of speech signals have been proposed. The W-disjoint orthogonality (W-DO) hypothesis was first proposed in [11]; it states that when two or more sound sources are active, at most one source dominates any given time-frequency (TF) point. The core of W-DO-based source separation is to find the TF points dominated by only one source, which can be achieved in different ways. For example, the direction of arrival (DOA) of a sound source provides guidance for finding these TF points [12]. Compared with earlier methods, separation based on the W-DO hypothesis has lower computational complexity [13]. However, as the number of sound sources increases, the W-DO property weakens and the performance of W-DO-based approaches decreases. Improved methods have therefore been proposed that handle both the TF components dominated by one source (i.e., sparse components) and the remaining TF components (i.e., non-sparse components). The separation method proposed in [14] uses a clustering algorithm with a dynamic threshold for sparse component separation; the non-sparse components are then recovered using the "local-zone stationarity" assumption.
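The W-DO idea can be made concrete with a toy sketch: given oracle magnitude spectrograms of two sources, each TF point is assigned to the dominant source with a binary mask. This is illustrative Python with hypothetical names; in a real system the masks must be estimated (e.g., from DOA cues), since oracle magnitudes are unavailable.

```python
def wdo_masks(mag1, mag2):
    """Binary W-DO masks: assign each TF point to whichever source's
    magnitude dominates there (at most one source per TF point)."""
    mask1 = [[1 if a >= b else 0 for a, b in zip(r1, r2)]
             for r1, r2 in zip(mag1, mag2)]
    mask2 = [[1 - m for m in row] for row in mask1]
    return mask1, mask2

def apply_mask(mix_mag, mask):
    """Keep only the TF points assigned to one source."""
    return [[m * v for m, v in zip(mr, vr)]
            for mr, vr in zip(mask, mix_mag)]
```

Note that at a kept TF point the mixture still contains the weaker source's contribution; W-DO only guarantees that this residual interference is small when the sources rarely overlap, which is exactly the property that weakens as the source count grows.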
In recent years, deep neural networks (DNNs) have been introduced into sound source separation [15,16] and have dramatically improved separation performance [17]. The TF mask, which represents the energy ratio of each sound source at a TF point, is generally taken as the training target of the network. In a DNN-based separation method, an expectation maximization (EM) algorithm and a multichannel Wiener filter can also be combined to model the source spectrum and spatial information [18], which greatly improves the quality of the separated speech. In addition, methods combining more than one DNN have been investigated. A DNN-based method combining a dereverberation module and a separation module for reverberant single-channel speaker separation was provided in [19]. A two-stage network combining an RNN and a DNN was proposed in [20] to separate multiple sound sources in reverberant environments. Compared with previous DNN-based methods, these separation modules achieve better performance in reverberant environments.
Nevertheless, sound source separation in reverberant environments remains a difficult problem that needs further investigation. In a reverberant environment, the recorded signal contains the direct component of the sound source and a reverberant component produced by reflections from the room and obstacles. In the TF domain, each TF point can be considered to be composed of a direct component and a reverberant component. The reverberant component changes the spectral structure of the sound source signal and reduces the quality of the speech, which in turn lowers the quality of the separated source signals.
In the proposed method, the dereverberation module proposed in [21] is introduced as a pre-processing step. The dereverberation module aims to extract the clean mixed signal from the reverberant mixed signal recorded by a sound field microphone. The multi-source separation framework for the dereverberated mixed signal is divided into two parts: sparse component point recovery and non-sparse component point recovery. In previous work, we proposed a multi-source localization algorithm which provides the information required for separation, such as the number of sound sources and their DOA estimates [22]. We first determine the sparse and non-sparse component points by comparing the DOA estimate of each TF point with those of the real sound sources. The sparse component points are dominated solely by the direct component of one sound source. In contrast, the non-sparse component points are composed of the components of multiple sound sources and/or the component of room reflection. We then adopt different separation methods for the two kinds of points. Specifically, we first propose a sparse component point recovery method based on a DOA cue, where the DOA cue refers to the DOA estimates of the sound sources. Subsequently, the TF mask of each non-sparse component point is estimated through regression training of a DNN, from which the separated components of the non-sparse component points are obtained. Finally, post-processing, including separated signal matching and smoothing, is applied to obtain the separated signal of each sound source. The block diagram of the proposed multi-source separation algorithm is shown in Figure 1, where "Format Transformation" means that the A-format signal recorded by a sound field microphone is transformed into the B-format signal [23], and "W-channel" means the W-channel signal in B-format. In [24], we discussed a preliminary idea for multi-source separation. Different from [24], a DOA cue is used here to realize sparse component point recovery, and we focus on the effect of the dereverberation module on separation performance. In addition, extensive experiments have been performed to verify the effectiveness of the proposed method.
The remainder of the paper is organized as follows: The dereverberation model in the proposed multi-source separation method is introduced in Section 2. Section 3 presents the multiple-sound-source separation method. The performance evaluation for the proposed method is shown in Section 4, and the conclusion is drawn in Section 5.
2. Mixed Signal Dereverberation
A reverberant component is always present in a speech signal recorded in a real environment. This component degrades signal quality, makes it difficult to separate the original source signals from the reverberant mixture, and thus leads to poor performance of separation algorithms. Therefore, it is very important to suppress the effect of reverberation before sound source separation.
In this paper, a multi-source separation method for reverberant environments is proposed, which consists of the dereverberation module proposed in [21] and a separation module. The dereverberation module removes the reverberant component from the recorded signal while maintaining the direct component of each source. It is implemented in the TF domain and consists of two stages: the first stage learns the spectrum structure and obtains the dictionary matrix; in the second stage, a non-negative matrix factorization (NMF) model is used to represent the signal spectrum and obtain the optimal estimate of the sound source signal.
2.1. Recorded Signal Model
In the time domain, the signal recorded by the microphone can be modeled as the source signal convolved with the room impulse response (RIR). However, for speech signals, we often work in the TF domain rather than in the time domain [25]. The Short-Time Fourier Transform (STFT) is used to obtain the time-frequency representation of the recorded signal. Here, $S(t,f)$, $X(t,f)$ and $H(t,f)$ are the TF-domain coefficients of the clean signal $s(n)$, the reverberant signal $x(n)$, and the RIR $h(n)$, respectively. So, $X(t,f)$ can be written as [21]:

$$X(t,f) = \sum_{d=0}^{D} H(d,f)\, S(t-d,f),$$

where $D$ is the reverberation order, which reflects the reverberation level. The expected value of $|X(t,f)|^2$ is denoted by $\bar{X}(t,f)$, which can be expressed as follows:

$$\bar{X}(t,f) = \sum_{d=0}^{D} |H(d,f)|^2\, |S(t-d,f)|^2,$$

where $|S(t,f)|^2$ and $|H(d,f)|^2$ are the squared magnitudes of $S(t,f)$ and $H(d,f)$, respectively; $d = 0, 1, \ldots, D$; and $t$ and $f$ are the time index and the frequency index, respectively.
Subsequently, we build a spectrogram model of the sound source signal using the NMF method in order to recover the source signal from the recorded signal. We assume that the non-negative spectrogram of the sound source signal, $\mathbf{S} = [\,|S(t,f)|^2\,] \in \mathbb{R}_{+}^{F \times T}$, can be represented by two non-negative matrices $\mathbf{W} \in \mathbb{R}_{+}^{F \times K}$ and $\mathbf{A} \in \mathbb{R}_{+}^{K \times T}$, where $K \le \min(F,T)$. Then $\mathbf{S}$ can be written as follows by an NMF model:

$$\mathbf{S} \approx \mathbf{W}\mathbf{A},$$

where "$\approx$" represents approximate equality, and $\mathbf{W}$ and $\mathbf{A}$ are the dictionary matrix and the activation matrix, respectively. For convenience, we use $G(d,f)$ to denote $|H(d,f)|^2$ in the following. Thereafter, the expected-value model above can be rewritten as the following model of the recorded signal:

$$\bar{X}(t,f) = \sum_{d=0}^{D} G(d,f)\, [\mathbf{W}\mathbf{A}]_{f,\,t-d}.$$
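The two model ingredients above can be sketched numerically. This is an illustrative Python toy (hypothetical variable names and dimensions, not the paper's implementation): the NMF product $\mathbf{S} \approx \mathbf{W}\mathbf{A}$ and the band-wise convolutive combination with the squared RIR magnitudes $G(d,f)$.

```python
def nmf_product(W, A):
    """S = W A : an (F x K) dictionary times a (K x T) activation matrix
    gives a nonnegative (F x T) spectrogram."""
    F, K, T = len(W), len(A), len(A[0])
    return [[sum(W[f][k] * A[k][t] for k in range(K)) for t in range(T)]
            for f in range(F)]

def recorded_model(G, S):
    """Band-wise convolutive model: Xbar(t,f) = sum_d G(f,d) * S(f, t-d),
    i.e., each frequency band of the source spectrogram is smeared over
    time by the squared RIR magnitudes of that band."""
    F, T = len(S), len(S[0])
    D = len(G[0]) - 1  # reverberation order
    X = [[0.0] * T for _ in range(F)]
    for f in range(F):
        for t in range(T):
            for d in range(D + 1):
                if t - d >= 0:
                    X[f][t] += G[f][d] * S[f][t - d]
    return X
```

The nested structure makes the modeling assumption explicit: reverberation couples a TF point only to earlier frames in the *same* frequency band, which is what makes the per-band NMF treatment tractable.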
2.2. Cost Function
We obtained the model $\bar{X}(t,f)$ of the recorded signal in Section 2.1; in this subsection, $Y(t,f)$ denotes the spectrogram of the recorded signal with reverberation. Since the NMF model represents an approximate equality between $\mathbf{S}$ and $\mathbf{W}\mathbf{A}$, the model $\bar{X}(t,f)$ is also an approximate representation of the recorded signal $Y(t,f)$. Different measures can be used to quantify the accuracy of this approximation, such as the Euclidean distance. Here, we employ the generalized Kullback-Leibler (KL) divergence to obtain the optimal matrices $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$, so that $\bar{X}(t,f)$ provides an accurate approximation of $Y(t,f)$ [25]. The spectrogram of the sound source signal can then be obtained as the product of the optimal matrices $\mathbf{W}$ and $\mathbf{A}$.

The generalized KL divergence between $Y$ and $\bar{X}$ is defined as:

$$D\big(Y \,\|\, \bar{X}\big) = \sum_{f,t} \left( Y(t,f)\, \log \frac{Y(t,f)}{\bar{X}(t,f)} - Y(t,f) + \bar{X}(t,f) \right),$$

where $Y(t,f)$ is the spectrogram of the recorded signal at the $f$-th frequency point and the $t$-th frame, and $D(Y \,\|\, \bar{X})$ is the value of the divergence.
However, many combinations of the matrices $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$ can achieve a similar divergence. Penalty functions for $\mathbf{A}$ and $\mathbf{G}$ are therefore introduced to narrow the range of admissible solutions and obtain the optimal $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$ more efficiently [21]. Hence, the cost function combining the generalized KL divergence and the penalty functions is defined as follows:

$$C(\mathbf{W}, \mathbf{A}, \mathbf{G}) = D\big(Y \,\|\, \bar{X}\big) + \Phi_{\mathbf{A}}(\mathbf{A}) + \Phi_{\mathbf{G}}(\mathbf{G}),$$

where $\Phi_{\mathbf{A}}(\mathbf{A})$ and $\Phi_{\mathbf{G}}(\mathbf{G})$ are the penalty functions for $\mathbf{A}$ and $\mathbf{G}$, respectively; $\Phi_{\mathbf{A}}$ is determined based on the TF spectrum structure of the recorded signal, whereas $\Phi_{\mathbf{G}}$ is determined based on the logarithmic spectrum structure of the RIR.
First, we determine the penalty function $\Phi_{\mathbf{A}}$ by analyzing the characteristics of the TF spectrum of the recorded signal with reverberation. An example is shown in Figure 2, where Figure 2a is the TF spectrum of a clean sound source signal sampled at 16 kHz. The recorded signal with reverberation was simulated with the Roomsim software [26]; a sound field microphone is located at the center of the simulated room, and the reverberation time ($T_{60}$) is 300 ms. Figure 2b shows the TF spectrum of the recorded signal (i.e., the W-channel signal in B-format) with $T_{60} = 300$ ms. It can be clearly seen from Figure 2 that the harmonic structure in the TF spectrum of the clean signal is distinct, while that of the reverberant recording is blurred and the formant peaks are not clear.
Therefore, a penalty function is imposed on the activation matrix $\mathbf{A}$ to highlight the harmonic structure in the TF spectrum and enhance the sparsity of the representation. Since the $\ell_1$ norm can be used to represent the sparsity of a matrix, we use it to construct the penalty function for $\mathbf{A}$; for a non-negative matrix, the $\ell_1$ norm is simply the sum of all elements. By weighting the elements of $\mathbf{A}$ with a non-negative parameter and summing the weighted elements, the penalty function can be expressed as:

$$\Phi_{\mathbf{A}}(\mathbf{A}) = \lambda \sum_{k,t} a_{k,t},$$

where $\lambda$ is the non-negative parameter for $\Phi_{\mathbf{A}}$ and $a_{k,t}$ is the $(k,t)$-th element of $\mathbf{A}$. The penalty function for $\mathbf{A}$ speeds up the convergence of the cost function and makes the estimated spectrogram closer to the spectrogram of the sound source signal.
Next, the penalty function for $\mathbf{G}$ is determined according to the characteristics of the spectrum of the RIR. The logarithmic spectrum of a simulated RIR presents a smooth attenuation structure, as exhibited in Figure 3, where the energy decays over time. The penalty function for $\mathbf{G}$ is added to promote this attenuation structure [27] and can be expressed as:

$$\Phi_{\mathbf{G}}(\mathbf{G}) = \mu \sum_{f} \big\| \boldsymbol{\Delta}\, \mathbf{g}_f^{\mathsf{T}} \big\|_2^2,$$

where $\mu$ is a non-negative parameter for $\Phi_{\mathbf{G}}$; $\mathbf{g}_f$ is the $f$-th row of $\mathbf{G}$, and $\boldsymbol{\Delta}$ represents a finite difference matrix. In addition, "$(\cdot)^{\mathsf{T}}$" means the transpose of a matrix, and "$\|\cdot\|_2^2$" is the square of the $\ell_2$ norm.
Therefore, the cost function defined above can be rewritten as:

$$C(\mathbf{W}, \mathbf{A}, \mathbf{G}) = D\big(Y \,\|\, \bar{X}\big) + \lambda \sum_{k,t} a_{k,t} + \mu \sum_{f} \big\| \boldsymbol{\Delta}\, \mathbf{g}_f^{\mathsf{T}} \big\|_2^2.$$

In Section 2.3, the optimization of $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$ is introduced to minimize the cost function $C$ and obtain the optimal estimate of the sound source spectrogram.
2.3. Optimization
The optimization process is divided into two parts. The first part initializes the cost function so that it depends only on $\mathbf{W}$, and then obtains the optimal $\mathbf{W}$ through iteration. The second part substitutes the optimal $\mathbf{W}$ into the cost function and obtains the optimal $\mathbf{A}$ and $\mathbf{G}$ by minimizing it.

First, auxiliary functions are defined for $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$, respectively [21]. To initialize the cost function, $\mathbf{A}$ and $\mathbf{G}$ are fixed at their initial values, so that $C$ depends only on $\mathbf{W}$. According to the definition of an auxiliary function [28], for an auxiliary function $C^{+}$ satisfying $C^{+}(\mathbf{W}, \tilde{\mathbf{W}}) \ge C(\mathbf{W})$ and $C^{+}(\mathbf{W}, \mathbf{W}) = C(\mathbf{W})$, iteratively minimizing $C^{+}$ does not increase $C$. The auxiliary function of $C$ is constructed by decomposing $C$ into the sum of an upper convex part and a concave part and bounding each part separately, where $\tilde{\mathbf{W}}$ denotes the current estimate of $\mathbf{W}$.
Iterative optimization of the auxiliary function can thus be used to optimize $C$. Specifically, the update equation for $\mathbf{W}$ is obtained by setting the gradient of the auxiliary function to zero:

$$w_{f,k}^{(i+1)} = \max\!\left( w_{f,k}^{(i)}\, \rho_{f,k},\ \varepsilon \right),$$

where "$(i)$" represents the $i$-th iteration, $\rho_{f,k}$ is a multiplicative factor determined by the auxiliary function, $\max(\cdot,\cdot)$ takes the maximum value, and $\varepsilon$ is a small constant that prevents the elements of $\mathbf{W}$ from becoming zero. The same process is conducted for $\mathbf{A}$:

$$a_{k,t}^{(i+1)} = \max\!\left( a_{k,t}^{(i)}\, \rho_{k,t},\ \varepsilon \right).$$
Subsequently, we define a diagonal matrix $\mathbf{Q}$ and a vector $\mathbf{p}$, both built from the current estimates and the smoothness penalty (see [21] for their explicit forms), to obtain the update for $\mathbf{G}$: similarly to the updates above, each row $\mathbf{g}_f$ is solved for from $\mathbf{Q}$ and $\mathbf{p}$, and the rows are combined to obtain the optimal $\mathbf{G}$.

Therefore, we obtain the optimal $\mathbf{W}^{*}$, $\mathbf{A}^{*}$ and $\mathbf{G}^{*}$. A time-varying gain coefficient $g(t,f)$ is then utilized to avoid the error caused by the simple definition $\hat{S} = \mathbf{W}^{*}\mathbf{A}^{*}$, where $\hat{S}(t,f)$ is the estimated spectrogram of the sound source signal. Eventually, the estimated spectrogram of the sound source signal is obtained by:

$$\hat{S}(t,f) = g(t,f)\, Y(t,f).$$
To obtain the estimated spectrum of the dereverberated signal, the phase of the recorded signal is reused as the phase of the dereverberated signal. Thereafter, the spectrum of a dereverberated signal, shown in Figure 4, is taken as an example to show the effect of the dereverberation module; the spectra of the corresponding source signal and recorded signal are shown in Figure 2a,b, respectively. It can be observed that, compared with the recorded signal, the harmonic structure in the TF spectrum of the dereverberated signal is clearer and the energy between adjacent harmonics is obviously reduced. Moreover, the spectrum of the dereverberated signal is more similar to that of the sound source signal.
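The full optimization in [21] interleaves penalized auxiliary-function updates for $\mathbf{W}$, $\mathbf{A}$ and $\mathbf{G}$. As a simplified sketch of the core machinery only, the following implements the classical multiplicative updates for unpenalized KL-NMF ($\mathbf{V} \approx \mathbf{W}\mathbf{A}$); it omits the RIR coupling and both penalty terms, so it is an illustration rather than the paper's algorithm.

```python
import random

def kl_nmf(V, K, iters=300, eps=1e-9):
    """Plain multiplicative-update NMF under the generalized KL
    divergence (Lee-Seung rules) for a nonnegative F x T matrix V."""
    random.seed(1)
    F, T = len(V), len(V[0])
    W = [[random.uniform(0.1, 1.0) for _ in range(K)] for _ in range(F)]
    A = [[random.uniform(0.1, 1.0) for _ in range(T)] for _ in range(K)]

    def ratio():
        WA = [[sum(W[f][k] * A[k][t] for k in range(K)) + eps
               for t in range(T)] for f in range(F)]
        return [[V[f][t] / WA[f][t] for t in range(T)] for f in range(F)]

    for _ in range(iters):
        R = ratio()
        # W update: W_fk <- W_fk * sum_t(R_ft A_kt) / sum_t(A_kt)
        for f in range(F):
            for k in range(K):
                num = sum(R[f][t] * A[k][t] for t in range(T))
                den = sum(A[k][t] for t in range(T)) + eps
                W[f][k] = max(W[f][k] * num / den, eps)
        R = ratio()
        # A update: A_kt <- A_kt * sum_f(W_fk R_ft) / sum_f(W_fk)
        for k in range(K):
            for t in range(T):
                num = sum(W[f][k] * R[f][t] for f in range(F))
                den = sum(W[f][k] for f in range(F)) + eps
                A[k][t] = max(A[k][t] * num / den, eps)
    return W, A
```

The `max(..., eps)` floor mirrors the role of the small constant $\varepsilon$ in the updates of Section 2.3, keeping all factor entries strictly positive so the multiplicative rules remain well defined.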
3. Multiple Sound Source Separation Framework
The multiple-sound-source separation framework is based on the sparsity of the speech signal. Previous studies have shown the existence of sparse and non-sparse component points of a speech signal in the TF domain [29,30]. Specifically, the sparse component points are the TF points dominated solely by the direct component of one sound source. In contrast, the non-sparse component points are the remaining TF points, which are composed of the components of multiple sound sources and/or the component of room reflection.
Figure 5 is the schematic diagram of the sparse and non-sparse component points, where each rectangular box represents a TF point and each color represents a different sound source. As shown in Figure 5, the TF points $(t_1,f_1)$, $(t_2,f_2)$ and $(t_3,f_3)$ are dominated by the direct components of sound sources $s_1$, $s_2$ and $s_3$, respectively. In contrast, the TF points $(t_4,f_4)$, $(t_5,f_5)$ and $(t_6,f_6)$ are non-sparse component points, where no single component is dominant.
Subsequently, different separation methods for sparse and non-sparse component points are designed, and a multi-source separation method combining sparse and non-sparse component point recovery is proposed in this paper. In view of the relationship between the DOA estimates of the sparse component points and the real DOAs of the sound sources, a separation method based on the distribution of DOA estimates is also proposed. Considering that more than one sound source is active simultaneously at a non-sparse component point, a deep neural network is adopted to estimate the ideal ratio mask (IRM), which is used for non-sparse component point recovery.
3.1. Sparse Component Point Recovery
In this paper, the DOAs of the sound sources are used to determine the sparse component points and to obtain the separated components from the mixed signals. To obtain the DOA estimates of the real sound sources, we explore the distribution of the DOA estimates of all TF points. For the mixed signals recorded by the sound field microphone, the DOA estimates of all TF points can be directly calculated from the B-format recordings [31,32]. The DOA estimation method proposed in [22] is employed to obtain the DOA estimates of the real sound sources. Specifically, the single-source TF points in the mixed signal are detected by a diffuseness measure, and the active sound intensity obtained from the sound field microphone recordings is used to estimate the DOA of each single-source TF point. The number of sound sources and the multi-source DOA estimates are then obtained by peak searching on the statistical histogram formed by the DOA estimates of the single-source TF points. Theoretically, if a TF point is dominated by the direct component of the $n$-th sound source, the DOA estimate of this TF point should be close to that of the $n$-th source; the TF point whose DOA estimate is closest to the DOA of a real sound source can therefore be considered a sparse component point. In practice, to achieve a satisfactory tolerance, the restriction on the DOA estimates of TF points is relaxed: a TF point whose DOA estimate lies within a tolerance $\Delta\theta$ of the DOA estimate of source $n$ is a sparse component point, and the remaining TF points are detected as non-sparse component points. The sets of sparse and non-sparse component points are therefore defined as follows:

$$\Omega_{\mathrm{S}} = \Big\{ (t,f) : \min_{n} \big|\theta(t,f) - \theta_n\big| \le \Delta\theta \Big\}, \qquad \Omega_{\mathrm{NS}} = \Big\{ (t,f) : \min_{n} \big|\theta(t,f) - \theta_n\big| > \Delta\theta \Big\},$$

where $\theta_n$ and $\theta(t,f)$ denote the DOA estimates of the $n$-th real sound source and of the TF point $(t,f)$, respectively, and $\Omega_{\mathrm{S}}$ and $\Omega_{\mathrm{NS}}$ represent the sets of sparse and non-sparse component points. For convenience, the indices $t$ and $f$ are omitted in $\theta(t,f)$ in the following.
For example, Roomsim was used to obtain the recorded signals: a sound field microphone recorded a mixed signal with three sound sources in a reverberant room. Figure 6a is the normalized statistical histogram of the DOA estimates of all TF points obtained by the DOA estimation algorithm proposed in [22]; the horizontal axis is the DOA estimate, and the vertical axis is the normalized statistical amplitude. There are three peaks, delineated by the black stems, which correspond to the DOAs of the three real sound sources. From Figure 6a, it is clear that the distribution of DOA estimates is concentrated around the real DOAs of the sources. If the DOA estimate of a TF point is closest to the real DOA of a sound source, the TF point can be considered to be dominated by this source and is determined to be a sparse component point. Figure 6b shows the distribution of the DOA estimates of the sparse component points obtained with the tolerance $\Delta\theta$.
According to the above analysis, the separated component at a sparse component point belongs to its dominant sound source. Hence, sparse component point recovery can be achieved by determining the dominant sound source among the real sound sources. The difference $\Delta\theta_n$ between the DOA estimate of a sparse component point and $\theta_n$ is calculated to determine the dominant sound source at that point:

$$\Delta\theta_n = |\theta - \theta_n|,$$

where $\theta$ is the DOA estimate of the sparse component point. The dominant sound source of the sparse component point is the one with the minimum difference:

$$n^{*} = \arg\min_{n} \Delta\theta_n.$$

Hence, the separated component at the sparse component point is assigned to the dominant sound source.
3.2. Non-Sparse Component Point Recovery
In Section 3.1, we obtained the separated components from the sparse component points. However, if we only perform sparse component point recovery and ignore the non-sparse component points, the separated signals lose many TF components. In particular, the number of non-sparse component points in the recorded signal grows as the number of sound sources increases; with sparse component point recovery alone, it is difficult to achieve good perceptual quality in the separated signals. Hence, non-sparse component point recovery is very important for improving the quality of the separated speech, and this subsection introduces the recovery method for the non-sparse component points.

The non-sparse component points consist of superposed components of multiple sound sources and/or reflection components. Since the reflection components in the recorded signal have been mostly removed by the method introduced in Section 2, the non-sparse component points are considered here to be mainly composed of the components of multiple sound sources. Therefore, the non-sparse component point recovery problem reduces to calculating the proportion of each sound source at each TF point, which can be expressed by an ideal ratio mask (IRM).
In recent years, deep neural networks (DNNs) have achieved excellent results in many fields and are well suited to regression problems, and calculating the IRM for each TF point can be regarded as a regression problem. Therefore, a DNN is employed to predict the IRM and thus achieve non-sparse component point recovery. The magnitudes of the TF-domain coefficients of one frame of the mixed signal are input to the DNN. The training target of the network is the IRM of each sound source, which can be formulated as follows:

$$\mathrm{IRM}_n(t,f) = \frac{|X_n(t,f)|}{\sum_{j=1}^{N} |X_j(t,f)|},$$

where $X_n(t,f)$ is the TF-domain coefficient of the $n$-th sound source, "$|\cdot|$" represents the magnitude of a complex number, and $N$ is the number of sound sources.
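Under the magnitude-ratio form given above, the IRM training targets can be computed from oracle per-source magnitudes. This is a Python sketch with illustrative names; the uniform tie value assigned at silent TF points is an assumption, not taken from the paper.

```python
def irm_masks(mags):
    """Ideal ratio masks from per-source TF magnitudes: mags is a list of
    N spectrograms (F x T nested lists), one per source; each mask gives
    that source's share of the total magnitude at every TF point."""
    N = len(mags)
    F, T = len(mags[0]), len(mags[0][0])
    masks = []
    for n in range(N):
        mask = [[0.0] * T for _ in range(F)]
        for f in range(F):
            for t in range(T):
                total = sum(mags[j][f][t] for j in range(N))
                # assumption: split silent points evenly among sources
                mask[f][t] = mags[n][f][t] / total if total > 0 else 1.0 / N
        masks.append(mask)
    return masks
```

By construction the masks of all sources sum to one at every TF point, which is the property that lets the predicted mask be interpreted as the proportion of each source there.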
The neural network structure is shown in Figure 7. The network consists of two fully connected layers and two convolutional layers. The first fully connected layer and the two convolutional layers are each followed by a ReLU activation function, a batch-normalization layer, and a dropout layer; the second fully connected layer is followed by a sigmoid layer. To train the network effectively, the training parameters of the DNN must be set appropriately; the parameter configuration is given in Table 1.
In addition, we employ the Adam optimizer and train the network under the minimum root mean square error (RMSE) criterion. The RMSE cost function $C_{\mathrm{RMSE}}$ can be expressed as:

$$C_{\mathrm{RMSE}} = \sqrt{ \frac{1}{F} \sum_{f=1}^{F} \Big( \mathcal{F}\big(\mathbf{y}; \Lambda\big)_f - \mathrm{IRM}_n(t,f) \Big)^2 },$$

where $F$ is the number of frequency channels, $\Lambda$ denotes the parameters of the DNN, $\mathcal{F}(\cdot)$ represents the prediction operation of the DNN, and $\mathbf{y}$ is the input feature vector (i.e., the magnitudes of the TF representation of all TF points in one frame). In the test phase, the trained network minimizing the RMSE is used to predict the IRM, namely, the proportion of the energy of each sound source at each TF point.
3.3. Separated Signal Matching and Post-Processing
In Section 3.1 and Section 3.2, the separated components from the sparse and non-sparse component points are obtained, respectively. However, we need to determine whether separated components from the two kinds of points come from the same sound source. For sparse component point recovery, the DOAs of the sound sources indicate which source each separated sparse component belongs to. For non-sparse component point recovery, one of the $N$ sound sources is selected in advance as the target source; then, according to the IRM definition in Section 3.2, the DNN predicts the IRM for the target source, yielding its separated components from the non-sparse component points. In the case of multiple (at least three) sound sources, we cannot predict the IRMs of all the sources simultaneously; hence, we repeat the above process $N$ times and obtain $N$ DNN models that estimate the IRMs of the $N$ sources, respectively. The separated components of all sound sources from the non-sparse component points are then obtained. However, the target sound source is set randomly, not according to the DOA of the source. Therefore, the separated components of each sound source from the sparse component points and from the non-sparse component points need to be matched, so that components coming from the same sound source are combined.
Considering that temporal correlation exists between speech signals from the same sound source, the correlation coefficient can be used to determine whether the separated components from the sparse and non-sparse component points match. The sparse component separated signal of one sound source consists of all separated components of that source from the sparse component points; likewise, the non-sparse component separated signal of one sound source consists of all separated components of that source from the non-sparse component points. The correlation coefficient between the different separated signals is defined by the following formula:

$$\rho_{n,m} = \frac{ \mathrm{cov}\big( \hat{s}^{\mathrm{sp}}_n,\ \hat{s}^{\mathrm{ns}}_m \big) }{ \sqrt{ \mathrm{var}\big( \hat{s}^{\mathrm{sp}}_n \big)\ \mathrm{var}\big( \hat{s}^{\mathrm{ns}}_m \big) } },$$

where $\hat{s}^{\mathrm{sp}}_n$ represents the sparse component separated signal of the $n$-th source, and $\hat{s}^{\mathrm{ns}}_m$ represents the non-sparse component separated signal of the $m$-th source. Here, $\rho_{n,m}$ is the correlation coefficient between the two, and $\mathrm{cov}(\cdot)$ and $\mathrm{var}(\cdot)$ are the covariance and variance, respectively.

Therefore, if a sparse component separated signal comes from the $n$-th sound source, the matching non-sparse component separated signal is the one with the maximum correlation:

$$m^{*} = \arg\max_{m} \rho_{n,m}.$$
Eventually, we obtain the $N$ separated signals $\hat{s}_1, \ldots, \hat{s}_N$.
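The correlation-based matching can be sketched directly. This is illustrative Python with hypothetical names: each sparse-component signal is paired with the non-sparse-component signal that maximizes the correlation coefficient.

```python
def pearson(a, b):
    """Correlation coefficient between two equal-length signals."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def match_components(sparse_sigs, non_sparse_sigs):
    """For each sparse-component separated signal, pick the index of the
    non-sparse-component signal with the highest correlation
    (assumed to come from the same source)."""
    matches = []
    for s in sparse_sigs:
        rhos = [pearson(s, ns) for ns in non_sparse_sigs]
        matches.append(max(range(len(rhos)), key=rhos.__getitem__))
    return matches
```

The matched pairs are then merged per source, after which the smoothing post-processing mentioned above is applied to the combined signals.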
4. Performance Evaluation
To evaluate the performance of the proposed separation method, a set of objective and subjective experiments is presented.
4.1. Experimental Conditions
First, the two-source scenario is taken as an example to describe the preparation of the data set for training the DNN. From the Nippon Telegraph and Telephone (NTT) database, 200 speech segments sampled at 16 kHz, including 100 male and 100 female speaker segments, were selected as source signals. Two speech segments were randomly selected from these 200 segments as the two sound source signals $s_1(n)$ and $s_2(n)$. The Roomsim software was then used to simulate the two reverberant recordings [26], and the reverberant mixed signal $x(n)$ was obtained by linear addition of the two recorded signals in the time domain. Thereafter, the mixed signal was dereverberated using the method described in Section 2. Hence, we obtained one group of training data (i.e., $x(n)$, $s_1(n)$ and $s_2(n)$). The same procedure was used to generate 400 groups of data for the training set; another 50 and 40 groups constituted the validation set and the test set, respectively. The code for the proposed methods was written in MATLAB (R2019b). The experiments were implemented on a desktop with an Intel i7 CPU at 2.9 GHz and 16 GB of memory without parallel processing; an NVIDIA GeForce RTX 2060 GPU was used in the training and testing phases.
Subsequently, the mixed speech signals for multiple-sound-source separation were generated as follows. Speech segments from the NTT database, sampled at 16 kHz and covering eight female and eight male speakers, were selected as source signals. The Roomsim software was used to simulate the acoustic rooms, with a sound speed of 340 m/s. The sound field microphone was located at the center of each simulated room, and the distance between the microphone and each sound source was 1 m.
Performance evaluation consisted of objective and subjective evaluations. For the objective evaluation, the perceptual evaluation of speech quality (PESQ) [33] and the short-time objective intelligibility (STOI) [34] were used to measure the perceptual quality and intelligibility of the separated signals, respectively. The PESQ score lies between -0.5 and 4.5; a higher PESQ score means higher perceptual quality, and a higher STOI score means higher intelligibility. The signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) were used to measure the distortion and interference levels of the separated signals; higher SDR and SIR scores mean lower distortion and interference, respectively. For the subjective evaluation, the multiple stimuli with hidden reference and anchor (MUSHRA) method was employed to further evaluate the quality of the separated speech [35]; a higher MUSHRA score means better listening quality. The reference methods were the separation algorithm based on independent component analysis (ICA) [10], the separation algorithm based on expectation maximization (EM) [36], the joint sparse and non-sparse component separation method (JSNCS) [14], and the separation method based on linearly constrained minimum variance beamforming (LCMV) [9].
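For orientation, the simplest SDR-style measure can be sketched as follows (Python, illustrative). The full BSS-eval SDR/SIR decompose the estimate into target, interference, and artifact terms via projections; that decomposition is omitted here, so this is only the basic distortion-ratio variant.

```python
import math

def sdr_db(reference, estimate):
    """Simple signal-to-distortion ratio in dB: energy of the reference
    over the energy of the residual (estimate minus reference)."""
    sig = sum(r * r for r in reference)
    err = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(sig / err) if err > 0 else float("inf")
```

The logarithmic scale explains the reading rule used above: each 10 dB of SDR corresponds to a tenfold drop in residual energy relative to the reference.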
4.2. Objective Evaluation
Five objective evaluation tests were used to analyze the performance of the proposed separation method. The first test was performed in an ideal acoustic room, and dereverberation was not conducted. First, only the DOA estimates of the sound sources were used to obtain the separated signal from the sparse component points, ignoring the non-sparse component points; the DNN alone was used to obtain the separated signal from the non-sparse component points; and the multiple-sound-source separation method combining sparse and non-sparse component point recovery was used to obtain the separated signal from all TF points. We then compared the average PESQ of these separated signals, presented with 95% confidence intervals. Conditions "Sparse" and "Non-sparse" denote the signals separated from only the sparse and only the non-sparse component points, respectively; condition "Joint-proposed" denotes the signals separated from all TF points (sparse and non-sparse component points). Tests were performed with three and four simultaneous sources. Moreover, the original sound source signal serves as the reference signal.
The average PESQ results in
Figure 8 show that the quality of the signal separated by the proposed method combining sparse and non-sparse component point recovery is higher than that of the separated signal from only sparse or non-sparse component points, which confirms the performance advantage of our proposed method.
The second test was conducted in a reverberant room with
. The aim of this test was to evaluate the performance of the proposed method with or without dereverberation. The average PESQ scores of the signals separated by the above two methods are shown in
Figure 9 with 95% confidence intervals. When the source number is two and three, source separation is
and
, respectively. Condition “With Dereverberation” is the proposed method. Condition “Without Dereverberation” means that we directly separated the reverberant mixed signal without dereverberation.
In theory, the dereverberation algorithm can maintain the components which benefit source separation (i.e., the direct components of the sound sources) and reduce the damage of reverberation to separation performance. As illustrated in
Figure 9, the proposed method with dereverberation achieves better performance than the method without it. In the case of N = 3, the improvement in average PESQ is more significant, reaching 0.14. This demonstrates that the NMF-based dereverberation method helps improve the perceived quality of the separated signals.
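As a generic reference point for the factorization at the heart of such dereverberation (not the paper's exact spectral model), the classic Euclidean multiplicative updates of NMF can be sketched as:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=300, eps=1e-9):
    # Factorize a nonnegative matrix V (F x T) as V ~= W H using the
    # standard Euclidean multiplicative updates (Lee-Seung style).
    random.seed(0)
    F, T = len(V), len(V[0])
    W = [[random.random() + eps for _ in range(rank)] for _ in range(F)]
    H = [[random.random() + eps for _ in range(T)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H
```

In NMF-based dereverberation, factors of this kind are typically used to model the reverberant magnitude spectrogram so that the direct-path component can be retained and the reverberant tail suppressed.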
In general, the above tests verify the validity and rationality of the proposed separation method. Next, we present a series of experiments to compare the performance of our proposed method with the reference methods.
The third test was performed in two reverberant rooms (i.e., Room1 with
and Room2 with
) to analyze the intelligibility of the separated signals from different separation methods. The test results are shown in
Figure 10. Condition “Proposed” is the proposed separation method; condition “JSNCS” is the reference method proposed in [
14]. Condition “LCMV” represents the separation method proposed in [
9].
It can be found that the intelligibility of the separated signals obtained by the proposed method is higher than that of the reference methods, especially in highly reverberant environments. This indicates that our proposed method achieves a satisfying quality of the separated signal. Compared to the reference methods, our proposed method utilizes a dereverberation model and a DNN for non-sparse component point recovery, which benefits the separation performance in reverberant conditions.
The fourth test was conducted in Room1 with
and Room2 with
. Source separation was
and
for the case of
and
, respectively. Statistical results with 95% confidence intervals are shown in
Figure 11. Condition “ICA” and “EM” mean the two reference methods proposed in [
10] and [
35], respectively. Condition “LCMV” is the reference method in [
9].
It is obvious in
Figure 11 that the proposed method achieves the best PESQ scores in all test conditions, which shows that it performs better than the reference methods.
To compare the average SDR and SIR of the signals separated by the proposed method and the reference methods, the fifth test was conducted in an anechoic room, Room1, and Room2. The sound source number was three, and the source separation was
. The BSS EVAL toolbox was employed in this test to measure the SDR and the SIR of the separated signals obtained by different methods [
26]. Statistical results of average SDR and SIR are shown in
Figure 12. A high SDR or SIR score indicates a high quality of the separated signal. Condition “Mix” refers to the mixed signal (i.e., the W-channel signal recorded by the sound-field microphone in this test).
It can be observed from
Figure 12 that both the average SDR and the average SIR of the signals separated by the proposed approach are the highest, indicating the lowest distortion and interference levels in the separated signals. Consequently, the proposed separation approach maintains satisfying performance and is robust in different environments.
4.3. Subjective Evaluation
The MUSHRA listening test was conducted to evaluate the perceptual quality of the separated signals [
35]. A separated signal with higher perceptual quality earns a higher score. There are two MUSHRA listening tests in this subsection. In both listening tests, the sound source number was set as
, and the source separation was
. There were 15 listeners participating in the listening tests. The first test was conducted under the ideal acoustic case to compare the proposed separation approach with the reference methods. Condition “Proposed” represents the proposed approach. Conditions “JSNCS” and “EM” denote the two reference methods, respectively. The sound source signal serves as the hidden reference, noted by condition “Hidden Ref”. Condition “Anchor” refers to the original mixed signal (i.e., the W-channel signal recorded by the sound-field microphone). Results of the first MUSHRA test are presented in
Figure 13 with 95% confidence intervals.
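The reported 95% confidence intervals can be obtained from the per-listener scores as follows. This sketch uses the normal approximation (z = 1.96); with only 15 listeners a t-based interval would be slightly wider, so treat it as an illustration of the computation rather than the paper's exact procedure, and the example scores are hypothetical:

```python
import math
import statistics

def mean_ci95(scores):
    # Mean and 95% confidence half-width via the normal approximation.
    n = len(scores)
    m = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return m, half

# e.g., hypothetical MUSHRA scores from five listeners for one condition:
m, half = mean_ci95([80, 75, 85, 78, 82])
```

Each bar in such a figure would then be drawn at the mean with error bars of plus or minus the half-width.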
It can be observed from
Figure 13 that the proposed method achieves a higher score than the reference approaches in the ideal acoustic environment. Specifically, the MUSHRA scores of the proposed approach reach or approach 80, indicating “Excellent” or “Good” subjective perceptual quality, which shows that satisfying perceptual quality of the separated signals can be obtained by our proposed method. Moreover, the MUSHRA scores decrease as the sound source number increases.
The second test was conducted under the reverberant case where
is 300 ms. MUSHRA test results are shown in
Figure 14 with 95% confidence intervals.
It can be seen from
Figure 14 that our proposed multi-source separation method greatly improves the quality of the separated signal. Compared with the other reference methods, the proposed method attains a higher MUSHRA test score in the reverberant environment. Moreover, the gaps in the test scores in the case of
are relatively small. Yet, the complexity of the EM method is high. In the case of
, the advantage of the proposed algorithm is more evident.
From the above objective and subjective evaluations, it is demonstrated that the proposed approach maintains a preferable quality of separated signals and can adapt to various acoustic environments.