Article

Speech Enhancement Based on Unidirectional Interactive Noise Modeling Assistance

1 Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2 Xiaohongshu Inc., Shanghai 200020, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 2919; https://doi.org/10.3390/app15062919
Submission received: 23 December 2024 / Revised: 2 March 2025 / Accepted: 6 March 2025 / Published: 7 March 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

It has been demonstrated that interactive speech and noise modeling outperforms traditional speech modeling-only methods for speech enhancement (SE). With a dual-branch topology that simultaneously predicts target speech and noise signals and employs bidirectional information communication between the two branches, the quality of the enhanced speech is significantly improved. However, the dual-branch topology greatly increases the model complexity and deployment cost, thus limiting its practicality. In this paper, we propose UniInterNet, a unidirectional information interaction-based dual-branch network to achieve noise modeling-assisted SE without any increase in complexity. Specifically, the noise branch still receives information from the speech branch to achieve more accurate noise modeling. Subsequently, the noise modeling results are utilized to assist the learning of the speech branch during backpropagation, while the speech branch no longer receives the auxiliary information from the noise branch, so only the speech branch is required during model deployment. Experimental results demonstrate that under the causal inference condition, the performance of UniInterNet only marginally decreases compared to the corresponding bidirectional information interaction scheme, while the model inference complexity is reduced by about 75%. With comparable overall performance, UniInterNet also outperforms previous interactive speech and noise modeling-based benchmarks in terms of causal inference and model complexity. Furthermore, UniInterNet surpasses other existing competitive methods.

1. Introduction

Speech collected by microphones is often contaminated by various types of environmental noises, resulting in inevitable degradation in the speech’s perceptual quality and intelligibility. As a solution, an increasing amount of research has focused on the speech enhancement (SE) technique, which aims to suppress the noise components and recover the clean speech. SE is one of the most prominent speech front-end signal processing techniques, and it is widely used in scenarios such as telecommunication systems, hearing aid devices, and automatic speech recognition (ASR).
Recently, deep learning (DL)-based SE methods have demonstrated outstanding performance, especially when the noise is non-stationary or the signal-to-noise ratio (SNR) is low. Some DL-based methods directly predict the enhanced speech from the raw noisy speech in an end-to-end manner in the time domain [1,2,3]. For example, DEMUCS [1] leverages a deep neural network (DNN) with multiple temporal convolutional and recurrent layers to capture both local and global features of the signal, achieving direct mapping from noisy to clean speech. Nevertheless, more methods realize SE in the time–frequency (TF) domain [4,5,6,7,8,9,10]. The TF domain methods employ a DNN model to operate on the noisy spectrum, suppress the noise components, and estimate the enhanced spectrum. In the past few years, a plethora of DNN models with different structures have been studied, such as the convolutional neural network (CNN) [11], the recurrent neural network (RNN) [12], and more recently, the Transformer [13,14]. The CNN excels at extracting high-level features but primarily focuses on local TF patterns. The RNN is able to capture long-distance contexts, but it performs poorly in the analysis of local feature information. As for the Transformer, although it has demonstrated outstanding performance in SE, it incurs significant computational costs and memory consumption. This limits its practicality, particularly in real-time processing applications. In recent years, the convolutional recurrent network (CRN) has been introduced to SE, which effectively combines the strengths of both the CNN and RNN.
By incorporating a symmetric convolutional encoder–decoder (CED) and a recurrent module, the CRN [4] was proposed to receive a noisy magnitude spectrum and predict the enhanced one. To further correct the noisy phase and improve SE performance, the CRN was subsequently extended to its complex-valued version, known as DCCRN [7]. In addition, the gated linear unit (GLU) is employed to replace the convolutional and deconvolutional layers in the CED, resulting in the GCRN [15]. The GCRN utilizes two parallel decoders to estimate the real and imaginary parts of the enhanced complex-valued speech spectrum, respectively. DPCRN [16], on the other hand, employs intra-chunk long short-term memory (LSTM) and inter-chunk LSTM layers to achieve more comprehensive sequence modeling. More recently, FRCRN [17] proposes a convolutional recurrent encoder–decoder (CRED) structure to boost the feature representation capability of the CED. ICCRN [18] abandons the frequency downsampling operation of the convolutional encoder and implements SE in the cepstral space. SICRN [19] enhances the CRN by combining a state space model and the inplace convolution operation. In general, the CRN structure is commonly utilized in SE, and CRN-based models have exhibited excellent performance. Therefore, we also adopt the CRN structure as the network backbone in constructing the proposed DNN model.
Moreover, many researchers have enhanced the SE technique from various other perspectives. CTS-Net [8] decouples the SE task into coarse magnitude spectrum estimation and fine-grained complex-valued spectrum refinement in a sequential manner, thus proposing a two-stage paradigm. Later, GaGNet [20] modifies the coarse and fine-grained estimation in CTS-Net from sequential to parallel processing, which prevents error accumulation and ensures hierarchical optimization toward the complex-valued spectrum. Meanwhile, some works implement SE using multi-domain processing techniques. For example, CompNet [21] enhances the noisy speech sequentially in both the time and TF domains, and FDFNet [22] processes the speech spectrum sequentially in the Fourier transform and discrete cosine transform domains. Moreover, several full-band and sub-band fusion SE models have been proposed to integrate local sub-band spectrum features with global cross-band dependencies for better performance, including FullSubNet [23], FullSubNet+ [24], and LFSFNet [25]. Based on these models, Inter-SubNet [26] discards the full-band model and utilizes a sub-band interaction method to supplement global spectrum patterns for the sub-band model. Furthermore, generative SE approaches, such as NASE [27], have also garnered increasing attention.
All of the aforementioned methods take the clean speech as the prediction target, so they can be referred to as speech-prediction methods. However, these methods lack explicit analysis and modeling of the characteristics of the noise components. In addition, predicting the clean speech from low-SNR signals can be challenging, potentially resulting in poor quality of the enhanced speech. Consequently, the noise-prediction method [28] attempts to predict the pure noise, which is then subtracted from the noisy speech to indirectly obtain the enhanced speech, but the benefit is still limited. Motivated by the speech-prediction and noise-prediction methods, some subsequent methods [29,30,31] propose a dual-branch network to simultaneously predict the target speech and noise signals. Additionally, a bidirectional information interaction module is typically introduced between the speech and noise branches to further optimize the speech and noise modeling process [29]. Although the interactive speech and noise modeling-based dual-branch network has demonstrated excellent performance, it significantly increases the model complexity. Unfortunately, such a large model complexity may be intolerable on many resource-constrained devices.
In this work, we propose a unidirectional information interaction-based dual-branch network, abbreviated as UniInterNet. In UniInterNet, intermediate features only flow from the speech branch to the noise branch, thereby promoting noise modeling, while the information flow from the noise branch to the speech branch is discarded. In this way, the speech branch does not depend on any information from the noise branch when generating the enhanced speech. Therefore, the noise branch is utilized to assist the learning of the speech branch by backpropagating the noise modeling results during the model training stage, and it is not needed in the deployment phase. Compared to the previous dual-branch methods [29,31], the proposed UniInterNet not only retains the benefit of noise prediction for SE but also reduces the deployment cost by avoiding the use of the dual-branch network in the model inference stage. Moreover, previous methods typically employ the same network structure to construct the speech and noise branches. In contrast, we adopt a more complicated and even non-causal network structure for the noise branch, while limiting the complexity and ensuring the causality of the speech branch. As a result, the noise modeling accuracy is guaranteed, and the computational cost and algorithm delay during the model inference phase are also satisfactory.
In addition, we adopt short-time discrete cosine transform (STDCT) [32] instead of the most commonly used short-time Fourier transform (STFT) for TF transformation. Unlike the complex-valued STFT spectrum, the STDCT spectrum is real-valued and implicitly contains both the magnitude and phase information. This escapes the phase estimation problem in the STFT-based methods, thereby simplifying the design of DNN and improving the computational efficiency.
The contributions of this study are as follows: (1) A unidirectional information interaction scheme is proposed to optimize the previous bidirectional interactive speech and noise modeling-based dual-branch SE framework, so as to effectively reduce the model deployment cost without significant performance degradation. (2) With both the speech and noise branches utilizing the CRN as the backbone, experiments illustrate that the information interactions between the two branches’ encoders, recurrent modules, and decoders are all beneficial and necessary. (3) While ensuring causal inference, experimental results demonstrate the superiority of the proposed unidirectional information interaction-based SE model over existing benchmarks. The performance of UniInterNet is evaluated on both the VoiceBank+DEMAND [33] and DNS-Challenge [34] datasets.

2. Materials and Methods

2.1. Materials

In this study, we formulate the SE signal model in the TF domain using the STDCT and further define the predicting targets for the proposed SE network.
Given the noisy speech x ( n ) , it can be formulated as follows:
$$x(n) = s(n) + z(n)$$ (1)
where $s(n)$ and $z(n)$ denote the clean speech and additive noise, respectively, and $n$ represents the discrete time index.
In order to obtain the STDCT spectrum of signal, we need to perform framing and windowing operations on the time-domain waveform and then apply the discrete cosine transform (DCT) to each frame. Correspondingly, we can adopt inverse windowing, inverse discrete cosine transform (IDCT), and overlap-and-add operations to transform the STDCT spectrum back to the time-domain signal. The DCT and IDCT, respectively, are defined as follows:
$$X(\mu) = c(\mu)\sqrt{\frac{2}{N}}\sum_{n=0}^{N-1} x(n)\cos\frac{\pi\mu(2n+1)}{2N}$$ (2)
$$x(n) = \sqrt{\frac{2}{N}}\sum_{\mu=0}^{N-1} c(\mu)\,X(\mu)\cos\frac{\pi\mu(2n+1)}{2N}$$ (3)
where $\mu \in \{0, 1, \ldots, N-1\}$ indexes the $N$ frequency bins from low to high. The parameter $c(\mu)$ is defined as follows:
$$c(\mu) = \begin{cases} \dfrac{1}{\sqrt{2}}, & \mu = 0 \\ 1, & \mu = 1, 2, \ldots, N-1 \end{cases}$$ (4)
According to Equation (2), it can be found that DCT operates in the real-valued domain. Therefore, the STDCT spectrum is a kind of real-valued spectrum. Furthermore, with STDCT, Equation (1) can be transformed as follows:
$$X_{t,\mu} = S_{t,\mu} + Z_{t,\mu}$$ (5)
where $X_{t,\mu} \in \mathbb{R}$, $S_{t,\mu} \in \mathbb{R}$, and $Z_{t,\mu} \in \mathbb{R}$ denote the STDCT spectra of the noisy, clean, and noise signals, respectively, and $t$ and $\mu$ index the time frame and the frequency bin. In the following, we omit the time and frequency indices for brevity.
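To make the transform concrete, the following is a minimal STDCT/ISTDCT sketch in Python under the settings later given in Section 3.2 (16 kHz audio, a 32 ms Hamming window, an 8 ms hop, and a 512-point DCT). The orthonormal SciPy DCT-II matches Equations (2)–(4); the squared-window normalization in the overlap-add synthesis is our own assumption and not a detail specified in the paper.

```python
# Minimal STDCT/ISTDCT sketch; stdct/istdct are illustrative names, not the authors' code.
import numpy as np
from scipy.fft import dct, idct

def stdct(x, frame_len=512, hop=128):
    """Frame, window, and apply an orthonormal DCT-II per frame (Eq. (2))."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop: t * hop + frame_len] * win
                       for t in range(n_frames)])            # (T, N)
    return dct(frames, type=2, norm='ortho', axis=-1)         # real-valued spectrum

def istdct(X, hop=128):
    """Inverse DCT per frame (Eq. (3)) followed by overlap-add synthesis."""
    frames = idct(X, type=2, norm='ortho', axis=-1)
    frame_len = frames.shape[-1]
    win = np.hamming(frame_len)
    out = np.zeros((X.shape[0] - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for t, frame in enumerate(frames):
        out[t * hop: t * hop + frame_len] += frame * win
        norm[t * hop: t * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)                       # window compensation
```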
We adopt the masking method [35,36] to recover both the clean speech and pure noise from the noisy speech. Therefore, the predicting targets of speech and noise branches in UniInterNet are:
  • Speech branch:
$$M_s = \frac{S}{X}$$ (6)
  • Noise branch:
$$M_z = \frac{Z}{X}$$ (7)
where $M_s$ and $M_z$ are the DCT ideal ratio masks (DCTIRM) for speech and noise estimation, respectively.
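As a small illustration, the DCTIRM targets of Equations (6) and (7) and the masking operation used later in Equation (8) can be computed as below. The epsilon guard against near-zero denominators is our own numerical-stability assumption, since the STDCT spectrum is signed and may contain values close to zero.

```python
import numpy as np

def dct_irm_targets(S, Z, X, eps=1e-8):
    """DCTIRM targets; S, Z, X are the STDCT spectra of clean, noise, and noisy speech."""
    denom = np.where(np.abs(X) > eps, X, eps)   # avoid division by (near-)zero bins
    return S / denom, Z / denom                 # M_s, M_z

def apply_mask(M_hat, X):
    """Element-wise masking before ISTDCT, as in Equation (8)."""
    return M_hat * X
```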

2.2. Methods

The overall network architecture of UniInterNet is shown in Figure 1. The input is the real-valued noisy spectrum $X$ derived from the noisy speech $x$ by STDCT. UniInterNet consists of parallel speech and noise branches, which are responsible for clean speech and pure noise modeling, respectively. Both branches employ the CRN structure. To better capture the sequential correlations along both the time and frequency dimensions, we stack multiple time–frequency sequence modeling (TFSM) blocks [22] to construct the recurrent module. A skip connection between the encoder and decoder is used in each branch, enabling a more efficient information flow. The proposed unidirectional interaction modules are inserted between the encoders, recurrent modules, and decoders of speech and noise branches. This realizes unidirectional information communication from the speech branch to the noise branch. Finally, the output of the speech branch is the estimated DCTIRM $\hat{M}_s$ for target speech recovery:
$$\hat{s} = \mathrm{ISTDCT}(\hat{M}_s \otimes X)$$ (8)
where ⊗ denotes the element-wise multiplication operator.
Similarly, the output $\hat{M}_z$ of the noise branch is used to reconstruct the noise signal as follows:
$$\hat{z} = \mathrm{ISTDCT}(\hat{M}_z \otimes X)$$ (9)
Additionally, in order to reduce the algorithm delay during practical deployment, we ensure the causality of the CRN structure in the speech branch. However, since the noise branch does not need to be deployed after the model training stage, the CRN structure in the noise branch adopts a non-causal version, which is more complicated but stronger than the causal one.
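To make this training-versus-deployment asymmetry concrete, the following schematic PyTorch sketch runs both branches during training but only the causal speech branch at inference; speech_branch, noise_branch, and the way features are returned are placeholders for illustration, not the authors' implementation.

```python
import torch.nn as nn

class UniInterNetSketch(nn.Module):
    def __init__(self, speech_branch, noise_branch):
        super().__init__()
        self.speech_branch = speech_branch      # causal CRN
        self.noise_branch = noise_branch        # non-causal CRN, used in training only

    def forward(self, noisy_stdct, training=True):
        # The speech branch always runs and also exposes its intermediate features.
        m_s_hat, speech_feats = self.speech_branch(noisy_stdct)
        if not training:
            return m_s_hat                      # deployment: speech branch only
        # Training: the noise branch consumes the speech-branch features through the
        # unidirectional interaction modules and predicts the noise mask.
        m_z_hat = self.noise_branch(noisy_stdct, speech_feats)
        return m_s_hat, m_z_hat
```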
The details of the encoder, decoder, TFSM block, unidirectional interaction module, and loss function are described as follows.

2.2.1. Encoder and Decoder

As shown in Figure 1, the noisy STDCT spectrum X is simultaneously fed into the speech and noise encoders, aiming to extract speech-related and noise-related feature representations. The encoders in both branches include multiple two-dimensional (2D) convolutional (Conv2d) blocks. The detail of the Conv2d block is illustrated in Figure 2a, comprising a 2D convolutional layer, a 2D batch normalization layer, and a PReLU layer.
Correspondingly, a decoder structurally symmetrical to the encoder is employed in both the speech and noise branches. The decoder processes the output of the recurrent module and reconstructs a spectrum mask of the same size as the input spectrum X. The decoders in both branches consist of multiple 2D deconvolutional (DeConv2d) blocks. As shown in Figure 2b, each DeConv2d block includes a 2D deconvolutional layer, a 2D batch normalization layer, and a PReLU layer. Note that the non-linear activation function of the last DeConv2d block is a modified Tanh function like [37].
In the speech branch, the encoder and decoder adopt causal convolutional and deconvolutional layers, respectively, which are achieved through asymmetric zero-padding and frame-discarding operations in the time dimension. The kernel size and stride in the frequency and time dimensions are (5, 2) and (2, 1), respectively. In the noise branch, the kernel size and stride of all non-causal convolutional and deconvolutional layers are (5, 3) and (2, 1), which improves the noise modeling capability. In addition, the output channels of the convolutional layers in the encoders are {16, 32, 64, 128, 160}, and the output channels of the deconvolutional layers in the decoders are {128, 64, 32, 16, 1}.
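For illustration, a minimal PyTorch sketch of one causal Conv2d block in the speech-branch encoder is given below, assuming a (batch, channel, frequency, time) tensor layout. The symmetric frequency padding is our own assumption; the past-only time padding is what makes the convolution causal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv2dBlock(nn.Module):
    """Conv2d -> BatchNorm2d -> PReLU with causal (past-only) padding in time."""
    def __init__(self, in_ch, out_ch, kernel=(5, 2), stride=(2, 1)):
        super().__init__()
        self.kernel = kernel
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):                       # x: (B, C, F, T)
        k_f, k_t = self.kernel
        # Pad frequency symmetrically (assumption), pad time only on the past side.
        x = F.pad(x, (k_t - 1, 0, k_f // 2, k_f // 2))
        return self.act(self.bn(self.conv(x)))
```

The full speech-branch encoder would stack five such blocks with output channels {16, 32, 64, 128, 160}, and the decoder mirrors this structure with deconvolutional blocks.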

2.2.2. Time–Frequency Sequence Modeling Block

In both the speech and noise branches, the recurrent modules are deployed to receive the outputs of their respective encoders and model the sequential dependencies within the encoded features. Different from the traditional CRN architecture, which only utilizes multiple RNN layers to capture temporal correlations in the encoded features, we additionally integrate sequence modeling along the frequency dimension. Similar to the works in [16,38], we adopt a recurrent module comprising multiple time–frequency sequence modeling (TFSM) blocks [22] to simultaneously model the local and global context along the time and frequency dimensions.
Figure 3 illustrates the diagram of the TFSM block. The input representation I is processed by two consecutive sequence modeling sub-blocks, which are designed to apply sequence modeling in the frequency and time dimensions, respectively. To ensure causal inference, we adopt the causal TFSM block in the speech branch. Specifically, the TFSM block in the speech branch utilizes a gated recurrent unit (GRU) layer to capture sequential correlations in the time dimension. In contrast, the TFSM block in the noise branch employs a bidirectional GRU (BiGRU) layer for temporal sequence modeling. Since the sequence modeling in the frequency dimension does not affect causal inference, the BiGRU layer is used in both branches’ TFSM blocks. After the GRU or BiGRU layer, a layer normalization and a PReLU function are utilized for better performance. Additionally, the reshape operations serve the purpose of realizing sequence modeling in two different dimensions and recovering the sequence modeling result to the same shape as the input I. Eventually, a residual connection is employed to sum the original input I and the result after sequence modeling, yielding the TFSM block’s output O.
In both the speech and noise branches, we stack 3 TFSM blocks, whose BiGRU/GRU hidden units are {128, 64, 32}.
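The following is a rough PyTorch sketch of the causal (speech-branch) TFSM block: a BiGRU scans the frequency axis, a unidirectional GRU scans the time axis, each followed by layer normalization and a PReLU, and a residual connection sums the block input and output. The linear projections mapping the RNN outputs back to the channel dimension are our own assumption; the noise-branch variant would simply swap the time GRU for a BiGRU.

```python
import torch
import torch.nn as nn

class TFSMBlock(nn.Module):
    def __init__(self, channels, hidden):
        super().__init__()
        self.freq_rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.freq_norm = nn.LayerNorm(channels)
        self.freq_act = nn.PReLU()
        self.time_rnn = nn.GRU(channels, hidden, batch_first=True)   # causal in time
        self.time_proj = nn.Linear(hidden, channels)
        self.time_norm = nn.LayerNorm(channels)
        self.time_act = nn.PReLU()

    def forward(self, x):                                    # x: (B, C, F, T)
        b, c, f, t = x.shape
        # Frequency sequence modeling (does not affect causality).
        h = x.permute(0, 3, 2, 1).reshape(b * t, f, c)       # (B*T, F, C)
        h = self.freq_act(self.freq_norm(self.freq_proj(self.freq_rnn(h)[0])))
        h = h.reshape(b, t, f, c)
        # Temporal sequence modeling with a unidirectional GRU.
        h = h.permute(0, 2, 1, 3).reshape(b * f, t, c)       # (B*F, T, C)
        h = self.time_act(self.time_norm(self.time_proj(self.time_rnn(h)[0])))
        out = h.reshape(b, f, t, c).permute(0, 3, 1, 2)      # back to (B, C, F, T)
        return x + out                                       # residual connection
```

Three such blocks with hidden sizes {128, 64, 32} would be stacked on top of the encoder output.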

2.2.3. Unidirectional Interaction Module

A previous study [29] has demonstrated that the feature information interaction between the speech and noise branches can effectively enhance the modeling capabilities for speech and noise, leading to better recovery of the target speech and pure noise signals. Nevertheless, the bidirectional information interaction mechanism implies that during the practical deployment, the SE model must conduct forward propagation calculations in both the speech and noise branches. This results in a substantial growth in model size and computational complexity compared to conventional single-branch DNN-based methods. In this work, we introduce a unidirectional interaction module, as illustrated in Figure 4. This module exclusively facilitates unidirectional feature transfer from the speech branch to the noise branch.
In the proposed unidirectional interaction module, the intermediate feature representation $F_s$ from the speech branch is fed into a 2D convolutional gated linear unit (ConvGLU) [15], which extracts the interactive feature information $F_{inter}$ as follows:
$$F_{inter} = (W_1 * F_s + b_1) \otimes \sigma(W_2 * F_s + b_2)$$ (10)
where $*$ represents convolution operations with kernels $W_1 \in \mathbb{R}^{5 \times 2}$ and $W_2 \in \mathbb{R}^{5 \times 2}$, as well as biases $b_1 \in \mathbb{R}^{5 \times 2}$ and $b_2 \in \mathbb{R}^{5 \times 2}$. $\otimes$ and $\sigma$ denote element-wise multiplication and the sigmoid function, respectively.
Afterwards, $F_{inter}$ is concatenated along the channel dimension with the intermediate feature representation $F_z$ from the noise branch to obtain the integrated feature $F_z^{integ}$ as follows:
$$F_z^{integ} = [F_z; F_{inter}]$$ (11)
Finally, the integrated feature $F_z^{integ}$ is sent to the subsequent noise modeling network.
Therefore, in the model training stage, the gradient information related to noise modeling results can be backpropagated into the speech branch through the unidirectional interaction modules, thereby updating the speech branch’s parameters. This further promotes the optimization of the speech branch.
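A hedged PyTorch sketch of the unidirectional interaction module of Equations (10) and (11) is given below; the causal padding and the assumption that the output channel count equals the input channel count are ours. Because $F_s$ enters the noise branch only through this module, gradients of the noise loss flow back into the speech branch exactly as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniInteraction(nn.Module):
    def __init__(self, channels, kernel=(5, 2)):
        super().__init__()
        self.kernel = kernel
        self.feat = nn.Conv2d(channels, channels, kernel)   # W1 * F_s + b1
        self.gate = nn.Conv2d(channels, channels, kernel)   # W2 * F_s + b2

    def forward(self, f_s, f_z):                             # both: (B, C, F, T)
        k_f, k_t = self.kernel
        f_pad = F.pad(f_s, (k_t - 1, 0, k_f // 2, k_f // 2))
        f_inter = self.feat(f_pad) * torch.sigmoid(self.gate(f_pad))  # ConvGLU, Eq. (10)
        return torch.cat([f_z, f_inter], dim=1)              # integrated feature, Eq. (11)
```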

2.2.4. Loss Function

Since the proposed UniInterNet simultaneously predicts the speech and noise signals during the training process, the total loss function L comprises two components as follows:
$$\mathcal{L} = \mathcal{L}_s + \mathcal{L}_z$$ (12)
where $\mathcal{L}_s$ and $\mathcal{L}_z$ represent the speech loss and noise loss, respectively. Furthermore, both $\mathcal{L}_s$ and $\mathcal{L}_z$ are hybrid losses. Their definitions are as follows:
$$\mathcal{L}_s = \langle s, \hat{s} \rangle_1 + \langle M_s, \hat{M}_s \rangle_F^2$$ (13)
$$\mathcal{L}_z = \langle z, \hat{z} \rangle_1 + \langle M_z, \hat{M}_z \rangle_F^2$$ (14)
where $\langle \cdot, \cdot \rangle_1$ denotes the L1 loss between the target and predicted time-domain waveforms, and $\langle \cdot, \cdot \rangle_F^2$ represents the mean square error (MSE) loss between the target and predicted TF-domain masks.
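In PyTorch terms, the hybrid loss of Equations (12)–(14) can be sketched as follows.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(s, s_hat, m_s, m_s_hat, z, z_hat, m_z, m_z_hat):
    """L1 loss on time-domain waveforms plus MSE loss on STDCT-domain masks, summed over branches."""
    loss_s = F.l1_loss(s_hat, s) + F.mse_loss(m_s_hat, m_s)
    loss_z = F.l1_loss(z_hat, z) + F.mse_loss(m_z_hat, m_z)
    return loss_s + loss_z
```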

3. Experimental Setup

3.1. Datasets

To evaluate the effectiveness of the proposed methods, we conduct experiments on two publicly available datasets.

3.1.1. VoiceBank+DEMAND

The VoiceBank+DEMAND dataset [33] is an open-source dataset widely used in SE research. It consists of a training set of 11,572 utterances from 28 speakers and a test set of 824 utterances from another 2 unseen speakers. The clean speech utterances are selected from the VoiceBank corpus [39], with around 400 utterances per speaker. The 28 speakers in the training set include 14 males and 14 females, while the 2 unseen speakers in the test set include 1 male and 1 female. The clean audios in the training set are mixed with 10 types of noise, including 2 artificially generated noises (speech-shaped noise and babble) and 8 real noise recordings from the DEMAND database [40] (domestic noise in a kitchen, office noise in a meeting room, cafeteria noise, restaurant noise, subway station noise, car noise, metro noise, and street noise at a busy traffic intersection). The mixing SNRs in the training set are {0 dB, 5 dB, 10 dB, 15 dB}. In the test set, the clean audios are mixed with five other types of noise from the DEMAND database that do not appear in the training set, namely, domestic noise in a living room, office noise in an office space, bus noise, open area cafeteria noise, and public square noise. The mixing SNRs in the test set are {2.5 dB, 7.5 dB, 12.5 dB, 17.5 dB}.

3.1.2. DNS-Challenge

The Deep Noise Suppression (DNS) challenge [34] at Interspeech 2020 provides a large dataset. It includes over 500 h of clean audio across 2150 speakers from the Librivox (https://librivox.org/ (accessed on 20 September 2024)) dataset and over 180 h of noise audio from Audioset (https://research.google.com/audioset/ (accessed on 20 September 2024)) [41] and Freesound (https://freesound.org/ (accessed on 20 September 2024)) for training. The noise dataset contains around 150 types of noise, with each type having at least 500 clips to ensure a balanced distribution. Meanwhile, the DNS-Challenge provides a non-blind test set of 150 noisy-clean audio pairs for model performance evaluation, whose mixing SNRs are randomly distributed between 0 dB and 20 dB. The clean audios in the test set are sourced from another 20 unseen speakers in the Graz University clean speech dataset [42], ensuring no speaker overlap between the training and test sets. Meanwhile, the noise clips in the test set are sampled from Audioset and Freesound and are not present in the training set. Using the official scripts (https://github.com/microsoft/DNS-Challenge (accessed on 20 September 2024)), we generate a total of 3000 h of noisy-clean audio pairs as the training set, in which the mixing SNRs range from −5 dB to 15 dB.

3.2. Training Configurations

All utterances are resampled to 16 kHz. During STDCT and ISTDCT, a Hamming window of length 32 ms is employed, with a hop size of 8 ms between consecutive frames. The DCT size is set to 512 points. In the training phase, we utilize the RMSprop optimizer and a dynamic learning rate strategy. The learning rate is initially set to 2 × 10−4, and it is halved if the model performance does not improve for 8 consecutive epochs. The batch size and total number of training epochs are 16 and 100, respectively.
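One way to reproduce this schedule with PyTorch is sketched below; the use of ReduceLROnPlateau, the monitored validation loss, the stand-in model, and the run_training_epoch helper are all assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

model = nn.GRU(512, 128)                      # stand-in; the real UniInterNet replaces this
optimizer = torch.optim.RMSprop(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=8)          # halve LR after 8 stagnant epochs

for epoch in range(100):                      # 100 epochs, batch size 16
    val_loss = run_training_epoch()           # placeholder for the actual training loop
    scheduler.step(val_loss)
```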

3.3. Evaluation Metrics

To quantitatively evaluate the performance of different SE models, we employ several commonly used objective metrics. In the evaluation on the VoiceBank+DEMAND dataset, we adopt four evaluation metrics: wide-band perceptual evaluation of speech quality (WB-PESQ) [43] and three mean opinion score (MOS) metrics [44] (i.e., CSIG, CBAK, and COVL). WB-PESQ is used to evaluate perceptual speech quality with a score range from −0.5 to 4.5. CSIG, CBAK, and COVL are MOS predictions of signal distortion, background noise suppression, and overall audio quality, respectively. All three MOS scores range from 1 to 5. For the DNS-Challenge dataset, we take WB-PESQ, narrow-band perceptual evaluation of speech quality (NB-PESQ) [45], short-time objective intelligibility (STOI) [46], and scale-invariant signal-to-noise ratio (SI-SNR) [47] as performance evaluation metrics. STOI measures speech intelligibility, whose score ranges from 0 to 1. SI-SNR measures the ratio of the power of the clean signal to the power of the noise in a way that is invariant to the scale of the signal. The selection of the aforementioned SE performance evaluation metrics is guided by established practices, community consensus, and the focuses and characteristics of the datasets. Higher values for all above metrics indicate better speech quality.
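For reference, SI-SNR [47] can be computed as in the following sketch; the zero-mean normalization of both signals follows the common definition and is stated here as an assumption.

```python
import numpy as np

def si_snr(clean, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between a clean reference and an estimate."""
    clean = clean - clean.mean()
    estimate = estimate - estimate.mean()
    s_target = np.dot(estimate, clean) / (np.dot(clean, clean) + eps) * clean
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) /
                         (np.dot(e_noise, e_noise) + eps))
```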
In addition, we employ multiply–accumulate operations (MACs) to quantify the computational complexity of the model. This value reflects the total computational cost of a single forward propagation of the DNN model. We adopt the Python (3.7.12) package thop (https://pypi.org/project/thop/ (accessed on 26 September 2024)) to calculate this metric. Additionally, the thop package also allows for the simultaneous calculation of the model size.
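A minimal usage example of thop is given below; the stand-in module and the dummy input shape for one second of 16 kHz audio (512 frequency bins, 125 frames) are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from thop import profile

# Stand-in module; in practice this would be the deployed speech branch.
model = nn.Sequential(nn.Conv2d(1, 16, kernel_size=(5, 2)), nn.PReLU())

dummy = torch.randn(1, 1, 512, 125)               # (batch, channel, freq bins, frames), hypothetical layout
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.3f} G, Params: {params / 1e6:.3f} M")
```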

3.4. Baselines

With the aim of demonstrating the effectiveness and superiority of the proposed unidirectional information interaction scheme, we implement the conventional single-branch network (SiNet) and bidirectional information interaction-based dual-branch network (BiInterNet) as the baselines.

3.4.1. SiNet

Consistent with the most typical paradigm [4], we employ a single speech prediction branch for SE, abbreviated as SiNet. SiNet has exactly the same network structure and parameter settings as the causal speech branch in the proposed UniInterNet, and it also adopts the masking method as Equation (8) to predict the enhanced speech.

3.4.2. BiInterNet

BiInterNet follows the existing bidirectional interactive speech and noise modeling-based dual-branch network topology. Compared to UniInterNet, the causal speech branch in BiInterNet remains unchanged, while the noise branch is also converted to a causal version. This is because the noise branch of BiInterNet needs to be deployed during model inference. Meanwhile, similar to the previous work [29], the information flow between the speech and noise branches in BiInterNet is bidirectional. The intermediate features of the speech and noise branches flow to each other after being processed by the interaction module, as proposed in Section 2.2.3.

4. Results and Discussion

4.1. Ablation Study

We perform the ablation study on the VoiceBank+DEMAND dataset to analyze the proposed improvements.

4.1.1. Performance Gains of Unidirectional Information Interaction Scheme

As shown in Table 1, we systematically compare the proposed UniInterNet with two baselines in terms of causality, model size, computational complexity, and performance metrics. From the results in Table 1, one can make the following observations.
First, when only a single speech prediction branch is utilized, i.e., SiNet, its performance is clearly worse than that of UniInterNet. Therefore, adding a noise branch with unidirectional information flow effectively promotes the optimization of the speech branch. This results in improvements in WB-PESQ, CSIG, CBAK, and COVL by 0.15, 0.15, 0.10, and 0.15, respectively. Meanwhile, since the noise branch and unidirectional interaction module of UniInterNet are not required during the model inference stage, UniInterNet has the same deployment cost as SiNet.
Second, the dual-branch network with bidirectional information communication, namely, BiInterNet, indeed achieves higher objective evaluation scores than UniInterNet. However, the number of model parameters of BiInterNet is approximately 4.28 times that of UniInterNet, and the computational cost is about 3.94 times higher. This increase is due to the introduction of the noise branch and the bidirectional interaction modules. Nevertheless, the performance gain of BiInterNet over UniInterNet is relatively limited. In short, UniInterNet achieves comparable performance to BiInterNet while reducing the inference complexity by about 75%.
Third, if the noise branch of UniInterNet switches to the causal setting, i.e., UniInterNet-CausalNoise in Table 1, it will lead to performance degradation. Since the causality of the noise branch has no influence on algorithm delay and computational complexity during model deployment, it is better to adopt a non-causal setting for the noise branch of UniInterNet.
Moreover, a comparison of the spectra of the baselines and the proposed UniInterNet is shown in Figure 5. One can see that there is still obvious residual noise in the spectrum processed by SiNet, and the spectrum processed by UniInterNet-CausalNoise has a similar problem. In contrast, the spectra enhanced by BiInterNet and UniInterNet exhibit TF patterns more similar to those of the clean speech. Furthermore, the differences between the spectra processed by BiInterNet and UniInterNet are not significant, although the former yields relatively more ideal results. These observations are generally consistent with the quantitative comparison results mentioned above.

4.1.2. Effects of Information Interaction at Different Parts of the Network

We subsequently investigate the effects of the unidirectional interaction modules between the speech and noise branches at different parts of the network, including the encoder, recurrent module, and decoder. By analyzing the performance variations, we aim to identify the optimal placement for the interaction module to maximize the effectiveness of the unidirectional information flow. Specifically, we have experimented with removing the unidirectional interaction modules from one of the three parts, namely, the encoder, recurrent module, and decoder, while retaining the unidirectional interaction modules in the other two parts. The experimental results are illustrated in Table 2. It can be observed that removing the unidirectional interaction module from any position in the UniInterNet leads to performance degradation. Therefore, the information interactions between the encoders, recurrent modules, and decoders are all necessary for better performance. In addition, the information interaction between decoders is the most critical, as removing the interaction module between the decoders of the speech and noise branches (i.e., UniInterNet w/o DecInter) leads to the most noticeable performance loss. The second most significant impact comes from removing the interaction module between the encoders (i.e., UniInterNet w/o EncInter). In contrast, removing the interaction module between the recurrent modules of the two branches (i.e., UniInterNet w/o RecInter) results in the least performance degradation.
To better emphasize the advantage of the unidirectional information interaction at different parts, an example of the spectrum of noisy speech, clean speech, and enhanced speech processed by UniInterNet w/o EncInter, UniInterNet w/o RecInter, UniInterNet w/o DecInter, and UniInterNet is presented in Figure 6. We can observe that UniInterNet performs best in terms of background noise suppression and target spectrum recovery. Meanwhile, the residual noise in the spectrum processed by UniInterNet w/o DecInter is more obvious than that in UniInterNet w/o EncInter and UniInterNet w/o RecInter, while the residual noise in the spectrum processed by UniInterNet w/o RecInter is relatively minimal. This trend is generally consistent with the analysis of the quantitative evaluation metrics above.
In summary, the information interaction between the two branches is beneficial for high-level feature extraction, sequence modeling analysis, and target mask reconstruction, with the greatest impact observed in target mask reconstruction.

4.2. Performance Analysis over Different Signal-to-Noise Ratios (SNRs)

As shown in Table 3, we investigate the performance of the proposed UniInterNet and baseline models across various SNRs. The performance evaluation results presented in Table 3 lead to the following findings: (1) Compared to SiNet, both BiInterNet and the proposed UniInterNet achieve significant performance improvements over different SNRs, with the performance gains in the low SNR range generally being larger than those in the high SNR range. (2) The performance metrics scores of the proposed UniInterNet are generally lower than those of BiInterNet across different SNRs, but as the SNR increases, the performance gap between UniInterNet and BiInterNet narrows overall. This implies that when the noise component is weak, the benefit of the interactive information from the noise prediction branch to the speech prediction branch may decrease. (Note: The standard deviation data for Table 3 are presented in Table A1 in the Appendix A.)

4.3. Performance Comparison with Previous Methods Based on Bidirectional Interactive Speech and Noise Prediction

To further highlight the advantages of our proposed UniInterNet over previous methods based on bidirectional interactive speech and noise prediction, we perform a comprehensive performance comparison with SN-Net [29]. SN-Net consists of parallel speech and noise prediction branches, with bidirectional information interaction modules placed between them. These modules extract speech or noise-related features, facilitating more accurate predictions of speech and noise signals.
The performance comparison results of our proposed scheme and SN-Net are presented in Table 4. For SN-Net, we not only cite the evaluation metric scores from the original paper but also reproduce this method and calculate the performance evaluation results ourselves to ensure a fair comparison. Our reproduced results closely match those reported in the original paper across all evaluation metrics, confirming the correctness of our reproduction of SN-Net. Due to the utilization of self-attention for global dependency modeling along the temporal dimension, as well as the non-causal convolutional and deconvolutional layers, SN-Net is a non-causal model. As a result, SN-Net requires the future speech information to predict the current enhanced speech frame. Moreover, the dual-branch structure with self-attention contributes to a large model size and high computational complexity in SN-Net. In contrast, the baseline model BiInterNet in this work achieves similar overall performance while ensuring causal inference. In addition, BiInterNet has fewer model parameters and lower computational cost. This demonstrates the benefits of our proposed system. Furthermore, compared to SN-Net, the proposed UniInterNet reduces the model parameters by about 77% and computational cost by about 80%, while retaining the strength of causal inference. Meanwhile, the performance degradation of UniInterNet is acceptable. Therefore, UniInterNet has significant advantages over SN-Net, especially in scenarios with limited computing resources and processing latency.

4.4. Performance Comparison with Existing Advanced Systems

We have also conducted the performance comparison with existing advanced methods on both VoiceBank+DEMAND and DNS-Challenge datasets. These benchmarks take various types of input, including time-domain waveform, STFT magnitude spectrum, STFT complex spectrum, etc.
The comparison results on the VoiceBank+DEMAND dataset are illustrated in Table 5. Eight different SE systems are selected for comparative analysis. The first category of systems directly processes the noisy waveform using a DNN to predict the enhanced signal in the time domain, such as DEMUCS [1]. The second category includes methods that operate on the STFT complex-valued spectrum with a CRN (i.e., DCCRN [7]) or a diffusion model (i.e., NASE [27]). The third category consists of methods employing the two-stage technique for both coarse magnitude spectrum estimation and fine-grained complex-valued STFT spectrum refinement, including CTS-Net [8] and GaGNet [20]. In addition, multi-domain processing methods, such as CompNet [21] and FDFNet [22], as well as full-band and sub-band fusion systems such as LFSFNet [25], are also included. The proposed UniInterNet matches or outperforms the benchmarks across all the evaluation metrics. It should be noted that although the performance gain of UniInterNet over FDFNet [22] is relatively modest, the model parameters and computational cost of FDFNet are 4.43 M and 10.66 GMACs/s, respectively, which are significantly higher than those of UniInterNet.
For the DNS-Challenge dataset, we select eight advanced systems as benchmarks for comparison: NSNet [34] takes the log-power spectrum as input and employs a neural network based on RNN; SICRN [19] follows the CRN architecture while preserving the inherent frequency structure of the signal using a state space model and inplace convolution; CTS-Net [8] and GaGNet [20] adopt the two-stage framework for enhanced performance; FullSubNet+ [24], Inter-SubNet [26], and FS-CANet [48] incorporate both local sub-band spectrum features and global cross-band correlations; and SEnet+Ran-net [49] imitates human cognitive behavior by separating the regular components of noise from its random components. The results in Table 6 demonstrate that UniInterNet achieves the best scores on most of the evaluation metrics, except that GaGNet [20] performs better on SI-SNR. These findings underscore the effectiveness of the proposed UniInterNet.
The processed audio samples of the VoiceBank+DEMAND and DNS-Challenge datasets can be found at https://github.com/Zhangyuewei98/UniInterNet accessed on 18 December 2024.

5. Conclusions

In this work, we design a novel SE model architecture based on a dual-branch network and auxiliary noise prediction. A unidirectional interaction module is proposed and inserted between the two branches to transfer useful information from the speech branch to the noise branch. As a result, the noise branch is not needed during the model deployment phase, which reduces the deployment cost compared to previous bidirectional interactive speech and noise modeling-based methods. Additionally, the speech branch is designed to be causal for a small algorithm delay, while the noise branch adopts a non-causal design to improve the noise modeling accuracy and better assist the optimization of the speech branch. Therefore, UniInterNet also offers great advantages over existing bidirectional interactive speech and noise modeling-based benchmarks in terms of inference delay. Experimental results demonstrate the effectiveness of our design: the proposed UniInterNet maintains performance comparable to that of the traditional dual-branch bidirectional interactive methods while significantly reducing the model inference complexity. Moreover, we find that the unidirectional information interaction between the two branches is necessary at all processing stages, including high-level feature extraction at the encoder, sequence modeling analysis at the recurrent module, and target mask estimation at the decoder. Removing the unidirectional interaction module at any one of the three parts results in performance degradation. Among the three parts, the information interaction at the decoders is the most important. Finally, the proposed UniInterNet outperforms previous advanced methods on both the VoiceBank+DEMAND and DNS-Challenge datasets.
In the future, we plan to integrate other related tasks into this proposed framework, such as speech dereverberation, declipping, and super-resolution. This expansion is motivated by research findings indicating that TF domain speech processing algorithms based on spectrum mask prediction remain effective for the related speech processing tasks.

Author Contributions

Conceptualization, Y.Z. and H.Z.; methodology, Y.Z. and H.Z.; software, Y.Z. and H.Z.; validation, Y.Z.; formal analysis, Y.Z.; investigation, Y.Z. and H.Z.; resources, J.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z., H.Z. and J.Z.; visualization, Y.Z.; supervision, H.Z. and J.Z.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the special funds of Shenzhen Science and Technology Innovation Commission, China (No. CJGJZD20220517141400002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at [https://doi.org/10.21437/SSW.2016-24 accessed on 2 September 2024] and [https://doi.org/10.21437/Interspeech.2020-3038 accessed on 20 September 2024], reference number [33,34].

Conflicts of Interest

Author Huanbin Zou was employed by the company Xiaohongshu Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1 illustrates the standard deviation data for Table 3.
Table A1. The standard deviation data for Table 3.
| Metric | Model | 2.5 dB | 7.5 dB | 12.5 dB | 17.5 dB |
|---|---|---|---|---|---|
| WB-PESQ | noisy | 0.43 | 0.59 | 0.70 | 0.71 |
| WB-PESQ | SiNet | 0.67 | 0.63 | 0.55 | 0.51 |
| WB-PESQ | BiInterNet | 0.70 | 0.64 | 0.54 | 0.48 |
| WB-PESQ | UniInterNet | 0.69 | 0.62 | 0.53 | 0.46 |
| CSIG | noisy | 0.66 | 0.71 | 0.71 | 0.68 |
| CSIG | SiNet | 0.73 | 0.59 | 0.49 | 0.38 |
| CSIG | BiInterNet | 0.64 | 0.54 | 0.41 | 0.32 |
| CSIG | UniInterNet | 0.66 | 0.52 | 0.42 | 0.31 |
| CBAK | noisy | 0.31 | 0.41 | 0.48 | 0.48 |
| CBAK | SiNet | 0.53 | 0.48 | 0.42 | 0.38 |
| CBAK | BiInterNet | 0.50 | 0.44 | 0.38 | 0.34 |
| CBAK | UniInterNet | 0.51 | 0.45 | 0.40 | 0.35 |
| COVL | noisy | 0.54 | 0.65 | 0.71 | 0.72 |
| COVL | SiNet | 0.72 | 0.62 | 0.54 | 0.49 |
| COVL | BiInterNet | 0.68 | 0.60 | 0.49 | 0.43 |
| COVL | UniInterNet | 0.68 | 0.58 | 0.49 | 0.41 |

References

  1. Défossez, A.; Synnaeve, G.; Adi, Y. Real Time Speech Enhancement in the Waveform Domain. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 3291–3295. [Google Scholar] [CrossRef]
  2. Pandey, A.; Wang, D. Densely Connected Neural Network with Dilated Convolutions for Real-Time Speech Enhancement in the Time Domain. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6629–6633. [Google Scholar] [CrossRef]
  3. Kong, Z.; Ping, W.; Dantrey, A.; Catanzaro, B. Speech Denoising in the Waveform Domain With Self-Attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 22–27 May 2022; pp. 7867–7871. [Google Scholar] [CrossRef]
  4. Tan, K.; Wang, D. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3229–3233. [Google Scholar] [CrossRef]
  5. Choi, H.S.; Kim, J.; Huh, J.; Kim, A.; Ha, J.W.; Lee, K. Phase-Aware Speech Enhancement with Deep Complex U-Net. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–20. [Google Scholar]
  6. Fu, S.; Liao, C.; Tsao, Y.; Lin, S. MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement. In Proceedings of the International Conference on Machine Learning and Machine Intelligence, Jakarta, Indonesia, 18–20 September 2019; pp. 2031–2041. [Google Scholar]
  7. Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 2472–2476. [Google Scholar] [CrossRef]
  8. Li, A.; Liu, W.; Zheng, C.; Fan, C.; Li, X. Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1829–1843. [Google Scholar] [CrossRef]
  9. Yan, X.; Yang, Y.; Guo, Z.; Peng, L.; Xie, L. The NPU-Elevoc Personalized Speech Enhancement System for Icassp2023 DNS Challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–2. [Google Scholar] [CrossRef]
  10. Lu, Y.X.; Ai, Y.; Ling, Z.H. MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 3834–3838. [Google Scholar]
  11. Park, S.R.; Lee, J.W. A Fully Convolutional Neural Network for Speech Enhancement. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1993–1997. [Google Scholar] [CrossRef]
  12. Gao, T.; Du, J.; Dai, L.R.; Lee, C.H. Densely Connected Progressive Learning for LSTM-Based Speech Enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5054–5058. [Google Scholar] [CrossRef]
  13. Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based Metric GAN for Speech Enhancement. In Proceedings of the Proceedings Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 936–940. [Google Scholar] [CrossRef]
  14. Yu, G.; Li, A.; Wang, H.; Wang, Y.; Ke, Y.; Zheng, C. DBT-Net: Dual-Branch Federative Magnitude and Phase Estimation with Attention-in-Attention Transformer for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2629–2644. [Google Scholar] [CrossRef]
  15. Tan, K.; Wang, D. Learning Complex Spectral Mapping with Gated Convolutional Recurrent Networks for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 380–390. [Google Scholar] [CrossRef] [PubMed]
  16. Le, X.; Chen, H.; Chen, K.; Lu, J. DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 2811–2815. [Google Scholar] [CrossRef]
  17. Zhao, S.; Ma, B.; Watcharasupat, K.N.; Gan, W.S. FRCRN: Boosting Feature Representation Using Frequency Recurrence for Monaural Speech Enhancement. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9281–9285. [Google Scholar] [CrossRef]
  18. Liu, J.; Zhang, X. ICCRN: Inplace Cepstral Convolutional Recurrent Neural Network for Monaural Speech Enhancement. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  19. Zhao, C.; He, S.; Zhang, X. SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10506–10510. [Google Scholar] [CrossRef]
  20. Li, A.; Zheng, C.; Zhang, L.; Li, X. Glance and gaze: A collaborative learning framework for single-channel speech enhancement. Appl. Acoust. 2022, 187, 108499. [Google Scholar] [CrossRef]
  21. Fan, C.; Zhang, H.; Li, A.; Xiang, W.; Zheng, C.; Lv, Z.; Wu, X. CompNet: Complementary network for single-channel speech enhancement. Neural Netw. 2023, 168, 508–517. [Google Scholar] [CrossRef] [PubMed]
  22. Zhang, Y.; Zou, H.; Zhu, J. A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12587–12591. [Google Scholar] [CrossRef]
  23. Hao, X.; Su, X.; Horaud, R.; Li, X. Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6633–6637. [Google Scholar] [CrossRef]
  24. Chen, J.; Wang, Z.; Tuo, D.; Wu, Z.; Kang, S.; Meng, H. FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 22–27 May 2022; pp. 7857–7861. [Google Scholar] [CrossRef]
  25. Chen, Z.; Zhang, P. Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement. In Proceedings of the Proceedings Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 921–925. [Google Scholar] [CrossRef]
  26. Chen, J.; Rao, W.; Wang, Z.; Lin, J.; Wu, Z.; Wang, Y.; Shang, S.; Meng, H. Inter-Subnet: Speech Enhancement with Subband Interaction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  27. Hu, Y.; Chen, C.; Li, R.; Zhu, Q.; Chng, E.S. Noise-aware Speech Enhancement using Diffusion Probabilistic Model. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 2225–2229. [Google Scholar] [CrossRef]
  28. Odelowo, B.O.; Anderson, D.V. A Study of Training Targets for Deep Neural Network-Based Speech Enhancement Using Noise Prediction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5409–5413. [Google Scholar] [CrossRef]
  29. Zheng, C.; Peng, X.; Zhang, Y.; Srinivasan, S.; Lu, Y. Interactive Speech and Noise Modeling for Speech Enhancement. Proc. AAAI Conf. Artif. Intell. 2021, 35, 14549–14557. [Google Scholar] [CrossRef]
  30. Xiang, X.; Zhang, X.; Chen, H. Two-Stage Learning and Fusion Network with Noise Aware for Time-Domain Monaural Speech Enhancement. IEEE Signal Process. Lett. 2021, 28, 1754–1758. [Google Scholar] [CrossRef]
  31. Zhou, A.; Zhang, W.; Li, X.; Xu, G.; Zhang, B.; Ma, Y.; Song, J. A Novel Noise-Aware Deep Learning Model for Underwater Acoustic Denoising. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  32. Ahmed, N.; Natarajan, T.; Rao, K. Discrete Cosine Transform. IEEE Trans. Comput. 1974, C-23, 90–93. [Google Scholar] [CrossRef]
  33. Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the ISCA Workshop on Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar]
  34. Reddy, C.K.; Gopal, V.; Cutler, R.; Beyrami, E.; Cheng, R.; Dubey, H.; Matusevych, S.; Aichner, R.; Aazami, A.; Braun, S.; et al. The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results. In Proceedings of the Proceedings Interspeech, Shanghai, China, 25–29 October 2020; pp. 2492–2496. [Google Scholar] [CrossRef]
  35. Srinivasan, S.; Roman, N.; Wang, D. Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 2006, 48, 1486–1501. [Google Scholar] [CrossRef]
  36. Narayanan, A.; Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7092–7096. [Google Scholar] [CrossRef]
  37. Geng, C.; Wang, L. End-to-End Speech Enhancement Based on Discrete Cosine Transform. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 27–29 June 2020; pp. 379–383. [Google Scholar] [CrossRef]
  38. Luo, Y.; Chen, Z.; Yoshioka, T. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 46–50. [Google Scholar] [CrossRef]
  39. Veaux, C.; Yamagishi, J.; King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 25–27 November 2013; pp. 1–4. [Google Scholar] [CrossRef]
  40. Thiemann, J.; Ito, N.; Vincent, E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings. Proc. Meet. Acoust. 2013, 19, 035081. [Google Scholar] [CrossRef]
  41. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
  42. Pirker, G.; Wohlmayr, M.; Petrik, S.; Pernkopf, F. A pitch tracking corpus with evaluation on multipitch tracking scenario. In Proceedings of the Interspeech, Florence, Italy, 27–31 August 2011; pp. 1509–1512. [Google Scholar] [CrossRef]
  43. Recommendation, I. Wideband Extension to Recommendation p. 862 for the Assessment of Wideband Telephone Networks and Speech Codecs; Technical Report Standard 862.2; International Telecommunication Union: Geneva, Switzerland, 2007. [Google Scholar]
  44. Hu, Y.; Loizou, P.C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238. [Google Scholar] [CrossRef]
  45. Rix, A.; Beerends, J.; Hollier, M.; Hekstra, A. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752. [Google Scholar] [CrossRef]
  46. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
  47. Roux, J.L.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR—Half-baked or Well Done? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 626–630. [Google Scholar] [CrossRef]
  48. Chen, J.; Rao, W.; Wang, Z.; Wu, Z.; Wang, Y.; Yu, T.; Shang, S.; Meng, H. Speech Enhancement with Fullband-Subband Cross-Attention Network. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 976–980. [Google Scholar] [CrossRef]
  49. Li, Y.; Jin, X.; Tong, L.; Zhang, L.M.; Yao, Y.Q.; Yan, H. A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior. Appl. Acoust. 2024, 221, 109997. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of unidirectional information interaction-based dual-branch network (UniInterNet).
Figure 2. (a) The detail of the two-dimensional convolutional (Conv2d) block. (b) The detail of the two-dimensional deconvolutional (DeConv2d) block.
Figure 3. The diagram of time–frequency sequence modeling (TFSM) block. During temporal sequence modeling, a causal gated recurrent unit (GRU) layer is employed in the speech branch, while a non-causal bidirectional GRU (BiGRU) layer is utilized in the noise branch.
Figure 4. Structure of the unidirectional interaction module.
Figure 5. Visualization of the spectrum of the following: (a) noisy speech; (b) clean speech; (c) enhanced speech by SiNet; (d) enhanced speech by BiInterNet; (e) enhanced speech by UniInterNet-CausalNoise; (f) enhanced speech by UniInterNet. The noise type is open area cafeteria noise.
Figure 6. Visualization of the spectrum of the following: (a) noisy speech; (b) clean speech; (c) enhanced speech by UniInterNet w/o EncInter; (d) enhanced speech by UniInterNet w/o RecInter; (e) enhanced speech by UniInterNet w/o DecInter; (f) enhanced speech by UniInterNet. The noise type is public square noise.
Table 1. Ablation study with respect to the proposed unidirectional information interaction scheme.
| Model | Causal | Param. (M) | MACs (G/s) | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|
| noisy | – | – | – | 1.97 ± 0.75 | 3.35 ± 0.87 | 2.44 ± 0.67 | 2.63 ± 0.83 |
| SiNet | Yes | 1.65 | 6.24 | 2.90 ± 0.72 | 4.17 ± 0.71 | 3.47 ± 0.57 | 3.55 ± 0.75 |
| BiInterNet | Yes | 7.06 | 24.61 | 3.10 ± 0.71 | 4.37 ± 0.59 | 3.59 ± 0.52 | 3.76 ± 0.67 |
| UniInterNet-CausalNoise | Yes | 1.65 | 6.24 | 3.00 ± 0.66 | 4.27 ± 0.62 | 3.50 ± 0.55 | 3.65 ± 0.67 |
| UniInterNet | Yes | 1.65 | 6.24 | 3.05 ± 0.71 | 4.32 ± 0.60 | 3.57 ± 0.55 | 3.70 ± 0.67 |
Table 2. Ablation study on the effects of information interaction at different network parts.
| Model | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|
| noisy | 1.97 ± 0.75 | 3.35 ± 0.87 | 2.44 ± 0.67 | 2.63 ± 0.83 |
| UniInterNet | 3.05 ± 0.71 | 4.32 ± 0.60 | 3.57 ± 0.55 | 3.70 ± 0.67 |
| w/o EncInter | 2.97 ± 0.72 | 4.22 ± 0.60 | 3.49 ± 0.58 | 3.60 ± 0.68 |
| w/o RecInter | 2.99 ± 0.72 | 4.28 ± 0.64 | 3.51 ± 0.56 | 3.65 ± 0.71 |
| w/o DecInter | 2.94 ± 0.67 | 4.21 ± 0.60 | 3.47 ± 0.57 | 3.59 ± 0.66 |
Table 3. Performance evaluation results over different signal-to-noise ratios (SNRs).
| Metric | Model | 2.5 dB | 7.5 dB | 12.5 dB | 17.5 dB |
|---|---|---|---|---|---|
| WB-PESQ | noisy | 1.42 | 1.76 | 2.10 | 2.60 |
| WB-PESQ | SiNet | 2.31 | 2.80 | 3.08 | 3.43 |
| WB-PESQ | BiInterNet | 2.54 | 3.05 | 3.27 | 3.58 |
| WB-PESQ | UniInterNet | 2.46 | 2.92 | 3.21 | 3.57 |
| CSIG | noisy | 2.62 | 3.14 | 3.59 | 4.05 |
| CSIG | SiNet | 3.50 | 4.09 | 4.40 | 4.67 |
| CSIG | BiInterNet | 3.87 | 4.33 | 4.53 | 4.75 |
| CSIG | UniInterNet | 3.80 | 4.25 | 4.49 | 4.73 |
| CBAK | noisy | 1.77 | 2.21 | 2.63 | 3.17 |
| CBAK | SiNet | 2.98 | 3.37 | 3.61 | 3.92 |
| CBAK | BiInterNet | 3.15 | 3.52 | 3.72 | 3.98 |
| CBAK | UniInterNet | 3.09 | 3.46 | 3.71 | 4.01 |
| COVL | noisy | 1.96 | 2.42 | 2.83 | 3.33 |
| COVL | SiNet | 2.88 | 3.45 | 3.76 | 4.10 |
| COVL | BiInterNet | 3.20 | 3.70 | 3.92 | 4.21 |
| COVL | UniInterNet | 3.13 | 3.60 | 3.87 | 4.19 |
Table 4. Performance comparison results with previous methods based on bidirectional interactive speech and noise prediction, specifically SN-Net [29].
| Model | Causal | Param. (M) | MACs (G/s) | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|
| noisy | – | – | – | 1.97 | 3.35 | 2.44 | 2.63 |
| SN-Net (original paper) | No | 7.22 | 30.51 | 3.12 | 4.39 | 3.60 | 3.77 |
| SN-Net (our reproduction) | No | 7.22 | 30.51 | 3.13 | 4.33 | 3.63 | 3.76 |
| BiInterNet | Yes | 7.06 | 24.61 | 3.10 | 4.37 | 3.59 | 3.76 |
| UniInterNet | Yes | 1.65 | 6.24 | 3.05 | 4.32 | 3.57 | 3.70 |
Table 5. Performance comparison results with existing advanced systems on the VoiceBank+ DEMAND dataset.
| Model | Causal | Input | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|
| noisy | – | – | 1.97 | 3.35 | 2.44 | 2.63 |
| DCCRN [7] |  | Complex | 2.68 | 3.88 | 3.18 | 3.27 |
| CompNet [21] |  | Time+Complex | 2.90 | 4.16 | 3.37 | 3.53 |
| LFSFNet [25] |  | Magnitude | 2.91 | – | – | – |
| CTS-Net [8] |  | Complex | 2.92 | 4.25 | 3.46 | 3.59 |
| DEMUCS [1] |  | Time | 2.93 | 4.22 | 3.25 | 3.52 |
| GaGNet [20] |  | Complex | 2.94 | 4.26 | 3.45 | 3.59 |
| NASE [27] |  | Complex | 3.01 | – | – | – |
| FDFNet [22] |  | STDCT | 3.05 | 4.23 | 3.55 | 3.65 |
| UniInterNet (ours) | Yes | STDCT | 3.05 | 4.32 | 3.57 | 3.70 |
Table 6. Performance comparison results with existing advanced systems on the DNS-Challenge dataset.
| Model | Causal | Input | WB-PESQ | NB-PESQ | STOI (%) | SI-SNR (dB) |
|---|---|---|---|---|---|---|
| noisy | – | – | 1.58 | 2.45 | 91.52 | 9.07 |
| NSNet [34] |  | Log-power | 2.15 | 2.87 | 94.47 | 15.61 |
| SICRN [19] |  | Complex | 2.62 | 3.23 | 95.83 | 16.00 |
| CTS-Net [8] |  | Complex | 2.94 | 3.42 | 96.66 | 17.99 |
| FullSubNet+ [24] |  | Complex+Magnitude | 2.98 | 3.50 | 96.69 | 18.34 |
| Inter-SubNet [26] |  | Magnitude | 3.00 | 3.50 | 96.61 | 18.05 |
| FS-CANet [48] |  | Magnitude | 3.02 | 3.51 | 96.74 | 18.08 |
| SEnet+Ran-net [49] |  | Magnitude | 3.16 | 3.57 | – | – |
| GaGNet [20] |  | Complex | 3.17 | 3.56 | 97.13 | 18.91 |
| UniInterNet (ours) | Yes | STDCT | 3.26 | 3.59 | 97.16 | 18.28 |
