Speech Enhancement Based on Unidirectional Interactive Noise Modeling Assistance
Figure 1. Overall architecture of the unidirectional information interaction-based dual-branch network (UniInterNet).
Figure 2. (a) Detail of the two-dimensional convolutional (Conv2d) block; (b) detail of the two-dimensional deconvolutional (DeConv2d) block.
Figure 3. Diagram of the time–frequency sequence modeling (TFSM) block. During temporal sequence modeling, a causal gated recurrent unit (GRU) layer is employed in the speech branch, while a non-causal bidirectional GRU (BiGRU) layer is utilized in the noise branch.
Figure 4. Structure of the unidirectional interaction module.
Figure 5. Spectrogram visualization of (a) noisy speech; (b) clean speech; (c) speech enhanced by SiNet; (d) speech enhanced by BiInterNet; (e) speech enhanced by UniInterNet-CausalNoise; (f) speech enhanced by UniInterNet. The noise type is open-area cafeteria noise.
Figure 6. Spectrogram visualization of (a) noisy speech; (b) clean speech; (c) speech enhanced by UniInterNet w/o EncInter; (d) speech enhanced by UniInterNet w/o RecInter; (e) speech enhanced by UniInterNet w/o DecInter; (f) speech enhanced by UniInterNet. The noise type is public square noise.
Abstract
1. Introduction
2. Materials and Methods
2.1. Materials
- Speech branch:
- Noise branch:
2.2. Methods
2.2.1. Encoder and Decoder
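This excerpt names the Conv2d and DeConv2d blocks (Figure 2) but does not spell out their internals. Below is a minimal PyTorch-style sketch assuming the common convolution + batch normalization + PReLU composition found in convolutional encoder/decoder speech enhancers; the `Conv2dBlock`/`DeConv2dBlock` names, kernel sizes, and strides are illustrative placeholders, not the paper's exact settings.

```python
import torch.nn as nn

class Conv2dBlock(nn.Module):
    """Hypothetical encoder block: conv + batch norm + PReLU.
    Kernel/stride values are placeholders, not the paper's settings."""
    def __init__(self, in_ch, out_ch, kernel=(2, 3), stride=(1, 2)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):  # x: (batch, channel, time, freq)
        return self.act(self.norm(self.conv(x)))

class DeConv2dBlock(nn.Module):
    """Mirror decoder block using a transposed convolution for upsampling."""
    def __init__(self, in_ch, out_ch, kernel=(2, 3), stride=(1, 2)):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel, stride)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.norm(self.deconv(x)))
```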
2.2.2. Time–Frequency Sequence Modeling Block
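Per the Figure 3 caption, the TFSM block models the temporal sequence with a causal GRU in the speech branch and a non-causal BiGRU in the noise branch. The sketch below illustrates only that causal/non-causal split in PyTorch; the feature dimension, residual connection, and the `TemporalModeling` wrapper are assumptions for illustration, not the paper's exact design.

```python
import torch.nn as nn

class TemporalModeling(nn.Module):
    """Temporal half of a TFSM-style block (cf. Figure 3): a causal GRU for
    the speech branch, a non-causal BiGRU for the noise branch. Feature
    sizes and the residual projection are illustrative placeholders."""
    def __init__(self, feat_dim=64, causal=True):
        super().__init__()
        # BiGRU halves the hidden size so both variants output feat_dim.
        hidden = feat_dim if causal else feat_dim // 2
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True,
                          bidirectional=not causal)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, x):  # x: (batch * freq, time, feat_dim)
        y, _ = self.rnn(x)       # causal GRU sees only past frames;
        return x + self.proj(y)  # BiGRU also attends to future frames

speech_temporal = TemporalModeling(causal=True)   # strictly causal
noise_temporal = TemporalModeling(causal=False)   # may look ahead
```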
2.2.3. Unidirectional Interaction Module
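The excerpt gives only the module's name (Figure 4), so the following is a speculative reconstruction in the spirit of SN-Net's interaction module [29]: a sigmoid-gated projection of the noise-branch features is added to the speech-branch features, and nothing flows back, which is what makes the interaction unidirectional. The channel count and layer choices are placeholders.

```python
import torch
import torch.nn as nn

class UnidirectionalInteraction(nn.Module):
    """Sketch of one-way interaction: noise-branch features, gated by a
    learned mask, are injected into the speech branch. Not the paper's
    exact design; an SN-Net-style [29] illustration."""
    def __init__(self, channels=64):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, speech_feat, noise_feat):
        # The mask is predicted from both branches, but only the speech
        # branch receives the gated noise information.
        gate = self.mask(torch.cat([speech_feat, noise_feat], dim=1))
        return speech_feat + gate * noise_feat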
2.2.4. Loss Function
3. Experimental Setup
3.1. Datasets
3.1.1. VoiceBank+DEMAND
3.1.2. DNS-Challenge
3.2. Training Configurations
3.3. Evaluation Metrics
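Of the metrics reported below, SI-SNR has a compact closed form following Le Roux et al. [47]; WB-PESQ, CSIG, CBAK, COVL, and STOI are computed with their standard reference tools. As a minimal sketch (zero-meaning and the epsilon guard are conventional choices, not taken from the paper):

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB, following Le Roux et al. [47]."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to remove any gain mismatch.
    scale = (estimate * target).sum(-1, keepdim=True) / (
        (target ** 2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(
        (s_target ** 2).sum(-1) / ((e_noise ** 2).sum(-1) + eps))
```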
3.4. Baselines
3.4.1. SiNet
3.4.2. BiInterNet
4. Results and Discussion
4.1. Ablation Study
4.1.1. Performance Gains of Unidirectional Information Interaction Scheme
4.1.2. Effects of Information Interaction at Different Parts of the Network
4.2. Performance Analysis over Different Signal-to-Noise Ratios (SNRs)
4.3. Performance Comparison with Previous Methods Based on Bidirectional Interactive Speech and Noise Prediction
4.4. Performance Comparison with Existing Advanced Systems
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
|  | WB-PESQ |  |  |  | CSIG |  |  |  | CBAK |  |  |  | COVL |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNR (dB) | 2.5 | 7.5 | 12.5 | 17.5 | 2.5 | 7.5 | 12.5 | 17.5 | 2.5 | 7.5 | 12.5 | 17.5 | 2.5 | 7.5 | 12.5 | 17.5 |
| noisy | 0.43 | 0.59 | 0.70 | 0.71 | 0.66 | 0.71 | 0.71 | 0.68 | 0.31 | 0.41 | 0.48 | 0.48 | 0.54 | 0.65 | 0.71 | 0.72 |
| SiNet | 0.67 | 0.63 | 0.55 | 0.51 | 0.73 | 0.59 | 0.49 | 0.38 | 0.53 | 0.48 | 0.42 | 0.38 | 0.72 | 0.62 | 0.54 | 0.49 |
| BiInterNet | 0.70 | 0.64 | 0.54 | 0.48 | 0.64 | 0.54 | 0.41 | 0.32 | 0.50 | 0.44 | 0.38 | 0.34 | 0.68 | 0.60 | 0.49 | 0.43 |
| UniInterNet | 0.69 | 0.62 | 0.53 | 0.46 | 0.66 | 0.52 | 0.42 | 0.31 | 0.51 | 0.45 | 0.40 | 0.35 | 0.68 | 0.58 | 0.49 | 0.41 |
References
- Défossez, A.; Synnaeve, G.; Adi, Y. Real Time Speech Enhancement in the Waveform Domain. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 3291–3295. [Google Scholar] [CrossRef]
- Pandey, A.; Wang, D. Densely Connected Neural Network with Dilated Convolutions for Real-Time Speech Enhancement in the Time Domain. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6629–6633. [Google Scholar] [CrossRef]
- Kong, Z.; Ping, W.; Dantrey, A.; Catanzaro, B. Speech Denoising in the Waveform Domain With Self-Attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 22–27 May 2022; pp. 7867–7871. [Google Scholar] [CrossRef]
- Tan, K.; Wang, D. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3229–3233. [Google Scholar] [CrossRef]
- Choi, H.S.; Kim, J.; Huh, J.; Kim, A.; Ha, J.W.; Lee, K. Phase-Aware Speech Enhancement with Deep Complex U-Net. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–20. [Google Scholar]
- Fu, S.; Liao, C.; Tsao, Y.; Lin, S. MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2031–2041. [Google Scholar]
- Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 2472–2476. [Google Scholar] [CrossRef]
- Li, A.; Liu, W.; Zheng, C.; Fan, C.; Li, X. Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1829–1843. [Google Scholar] [CrossRef]
- Yan, X.; Yang, Y.; Guo, Z.; Peng, L.; Xie, L. The NPU-Elevoc Personalized Speech Enhancement System for ICASSP 2023 DNS Challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–2. [Google Scholar] [CrossRef]
- Lu, Y.X.; Ai, Y.; Ling, Z.H. MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 3834–3838. [Google Scholar]
- Park, S.R.; Lee, J.W. A Fully Convolutional Neural Network for Speech Enhancement. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1993–1997. [Google Scholar] [CrossRef]
- Gao, T.; Du, J.; Dai, L.R.; Lee, C.H. Densely Connected Progressive Learning for LSTM-Based Speech Enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5054–5058. [Google Scholar] [CrossRef]
- Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based Metric GAN for Speech Enhancement. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 936–940. [Google Scholar] [CrossRef]
- Yu, G.; Li, A.; Wang, H.; Wang, Y.; Ke, Y.; Zheng, C. DBT-Net: Dual-Branch Federative Magnitude and Phase Estimation with Attention-in-Attention Transformer for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2629–2644. [Google Scholar] [CrossRef]
- Tan, K.; Wang, D. Learning Complex Spectral Mapping with Gated Convolutional Recurrent Networks for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 380–390. [Google Scholar] [CrossRef] [PubMed]
- Le, X.; Chen, H.; Chen, K.; Lu, J. DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 2811–2815. [Google Scholar] [CrossRef]
- Zhao, S.; Ma, B.; Watcharasupat, K.N.; Gan, W.S. FRCRN: Boosting Feature Representation Using Frequency Recurrence for Monaural Speech Enhancement. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9281–9285. [Google Scholar] [CrossRef]
- Liu, J.; Zhang, X. ICCRN: Inplace Cepstral Convolutional Recurrent Neural Network for Monaural Speech Enhancement. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Zhao, C.; He, S.; Zhang, X. SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10506–10510. [Google Scholar] [CrossRef]
- Li, A.; Zheng, C.; Zhang, L.; Li, X. Glance and gaze: A collaborative learning framework for single-channel speech enhancement. Appl. Acoust. 2022, 187, 108499. [Google Scholar] [CrossRef]
- Fan, C.; Zhang, H.; Li, A.; Xiang, W.; Zheng, C.; Lv, Z.; Wu, X. CompNet: Complementary network for single-channel speech enhancement. Neural Netw. 2023, 168, 508–517. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Zou, H.; Zhu, J. A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12587–12591. [Google Scholar] [CrossRef]
- Hao, X.; Su, X.; Horaud, R.; Li, X. Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6633–6637. [Google Scholar] [CrossRef]
- Chen, J.; Wang, Z.; Tuo, D.; Wu, Z.; Kang, S.; Meng, H. FullSubNet+: Channel Attention Fullsubnet with Complex Spectrograms for Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 22–27 May 2022; pp. 7857–7861. [Google Scholar] [CrossRef]
- Chen, Z.; Zhang, P. Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 921–925. [Google Scholar] [CrossRef]
- Chen, J.; Rao, W.; Wang, Z.; Lin, J.; Wu, Z.; Wang, Y.; Shang, S.; Meng, H. Inter-Subnet: Speech Enhancement with Subband Interaction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Hu, Y.; Chen, C.; Li, R.; Zhu, Q.; Chng, E.S. Noise-aware Speech Enhancement using Diffusion Probabilistic Model. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 2225–2229. [Google Scholar] [CrossRef]
- Odelowo, B.O.; Anderson, D.V. A Study of Training Targets for Deep Neural Network-Based Speech Enhancement Using Noise Prediction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5409–5413. [Google Scholar] [CrossRef]
- Zheng, C.; Peng, X.; Zhang, Y.; Srinivasan, S.; Lu, Y. Interactive Speech and Noise Modeling for Speech Enhancement. Proc. AAAI Conf. Artif. Intell. 2021, 35, 14549–14557. [Google Scholar] [CrossRef]
- Xiang, X.; Zhang, X.; Chen, H. Two-Stage Learning and Fusion Network with Noise Aware for Time-Domain Monaural Speech Enhancement. IEEE Signal Process. Lett. 2021, 28, 1754–1758. [Google Scholar] [CrossRef]
- Zhou, A.; Zhang, W.; Li, X.; Xu, G.; Zhang, B.; Ma, Y.; Song, J. A Novel Noise-Aware Deep Learning Model for Underwater Acoustic Denoising. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
- Ahmed, N.; Natarajan, T.; Rao, K. Discrete Cosine Transform. IEEE Trans. Comput. 1974, C-23, 90–93. [Google Scholar] [CrossRef]
- Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the ISCA Speech Synthesis Workshop (SSW), Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar]
- Reddy, C.K.; Gopal, V.; Cutler, R.; Beyrami, E.; Cheng, R.; Dubey, H.; Matusevych, S.; Aichner, R.; Aazami, A.; Braun, S.; et al. The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 2492–2496. [Google Scholar] [CrossRef]
- Srinivasan, S.; Roman, N.; Wang, D. Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 2006, 48, 1486–1501. [Google Scholar] [CrossRef]
- Narayanan, A.; Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7092–7096. [Google Scholar] [CrossRef]
- Geng, C.; Wang, L. End-to-End Speech Enhancement Based on Discrete Cosine Transform. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 27–29 June 2020; pp. 379–383. [Google Scholar] [CrossRef]
- Luo, Y.; Chen, Z.; Yoshioka, T. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 46–50. [Google Scholar] [CrossRef]
- Veaux, C.; Yamagishi, J.; King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Gurgaon, India, 25–27 November 2013; pp. 1–4. [Google Scholar] [CrossRef]
- Thiemann, J.; Ito, N.; Vincent, E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings. Proc. Meet. Acoust. 2013, 19, 035081. [Google Scholar] [CrossRef]
- Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
- Pirker, G.; Wohlmayr, M.; Petrik, S.; Pernkopf, F. A pitch tracking corpus with evaluation on multipitch tracking scenario. In Proceedings of the Interspeech, Florence, Italy, 27–31 August 2011; pp. 1509–1512. [Google Scholar] [CrossRef]
- ITU-T. Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs; Recommendation P.862.2; International Telecommunication Union: Geneva, Switzerland, 2007. [Google Scholar]
- Hu, Y.; Loizou, P.C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238. [Google Scholar] [CrossRef]
- Rix, A.; Beerends, J.; Hollier, M.; Hekstra, A. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752. [Google Scholar] [CrossRef]
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
- Roux, J.L.; Wisdom, S.; Erdogan, H.; Hershey, J.R. SDR—Half-baked or Well Done? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 626–630. [Google Scholar] [CrossRef]
- Chen, J.; Rao, W.; Wang, Z.; Wu, Z.; Wang, Y.; Yu, T.; Shang, S.; Meng, H. Speech Enhancement with Fullband-Subband Cross-Attention Network. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 976–980. [Google Scholar] [CrossRef]
- Li, Y.; Jin, X.; Tong, L.; Zhang, L.M.; Yao, Y.Q.; Yan, H. A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior. Appl. Acoust. 2024, 221, 109997. [Google Scholar] [CrossRef]
| Model | Causal | Param. (M) | MACs (G/s) | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|
| noisy | - | - | - | 1.97 ± 0.75 | 3.35 ± 0.87 | 2.44 ± 0.67 | 2.63 ± 0.83 |
| SiNet | ✔ | 1.65 | 6.24 | 2.90 ± 0.72 | 4.17 ± 0.71 | 3.47 ± 0.57 | 3.55 ± 0.75 |
| BiInterNet | ✔ | 7.06 | 24.61 | 3.10 ± 0.71 | 4.37 ± 0.59 | 3.59 ± 0.52 | 3.76 ± 0.67 |
| UniInterNet-CausalNoise | ✔ | 1.65 | 6.24 | 3.00 ± 0.66 | 4.27 ± 0.62 | 3.50 ± 0.55 | 3.65 ± 0.67 |
| UniInterNet | ✔ | 1.65 | 6.24 | 3.05 ± 0.71 | 4.32 ± 0.60 | 3.57 ± 0.55 | 3.70 ± 0.67 |
| Model | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|
| noisy | 1.97 ± 0.75 | 3.35 ± 0.87 | 2.44 ± 0.67 | 2.63 ± 0.83 |
| UniInterNet | 3.05 ± 0.71 | 4.32 ± 0.60 | 3.57 ± 0.55 | 3.70 ± 0.67 |
| w/o EncInter | 2.97 ± 0.72 | 4.22 ± 0.60 | 3.49 ± 0.58 | 3.60 ± 0.68 |
| w/o RecInter | 2.99 ± 0.72 | 4.28 ± 0.64 | 3.51 ± 0.56 | 3.65 ± 0.71 |
| w/o DecInter | 2.94 ± 0.67 | 4.21 ± 0.60 | 3.47 ± 0.57 | 3.59 ± 0.66 |
|  | WB-PESQ |  |  |  | CSIG |  |  |  | CBAK |  |  |  | COVL |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNR (dB) | 2.5 | 7.5 | 12.5 | 17.5 | 2.5 | 7.5 | 12.5 | 17.5 | 2.5 | 7.5 | 12.5 | 17.5 | 2.5 | 7.5 | 12.5 | 17.5 |
| noisy | 1.42 | 1.76 | 2.10 | 2.60 | 2.62 | 3.14 | 3.59 | 4.05 | 1.77 | 2.21 | 2.63 | 3.17 | 1.96 | 2.42 | 2.83 | 3.33 |
| SiNet | 2.31 | 2.80 | 3.08 | 3.43 | 3.50 | 4.09 | 4.40 | 4.67 | 2.98 | 3.37 | 3.61 | 3.92 | 2.88 | 3.45 | 3.76 | 4.10 |
| BiInterNet | 2.54 | 3.05 | 3.27 | 3.58 | 3.87 | 4.33 | 4.53 | 4.75 | 3.15 | 3.52 | 3.72 | 3.98 | 3.20 | 3.70 | 3.92 | 4.21 |
| UniInterNet | 2.46 | 2.92 | 3.21 | 3.57 | 3.80 | 4.25 | 4.49 | 4.73 | 3.09 | 3.46 | 3.71 | 4.01 | 3.13 | 3.60 | 3.87 | 4.19 |
| Model | Causal | Param. (M) | MACs (G/s) | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|
| noisy | - | - | - | 1.97 | 3.35 | 2.44 | 2.63 |
| SN-Net (original paper) | ✗ | 7.22 | 30.51 | 3.12 | 4.39 | 3.60 | 3.77 |
| SN-Net (our reproduction) | ✗ | 7.22 | 30.51 | 3.13 | 4.33 | 3.63 | 3.76 |
| BiInterNet | ✔ | 7.06 | 24.61 | 3.10 | 4.37 | 3.59 | 3.76 |
| UniInterNet | ✔ | 1.65 | 6.24 | 3.05 | 4.32 | 3.57 | 3.70 |
| Model | Causal | Input | WB-PESQ | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|
| noisy | - | - | 1.97 | 3.35 | 2.44 | 2.63 |
| DCCRN [7] | ✔ | Complex | 2.68 | 3.88 | 3.18 | 3.27 |
| CompNet [21] | ✔ | Time+Complex | 2.90 | 4.16 | 3.37 | 3.53 |
| LFSFNet [25] | ✔ | Magnitude | 2.91 | - | - | - |
| CTS-Net [8] | ✔ | Complex | 2.92 | 4.25 | 3.46 | 3.59 |
| DEMUCS [1] | ✔ | Time | 2.93 | 4.22 | 3.25 | 3.52 |
| GaGNet [20] | ✔ | Complex | 2.94 | 4.26 | 3.45 | 3.59 |
| NASE [27] | ✗ | Complex | 3.01 | - | - | - |
| FDFNet [22] | ✔ | STDCT | 3.05 | 4.23 | 3.55 | 3.65 |
| UniInterNet (ours) | ✔ | STDCT | 3.05 | 4.32 | 3.57 | 3.70 |
| Model | Causal | Input | WB-PESQ | NB-PESQ | STOI (%) | SI-SNR (dB) |
|---|---|---|---|---|---|---|
| noisy | - | - | 1.58 | 2.45 | 91.52 | 9.07 |
| NSNet [34] | ✔ | Log-power | 2.15 | 2.87 | 94.47 | 15.61 |
| SICRN [19] | ✔ | Complex | 2.62 | 3.23 | 95.83 | 16.00 |
| CTS-Net [8] | ✔ | Complex | 2.94 | 3.42 | 96.66 | 17.99 |
| FullSubNet+ [24] | ✔ | Complex+Magnitude | 2.98 | 3.50 | 96.69 | 18.34 |
| Inter-SubNet [26] | ✗ | Magnitude | 3.00 | 3.50 | 96.61 | 18.05 |
| FS-CANet [48] | ✔ | Magnitude | 3.02 | 3.51 | 96.74 | 18.08 |
| SEnet+Ran-net [49] | ✗ | Magnitude | 3.16 | 3.57 | - | - |
| GaGNet [20] | ✔ | Complex | 3.17 | 3.56 | 97.13 | 18.91 |
| UniInterNet (ours) | ✔ | STDCT | 3.26 | 3.59 | 97.16 | 18.28 |
Share and Cite
Zhang, Y.; Zou, H.; Zhu, J. Speech Enhancement Based on Unidirectional Interactive Noise Modeling Assistance. Appl. Sci. 2025, 15, 2919. https://doi.org/10.3390/app15062919