Abstract
DeepFilterNet2, a recently proposed real-time, low-complexity speech enhancement (SE) technique, has achieved state-of-the-art performance on many deep noise suppression tasks. This paper proposes a new approach, termed pDeepFilterNet2, that generalizes and improves the original DeepFilterNet2 for personalized speech enhancement (PSE) under multi-talker noisy and reverberant conditions. Three improvements are investigated. First, frame-wise speaker adaptation (FSA) is proposed to provide dynamic target-speaker cues, generalizing DeepFilterNet2 to pDeepFilterNet2. Second, we remove a redundant skip connection in the original DeepFilterNet2 and add a single layer of causal multi-head self-attention so that the model can aggregate global context information. Finally, a multi-domain loss function combining time- and frequency-domain losses is introduced to further improve PSE system performance. In addition, different types of target-speaker embedding are investigated to assess their effectiveness. Experiments on the DNS4 Challenge dataset show that the proposed pDeepFilterNet2 significantly outperforms the original DeepFilterNet2 across multiple evaluation metrics, and that it is competitive with our previously proposed sDPCCN, which was specifically designed for target speaker extraction.
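The multi-domain loss mentioned in the abstract can be sketched as a weighted sum of a time-domain waveform loss and a spectral-magnitude loss. The paper's back matter does not give the exact loss terms or weights used in pDeepFilterNet2, so the function names, the L1 choice for both terms, and the `alpha` weight below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram via framed real FFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_domain_loss(est, ref, alpha=0.5):
    """Illustrative multi-domain loss: weighted sum of a time-domain
    L1 loss and a frequency-domain (STFT magnitude) L1 loss."""
    time_loss = np.mean(np.abs(est - ref))
    freq_loss = np.mean(np.abs(stft_mag(est) - stft_mag(ref)))
    return alpha * time_loss + (1.0 - alpha) * freq_loss
```

The loss is zero when the estimate equals the reference and grows with both waveform and spectral mismatch, which is the intuition behind combining the two domains: the time term preserves phase/waveform fidelity while the magnitude term targets perceptually relevant spectral error.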

Data availability
The datasets used and/or analyzed during the current study are available from the first author on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62071302).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Guan, H., Wei, S. et al. Improving low-complexity and real-time DeepFilterNet2 for personalized speech enhancement. Int J Speech Technol 27, 299–306 (2024). https://doi.org/10.1007/s10772-024-10101-z