Improving low-complexity and real-time DeepFilterNet2 for personalized speech enhancement

  • Published:
International Journal of Speech Technology

Abstract

DeepFilterNet2, a recently proposed real-time and low-complexity speech enhancement (SE) technique, has shown state-of-the-art performance in many deep noise suppression tasks. This paper proposes a new approach, termed pDeepFilterNet2, to generalize and improve the original DeepFilterNet2 for personalized speech enhancement (PSE) tasks under multi-talker noisy and reverberant conditions. Three improvements are investigated: first, a frame-wise speaker adaptation (FSA) is proposed to provide dynamic target speaker cues, generalizing DeepFilterNet2 to pDeepFilterNet2; second, a redundant skip connection in the original DeepFilterNet2 is removed and a single layer of causal multi-head self-attention is added, enhancing the model's ability to aggregate global context information; finally, a multi-domain loss function combining both time-domain and frequency-domain losses is introduced to further improve PSE system performance. Moreover, different types of target speaker embedding are also investigated to assess their effectiveness. Our experiments are conducted on the DNS4 Challenge dataset, and results show that the proposed pDeepFilterNet2 significantly outperforms the original DeepFilterNet2 across multiple evaluation metrics. Furthermore, it achieves competitive performance compared to our previously proposed sDPCCN, which was specifically designed for target speaker extraction.
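As a rough illustration of the multi-domain loss described in the abstract, the sketch below combines a time-domain scale-invariant SDR term with a frequency-domain STFT-magnitude term. The paper's exact formulation is not given in this excerpt, so the function names, window choice, and the weights `w_time` and `w_freq` are assumptions for illustration only:

```python
import numpy as np

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimate and reference (time-domain term)."""
    # Project the estimate onto the reference to remove any scaling ambiguity.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return -10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps) + eps)

def magnitude_loss(est, ref, n_fft=512, hop=128):
    """Mean absolute error between STFT magnitudes (frequency-domain term)."""
    def stft_mag(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return float(np.mean(np.abs(stft_mag(est) - stft_mag(ref))))

def multi_domain_loss(est, ref, w_time=1.0, w_freq=1.0):
    """Weighted sum of the time-domain and frequency-domain terms."""
    return w_time * si_sdr_loss(est, ref) + w_freq * magnitude_loss(est, ref)
```

In practice such losses are computed on minibatches inside the training framework; this standalone NumPy version only shows how the two domains can be mixed into a single training objective.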


Data availability

The datasets used and/or analyzed during the current study are available from the first author on reasonable request.

Notes

  1. https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb.

  2. https://github.com/jungjee/RawNet.

  3. https://freesound.org/.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62071302).

Author information


Corresponding author

Correspondence to Yanhua Long.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, S., Guan, H., Wei, S. et al. Improving low-complexity and real-time DeepFilterNet2 for personalized speech enhancement. Int J Speech Technol 27, 299–306 (2024). https://doi.org/10.1007/s10772-024-10101-z

