Abstract
DeepFilterNet2, a recently proposed real-time, low-complexity speech enhancement (SE) technique, has achieved state-of-the-art performance on many deep noise suppression tasks. This paper proposes a new approach, termed pDeepFilterNet2, that generalizes and improves the original DeepFilterNet2 for personalized speech enhancement (PSE) under multi-talker noisy and reverberant conditions. Three improvements are investigated. First, frame-wise speaker adaptation (FSA) is proposed to provide dynamic target-speaker cues, generalizing DeepFilterNet2 to pDeepFilterNet2. Second, we remove a redundant skip connection in the original DeepFilterNet2 and add a single layer of causal multi-head self-attention so that the model can aggregate global context information. Finally, a multi-domain loss function combining time- and frequency-domain losses is introduced to further improve PSE system performance. In addition, different types of target-speaker embedding are investigated to assess their effectiveness. Experiments on the DNS4 Challenge dataset show that the proposed pDeepFilterNet2 significantly outperforms the original DeepFilterNet2 across multiple evaluation metrics, and that it is competitive with our previously proposed sDPCCN, which was specifically designed for target speaker extraction.
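The multi-domain loss mentioned in the abstract can be sketched as a weighted sum of a time-domain waveform loss and a spectral-magnitude loss. The paper's back matter does not give the exact loss terms or weights used in pDeepFilterNet2, so the function names, the L1 choice for both terms, and the `alpha` weight below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram via framed real FFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_domain_loss(est, ref, alpha=0.5):
    """Illustrative multi-domain loss: weighted sum of a time-domain
    L1 loss and a frequency-domain (STFT magnitude) L1 loss."""
    time_loss = np.mean(np.abs(est - ref))
    freq_loss = np.mean(np.abs(stft_mag(est) - stft_mag(ref)))
    return alpha * time_loss + (1.0 - alpha) * freq_loss
```

The loss is zero when the estimate equals the reference and grows with both waveform and spectral mismatch, which is the intuition behind combining the two domains: the time term preserves phase/waveform fidelity while the magnitude term targets perceptually relevant spectral error.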

Data availability
The datasets used and/or analyzed during the current study are available from the first author on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62071302).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Guan, H., Wei, S. et al. Improving low-complexity and real-time DeepFilterNet2 for personalized speech enhancement. Int J Speech Technol 27, 299–306 (2024). https://doi.org/10.1007/s10772-024-10101-z