One-Shot Voice Conversion Based on Style Generative Adversarial Networks with ESR and DSNet

Published in: Circuits, Systems, and Signal Processing (2024)

Abstract

This paper proposes a novel one-shot voice conversion (VC) method called DS-ESR-StyleGAN-VC, which introduces several innovations to address the challenges faced by StarGAN-VC. First, we adopt an ESR network in the generator to extract deep features, effectively addressing the semantic content corruption observed in StarGAN-VC. Second, we leverage the dense weighted normalized shortcuts of DSNet, which avoid the performance degradation and vanishing gradients caused by stacking more convolutional layers; the DSNet module is inserted between the encoder and decoder of the generator to further refine the concatenated features and improve conversion quality. Third, we remove the classifier module of StarGAN-VC and instead use a style encoder to extract speaker style features, improving speaker similarity. Moreover, the proposed method naturally supports one-shot VC. Experiments show that it consistently outperforms the competitive StarGAN-VC in semantic content preservation, naturalness, and speaker similarity under the many-to-many setting, and surpasses the competitive StarGAN-ZSVC in naturalness under the one-shot setting.
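
Only the abstract is accessible at this level, so the architecture can be read only at a high level: an ESR-style residual feature extractor in the generator, a DSNet module with dense weighted normalized shortcuts between the encoder and decoder, and a style encoder in place of StarGAN-VC's classifier. The sketch below is a minimal PyTorch rendering of that description, not the authors' implementation; every module name, layer count, dimension, and the additive style injection are illustrative assumptions (an AdaIN-style injection, as in reference 8, would be the more faithful guess for a StyleGAN-based design).

```python
# Minimal PyTorch sketch of the generator layout described in the abstract.
# Module names, dimensions, and wiring are illustrative assumptions only.
import torch
import torch.nn as nn


class ESRBlock(nn.Module):
    """EDSR-style residual block (conv-ReLU-conv, no batch normalization),
    the assumed reading of the "ESR network" used for deep feature extraction."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut helps preserve content


class DSNetBlock(nn.Module):
    """Assumed form of the dense weighted normalized shortcut: each layer
    consumes a softmax-normalized, learnable-weighted sum of every earlier
    output, mitigating degradation and vanishing gradients as depth grows."""

    def __init__(self, channels, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.InstanceNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_layers)
        )
        # one learnable shortcut weight per (layer, earlier-output) pair
        self.shortcut_w = nn.ParameterList(
            nn.Parameter(torch.ones(i + 1)) for i in range(num_layers)
        )

    def forward(self, x):
        outs = [x]
        for layer, w in zip(self.layers, self.shortcut_w):
            a = torch.softmax(w, dim=0)  # normalize the shortcut weights
            dense_in = sum(ai * o for ai, o in zip(a, outs))
            outs.append(layer(dense_in))
        return outs[-1]


class Generator(nn.Module):
    """Encoder -> DSNet bottleneck -> decoder, conditioned on a style
    embedding produced by a separate style encoder (which replaces the
    StarGAN-VC classifier). Additive style injection is a placeholder."""

    def __init__(self, channels=64, style_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            ESRBlock(channels),
            ESRBlock(channels),
        )
        self.bottleneck = DSNetBlock(channels)
        self.decoder = nn.Sequential(
            ESRBlock(channels),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )
        self.style_proj = nn.Linear(style_dim, channels)

    def forward(self, features, style):
        h = self.encoder(features)                        # content features
        h = h + self.style_proj(style)[:, :, None, None]  # inject target style
        h = self.bottleneck(h)                            # refine spliced features
        return self.decoder(h)


# One-shot usage: a single target utterance yields the style embedding.
# gen = Generator()
# converted = gen(torch.randn(1, 1, 36, 128), torch.randn(1, 64))
```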


Data availability

Both datasets used in this study are openly accessible. The VCC2020 dataset can be downloaded at https://github.com/nii-yamagishilab/VCC2020-database, and the VCC2018 dataset at https://datashare.ed.ac.uk/handle/10283/3061.

References

  1. M. Baas, H. Kamper, StarGAN-ZSVC: towards zero-shot voice conversion in low-resource contexts, in Southern African Conference for Artificial Intelligence Research, pp 69–84 (2020)

  2. M. Chen, Y. Shi, T. Hain, Towards low-resource StarGAN voice conversion using weight adaptive instance normalization, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5949–5953 (2021)

  3. C. Chun, Y.H. Lee, G.W. Lee, et al., Non-parallel voice conversion using cycle-consistent adversarial networks with self-supervised representations, in 2023 IEEE 20th Consumer Communications & Networking Conference (CCNC), pp 931–932 (2023)

  4. S. Ghosh, Y. Sinha, I. Siegert, et al., Improving voice conversion for dissimilar speakers using perceptual losses, in Deutsche Gesellschaft für Akustik e.V., pp 1358–1361 (2023)

  5. K. He, X. Zhang, S. Ren, et al., Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778 (2016)

  6. C.C. Hsu, H.T. Hwang, Y.C. Wu, et al., Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849 (2017)

  7. G. Huang, Z. Liu, L. Van Der Maaten et al., Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4700–4708 (2017)

  8. X. Huang, S. Belongie, Arbitrary style transfer in real-time with adaptive instance normalization, in Proceedings of the IEEE International Conference on Computer Vision, pp 1501–1510 (2017)

  9. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning, pp 448–456 (2015)

  10. H. Kameoka, T. Kaneko, K. Tanaka, et al., StarGAN-VC: non-parallel many-to-many voice conversion using star generative adversarial networks, in 2018 IEEE Spoken Language Technology Workshop (SLT), pp 266–273 (2018)

  11. H. Kameoka, T. Kaneko, K. Tanaka et al., Nonparallel voice conversion with augmented classifier star generative adversarial networks. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 2982–2995 (2020)

  12. T. Kaneko, H. Kameoka, Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293 (2017)

  13. T. Kaneko, H. Kameoka, K. Tanaka, et al., StarGAN-VC2: rethinking conditional methods for StarGAN-based voice conversion. arXiv preprint arXiv:1907.12279 (2019)

  14. T. Karras, S. Laine, M. Aittala, et al., Analyzing and improving the image quality of StyleGAN, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8110–8119 (2020)

  15. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  16. Y. Li, D. Xu, Y. Zhang, et al., Non-parallel many-to-many voice conversion with PSR-StarGAN, in Interspeech, pp 781–785 (2020)

  17. Y. Li, Z. He, Y. Zhang et al., High-quality many-to-many voice conversion using transitive star generative adversarial networks with adaptive instance normalization. J. Circuits Syst. Comput. 30(10), 2150188 (2021)

  18. Y. Li, X. Qiu, P. Cao et al., Non-parallel voice conversion based on perceptual star generative adversarial network. Circuits Syst. Signal Process. 41(8), 4632–4648 (2022)

  19. B. Lim, S. Son, H. Kim, et al., Enhanced deep residual networks for single image super-resolution, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 136–144 (2017)

  20. K. Liu, J. Zhang, Y. Yan, High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for Mandarin, in Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp 410–414 (2007)

  21. J. Lorenzo-Trueba, J. Yamagishi, T. Toda, et al., The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. arXiv preprint arXiv:1804.04262 (2018)

  22. H.T. Luong, J. Yamagishi, NAUTILUS: a versatile voice cloning system. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 2967–2981 (2020)

  23. M. Morise, F. Yokomori, K. Ozawa, WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 99(7), 1877–1884 (2016)

  24. X. Qiu, Y. Luo, Research on synthesis of designated speaker speech based on StarGAN-VC model, in International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP), pp 256–263 (2022)

  25. S. Sakamoto, A. Taniguchi, T. Taniguchi, et al., StarGAN-VC+ASR: StarGAN-based non-parallel voice conversion regularized by automatic speech recognition. arXiv preprint arXiv:2108.04395 (2021)

  26. S. Si, J. Wang, X. Zhang, et al., Boosting StarGANs for voice conversion with contrastive discriminator, in International Conference on Neural Information Processing, pp 355–366 (2022)

  27. D. Wang, J. Yu, X. Wu, et al., End-to-end voice conversion via cross-modal knowledge distillation for dysarthric speech reconstruction, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 7744–7748 (2020)

  28. R. Wang, Y. Ding, L. Li, et al., One-shot voice conversion using StarGAN, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 7729–7733 (2020)

  29. Y. Wang, D. Stanton, Y. Zhang, et al., Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis, in International Conference on Machine Learning, pp 5180–5189 (2018)

  30. M. Zhang, X. Wang, F. Fang, et al., Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet. arXiv preprint arXiv:1903.12389 (2019)

  31. S. Zhao, T.H. Nguyen, H. Wang, et al., Fast learning for non-parallel many-to-many voice conversion with residual star generative adversarial networks, in Interspeech, pp 689–693 (2019)

  32. Y. Zhao, W.C. Huang, X. Tian, et al., Voice conversion challenge 2020: intra-lingual semi-parallel and cross-lingual voice conversion. arXiv preprint arXiv:2008.12527 (2020)

  33. W.Z. Zheng, J.Y. Han, C.Y. Chen et al., Improving the efficiency of dysarthria voice conversion system based on data augmentation. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 4613–4623 (2023)

Acknowledgements

This work was supported by the National Science and Technology Innovation 2030 "New Generation of Artificial Intelligence" major project "Cognitive Computing Basic Theory and Method Research" (2020AAA0106200), the National Natural Science Foundation of China (61936005, 62001038), the Gusu Leading Talents grant (ZXL2022472), and the Young Talent Innovation Project and Natural Science Foundation of Nanjing University of Posts and Telecommunications (NY223115).

Author information

Corresponding author

Correspondence to Yanping Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Y., Pan, L., Qiu, X. et al. One-Shot Voice Conversion Based on Style Generative Adversarial Networks with ESR and DSNet. Circuits Syst Signal Process 43, 4565–4587 (2024). https://doi.org/10.1007/s00034-024-02675-5
