
A method of multi-models fusion for speaker recognition

Published in the International Journal of Speech Technology

Abstract

As a new type of biometric recognition technology, speaker recognition is attracting increasing attention because of its advantages in remote authentication. In this paper, we construct an end-to-end speaker recognition model named GAPCNN, in which a convolutional neural network extracts speaker embeddings from spectrograms and recognition is performed by computing the cosine similarity between embeddings. In addition, we use global average pooling instead of the traditional temporal average pooling so that the model can handle utterances of different lengths. We train on the ‘dev’ set of VoxCeleb2, evaluate on the VoxCeleb1 test set, and obtain an equal error rate (EER) of 4.04%. Furthermore, we fuse GAPCNN with the x-vector model and the thin-ResNet model with GhostVLAD, obtaining an EER of 3.01%, which is lower than that of any of the three models alone. This indicates that GAPCNN is an important complement to the x-vector model and the thin-ResNet model with GhostVLAD.
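The model details are not included in this preview, but the three ideas the abstract names can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes a hypothetical CNN feature map of shape (channels, freq, time), applies global average pooling to obtain a fixed-size embedding regardless of utterance length, scores a trial with cosine similarity, and fuses per-model scores with a weighted average (one common form of score-level fusion; the paper's exact fusion rule may differ).

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse a (channels, freq, time) CNN feature map into a fixed-size
    embedding by averaging over frequency and time, so utterances of any
    duration yield a vector of length `channels`."""
    return feature_map.mean(axis=(1, 2))

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def fuse_scores(scores, weights=None):
    """Score-level fusion: a (weighted) average of per-model trial scores."""
    return float(np.average(scores, weights=weights))

# Two utterances of different lengths still give same-sized embeddings.
rng = np.random.default_rng(0)
fmap_short = rng.normal(size=(256, 8, 50))   # e.g. a short utterance
fmap_long = rng.normal(size=(256, 8, 400))   # e.g. a long utterance
e1 = global_average_pool(fmap_short)
e2 = global_average_pool(fmap_long)
assert e1.shape == e2.shape == (256,)

score = cosine_score(e1, e2)
# Hypothetical scores from three systems (e.g. GAPCNN, x-vector, thin-ResNet):
fused = fuse_scores([score, 0.4, 0.7])
```

Because global average pooling reduces over the time axis, the embedding dimension is fixed by the channel count alone, which is what lets the network accept variable-length spectrograms at test time.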


References

  • Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5297–5307).

  • Cai, W., Chen, J., & Li, M. (2018). Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop (pp. 74–81).

  • Chung, J. S., Nagrani, A., & Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. In Proc. Interspeech 2018 (pp. 1086–1090).

  • Dai, M., Dai, G., Wu, Y., Xia, Y., Shen, F., & Zhang, H. (2019). An improved feature fusion for speaker recognition. In 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC) (pp. 183–187). IEEE.

  • Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798.


  • Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4690–4699).

  • Hajavi, A., & Etemad, A. (2019). A deep neural network for short-segment speaker recognition. In Proc. Interspeech 2019 (pp. 2878–2882).


  • Hajibabaei, M., & Dai, D. (2018). Unified hypersphere embedding for speaker recognition. arXiv preprint arXiv:1807.08312.

  • Ioffe, S. (2006). Probabilistic linear discriminant analysis. In European Conference on Computer Vision (pp. 531–542). Springer.

  • Nagrani, A., Chung, J. S., & Zisserman, A. (2017). VoxCeleb: A large-scale speaker identification dataset. In Proc. Interspeech 2017 (pp. 2616–2620).

  • Nian, S., Yi, Z., Haibo, L., & Chao, H. (2018). Short utterance speaker recognition algorithm based on multi-featured i-vector. Journal of Computer Applications, 38(10), 2839.


  • Ramou, N., Djeddou, M., & Guerti, M. (2011). Two classifiers score fusion for text independent speaker verification. In 2011 11th International Conference on Intelligent Systems Design and Applications (pp. 937–940). IEEE.

  • Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815–823).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Snyder, D., Garcia-Romero, D., & Povey, D. (2015). Time delay deep neural network-based universal background models for speaker recognition. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 92–97). IEEE.

  • Snyder, D., Garcia-Romero, D., Povey, D., & Khudanpur, S. (2017). Deep neural network embeddings for text-independent speaker verification. In Interspeech (pp. 999–1003).

  • Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329–5333). IEEE.

  • Variani, E., Lei, X., McDermott, E., Moreno, I. L., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4052–4056). IEEE.

  • Wang, F., Cheng, J., Liu, W., & Liu, H. (2018). Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7), 926–930.


  • Wang, Z., Yao, K., Li, X., & Fang, S. (2020). Multi-resolution multi-head attention in deep speaker embedding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6464–6468). IEEE.

  • Xie, W., Nagrani, A., Chung, J. S., & Zisserman, A. (2019). Utterance-level aggregation for speaker recognition in the wild. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5791–5795). IEEE.

  • Zhong, Y., Arandjelović, R., & Zisserman, A. (2018). GhostVLAD for set-based face recognition. In Asian Conference on Computer Vision (pp. 35–50). Springer.

  • Zhu, Y., Ko, T., Snyder, D., Mak, B., & Povey, D. (2018). Self-attentive speaker embeddings for text-independent speaker verification. In Interspeech (pp. 3573–3577).


Author information


Corresponding author

Correspondence to Linkai Luo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wu, H., Luo, L., Peng, H. et al. A method of multi-models fusion for speaker recognition. Int J Speech Technol 25, 493–498 (2022). https://doi.org/10.1007/s10772-022-09973-w
