Abstract
Advances in deep learning for speech generation have made fake speech almost perceptually indistinguishable from real speech. Most attempts in the literature are dataset-dependent and fail to detect fake speech under domain variability or in cross-dataset scenarios. This study explores the potential of excitation source features for detecting fake speech under domain-variable conditions. Motivated by the discrimination observed from excitation source information, this work proposes a new feature, the residual modulation spectrogram, for fake speech detection. ResNet-34 is used for the binary classification task of distinguishing fake from real speech, with the modulation spectrogram as the baseline feature. The proposed approach performs well under domain variability in most cases and generalizes across datasets. Additionally, a score-level combination of the residual modulation spectrogram and the modulation spectrogram further improves performance, confirming the complementary value of source (excitation) and system (vocal tract) information under domain variability.
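The feature extraction pipeline summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the LPC order (12), STFT parameters, and whole-utterance (rather than frame-wise) linear prediction are simplifying assumptions chosen for brevity. The idea is that inverse filtering with the LP coefficients suppresses the vocal-tract (system) contribution, leaving the excitation (source) residual, whose per-band temporal envelopes are then analyzed in the modulation-frequency domain.

```python
import numpy as np
from scipy.signal import lfilter, stft

def lpc_coeffs(x, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1 : n + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1 : i + 1] = a[1 : i + 1] + k * a[i - 1 :: -1]
        err *= 1.0 - k * k
    return a                                 # prediction polynomial A(z)

def residual_modulation_spectrogram(x, fs, lpc_order=12, nperseg=256):
    """LP residual -> per-band STFT envelopes -> FFT along the time axis."""
    a = lpc_coeffs(x, lpc_order)             # whole-utterance LPC (frame-wise in practice)
    residual = lfilter(a, [1.0], x)          # inverse filtering removes the system part
    _, _, Z = stft(residual, fs=fs, nperseg=nperseg)
    env = np.abs(Z)                          # temporal envelope per acoustic-frequency band
    mod = np.abs(np.fft.rfft(env, axis=1))   # modulation spectrum over time, per band
    return mod                               # rows: acoustic freq, cols: modulation freq
```

The resulting two-dimensional array can be treated as an image-like input to a classifier such as ResNet-34; computing the same representation from the raw signal instead of the residual would give the baseline modulation spectrogram.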
Acknowledgement
The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Government of India for funding this research work and “Anantganak”, the high performance computation (HPC) facility, IIT Dharwad, for enabling us to perform our experiments.
Ethics declarations
All views and data related to information technology, and anything deemed to be “cyber security”, are presented on behalf of the authors of this paper and not on behalf of McAfee.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sadashiv T. N., R., Kumar, D., Agarwal, A., Tzudir, M., Mishra, J., Prasanna, S.R.M. (2023). Source and System-Based Modulation Approach for Fake Speech Detection. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science(), vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48308-0
Online ISBN: 978-3-031-48309-7
eBook Packages: Computer Science; Computer Science (R0)