Defending against FakeBob Adversarial Attacks in Speaker Verification Systems with Noise-Adding
Figure 1. A score-based speaker verification system.
Figure 2. The process of FakeBob adversarial attacks.
Figure 3. An adversarial audio in the time domain and its perturbations in both the time domain and the mel spectrogram. (a) Adversarial audio in time domain; (b) perturbations in time domain; (c) perturbations in mel spectrogram.
Figure 4. The proposed defense system.
Figure 5. The mel spectrograms of an adversarial audio, a denoised adversarial audio, and a noise-added adversarial audio. (a) Adversarial audio; (b) denoised adversarial audio; (c) noise-added adversarial audio.
Figure 6. Performance of the strategy of bypassing the plugin function by adaptive FakeBob attacks with different ε in GMM SV systems.
Figure 7. Performance of the adaptive FakeBob attacks with ε = 0.002 in GMM SV systems with noise-adding.
Figure 8. Performance of the noise-adding defense with different types of noise on the normal operations of GMM SV systems.
Figure 9. Performance of the noise-adding defense with different types of noise on FakeBob adversarial audios in GMM SV systems.
Abstract
1. Introduction
- Using the time-domain waveform and the mel spectrogram, we find that the perturbations in an adversarial audio are very small and resemble white noise. These perturbations, however, are not random; they are intentionally designed to fool speaker verification systems.
- We propose a defense framework that is simple, lightweight, and effective against adversarial attacks in speaker verification systems.
- We find that the denoising function is able to reduce or remove the perturbations. As shown in our experiments based on FakeBob [7] against GMM and i-vector speaker verification systems, denoising can reduce the targeted attack success rate from 100% to 56.05% and from 95% to 38.63%, respectively. A downside of denoising is that it adds non-negligible processing time in the GMM system.
- We discover that the noise-adding method performs much better than the denoising function. For example, we show that noise-adding can further reduce the targeted attack success rate of FakeBob to 5.20% in the GMM system and to 0.50% in the i-vector system. Moreover, under this defense, FakeBob takes about 25 times longer to generate an adversarial audio in the GMM system and 5 times longer in the i-vector system. On the other hand, the processing time of the noise-adding function itself is negligible. Furthermore, noise-adding only slightly increases the equal error rate of a speaker verification system. Therefore, we believe that such a simple solution can be applied to any speaker verification system against adversarial attacks.
- Inspired by adaptive attacks in [15], we study two intuitive strategies that can be used by adaptive FakeBob attacks against the noise-adding defense. One strategy is to bypass the plugin function in the speaker verification system to generate an adversarial audio. The other strategy is to update the objective function of the attack to make the adversarial audio more robust against noise. We show through experiments in a GMM speaker verification system that although these countermeasures reduce the effectiveness of the defense, noise-adding can still provide a considerable level of protection against adaptive FakeBob attacks.
- Through experiments, we find that noise with different probability distribution functions, such as Gaussian [16], uniform [17], logistic [18], Laplace [19], and a variation of the Rademacher distribution [20], has a similar effect on the normal operations of speaker verification systems and on FakeBob adversarial audios (see the generation sketch below).
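To make the last point concrete, the following is a minimal Python sketch that draws zero-mean noise from each of these distributions, rescaled to a common standard deviation σ so that the distributions are compared on equal footing. The helper name `make_noise` and the rescaling conventions are ours for illustration; they are not from the paper.

```python
import numpy as np

def make_noise(dist: str, sigma: float, n: int, rng=None) -> np.ndarray:
    """Generate zero-mean noise of length n with standard deviation sigma.

    The supported distributions mirror the ones compared in Section 6.5;
    each is rescaled so that its standard deviation equals sigma.
    """
    rng = rng or np.random.default_rng()
    if dist == "gaussian":
        return rng.normal(0.0, sigma, n)
    if dist == "uniform":
        # Uniform on [-b, b] has standard deviation b / sqrt(3).
        b = sigma * np.sqrt(3.0)
        return rng.uniform(-b, b, n)
    if dist == "logistic":
        # Logistic with scale s has standard deviation s * pi / sqrt(3).
        s = sigma * np.sqrt(3.0) / np.pi
        return rng.logistic(0.0, s, n)
    if dist == "laplace":
        # Laplace with scale b has standard deviation b * sqrt(2).
        return rng.laplace(0.0, sigma / np.sqrt(2.0), n)
    if dist == "rademacher":
        # A variation of the Rademacher distribution: +/- sigma with equal probability.
        return sigma * rng.choice([-1.0, 1.0], n)
    raise ValueError(f"unknown distribution: {dist}")
```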
2. Related Work
3. Background
3.1. Speaker Verification Systems
3.2. Adversarial Attacks against Speaker Verification Systems
Algorithm 1 FakeBob Attacks
1: Input: an audio signal array A, threshold θ of the SV system
2: Output: an adversarial audio A′
3: Require: threshold θ of the targeted SV system, audio signal array A, maximum iteration m, score function S, gradient descent function ∇, clip function clip(·), learning rate η, and sign function sign(·)
4: A′ ← A
5: s ← S(A′)
6: for i ← 1; i ≤ m; i ← i + 1 do
7:     if s > θ then
8:         return A′
9:     end if
10:    A′ ← clip(A′ + η · sign(∇(A′)))
11:    s ← S(A′)
12: end for
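To make the loop in Algorithm 1 concrete, here is a minimal Python sketch of a FakeBob-style attack. It assumes a black-box scoring function `score(audio)` and a decision threshold `theta`, and approximates the gradient with NES-style finite differences as FakeBob does [7]; all hyperparameter values and helper names are illustrative, not the exact settings of the paper.

```python
import numpy as np

def fakebob_attack(audio, score, theta, max_iter=1000, lr=0.002,
                   eps=0.002, n_samples=50, sigma=0.001, rng=None):
    """Iteratively perturb `audio` until the SV score exceeds threshold `theta`.

    `score` is the black-box scoring function of the target SV system.
    The gradient is approximated by NES-style finite differences, and the
    perturbation is kept within an L-infinity ball of radius `eps` around
    the original audio as well as within the valid waveform range.
    """
    rng = rng or np.random.default_rng()
    adv = audio.copy()
    for _ in range(max_iter):
        if score(adv) > theta:          # accepted by the SV system
            return adv
        # NES gradient estimate: average of score-weighted random directions.
        grad = np.zeros_like(adv)
        for _ in range(n_samples):
            u = rng.standard_normal(adv.shape)
            grad += (score(adv + sigma * u) - score(adv - sigma * u)) * u
        grad /= 2.0 * sigma * n_samples
        # Signed gradient ascent on the score, then clip to the eps-ball.
        adv = adv + lr * np.sign(grad)
        adv = np.clip(adv, audio - eps, audio + eps)
        adv = np.clip(adv, -1.0, 1.0)   # keep a valid waveform
    return None                         # attack failed within max_iter
```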
4. Proposed Defense System
4.1. Design Goals of a Defense System
- Simplicity. The defense system is easy to implement and can be compatible with an existing SV system. That is, it does not require any change to the internal structure of the currently used SV system.
- Lightweight. It does not significantly increase the computational load of the SV system. The defense method only slightly increases the processing time for an input audio.
- Effectiveness. The defense algorithm should greatly increase the time an attacker needs to generate a successful adversarial audio and significantly reduce the ASR of adversarial attacks such as FakeBob. At the same time, the defense method should minimally affect the normal operations of an SV system, e.g., only slightly increase its EER.
4.2. A Defense System
4.3. Denoising
Algorithm 2 Denoising Function
1: Input: an audio clip x, noise variance σ²
2: Output: a denoised audio clip x̂
3: Require: normal distribution generator N, short-time Fourier transform STFT(·), inverse short-time Fourier transform ISTFT(·)
4: Create a noise clip with size = length of x
5: for each sample noise[i] in the noise clip do
6:     noise[i] ← N(0, σ²)
7: end for
8: F_n ← STFT(noise clip)
9: F_x ← STFT(x)
10: Remove noise from F_x based on the noise profile F_n
11: x̂ ← ISTFT(F_x)
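The following is a sketch of the spectral-gating idea behind Algorithm 2, in the spirit of the noisereduce library [12]: a synthetic Gaussian noise clip provides a per-frequency noise profile, and time-frequency bins of the input that fall below that profile are suppressed. The thresholding rule and the parameter values (`n_fft`, `n_std`) are illustrative assumptions, not the library's exact defaults.

```python
import numpy as np
import librosa

def denoise(x, sigma, n_fft=2048, n_std=1.5, rng=None):
    """Spectral-gating denoiser sketched from Algorithm 2.

    A synthetic Gaussian noise clip defines a per-frequency noise
    profile; time-frequency bins of x whose magnitude falls below the
    profile mean plus n_std standard deviations are attenuated.
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=len(x))

    S_noise = np.abs(librosa.stft(noise, n_fft=n_fft))
    S_x = librosa.stft(x, n_fft=n_fft)

    # Per-frequency threshold estimated from the noise profile.
    thresh = (S_noise.mean(axis=1) + n_std * S_noise.std(axis=1))[:, None]

    # Binary mask: keep only bins that rise above the noise floor.
    mask = np.abs(S_x) >= thresh
    return librosa.istft(S_x * mask, length=len(x))
```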
4.4. Noise-Adding
Algorithm 3 Noise-Adding Function
1: Input: an audio clip x, noise variance σ²
2: Output: a noise-added audio clip x̂
3: Require: normal distribution generator N
4: Create a noise clip with size = length of x
5: for each sample noise[i] in the noise clip do
6:     noise[i] ← N(0, σ²)
7: end for
8: x̂ ← x + noise clip
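Algorithm 3 amounts to a few lines of Python. The sketch below also shows how the function can act as a plugin placed in front of an unmodified SV system, matching the design goals of Section 4.1; the names `add_noise`, `defended_score`, and `score` are ours for illustration.

```python
import numpy as np

def add_noise(x, sigma, rng=None):
    """Noise-adding function of Algorithm 3: return x plus N(0, sigma) noise."""
    rng = rng or np.random.default_rng()
    return x + rng.normal(0.0, sigma, size=len(x))

# The plugin sits in front of an unmodified SV system: every incoming
# audio (benign or adversarial) is perturbed before it is scored.
def defended_score(x, score, sigma=0.002):
    return score(add_noise(x, sigma))
```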
5. Adaptive FakeBob Attacks
5.1. Bypassing the Plugin Function
5.2. Updating the Objective Function
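One plausible way to implement this strategy is to replace the raw score S(A′) in the attack's objective with its expected value under the defender's noise, in the spirit of the expectation-over-transformation approach of Athalye et al. (Synthesizing robust adversarial examples). The Monte Carlo formulation below, and the choices of `k` and `sigma`, are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def robust_score(adv, score, sigma, k=10, rng=None):
    """Averaged objective for an adaptive attack against noise-adding.

    Instead of the raw score S(adv), the attacker maximizes the expected
    score under the defender's Gaussian noise, approximated here by k
    Monte Carlo draws (an expectation-over-transformation style objective).
    """
    rng = rng or np.random.default_rng()
    draws = [score(adv + rng.normal(0.0, sigma, size=adv.shape))
             for _ in range(k)]
    return float(np.mean(draws))
```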
6. Performance Evaluation
6.1. Experimental Setup
6.2. Performance Evaluation of the Proposed Defense System for the Normal Operations of SV Systems
6.3. Performance Evaluation of the Proposed Defense System against FakeBob Attacks
6.4. Performance Evaluation of Adaptive FakeBob Attacks against the Noise-Adding Defense
6.4.1. Bypassing the Plugin Function
6.4.2. Updating the Objective Function
6.5. Effect of Different Noise Types for Noise-Adding Defense
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chang, L.; Chen, Z.; Chen, C.; Wang, G.; Bi, Z. Defending against adversarial attacks in speaker verification systems. In Proceedings of the IEEE International Performance Computing and Communications Conference (IPCCC), Austin, TX, USA, 29–31 October 2021. [Google Scholar]
- Babangida, L.; Perumal, T.; Mustapha, N.; Yaakob, R. Internet of things (IoT) based activity recognition strategies in smart homes: A review. IEEE Sens. J. 2022, 22, 8327–8336. [Google Scholar] [CrossRef]
- Das, R.K.; Tian, X.; Kinnunen, T.; Li, H. The attacker’s perspective on automatic speaker verification: An overview. arXiv 2020, arXiv:2004.08849. [Google Scholar]
- Wang, S.; Cao, J.; He, X.; Sun, K.; Li, Q. When the differences in frequency domain are compensated: Understanding and defeating modulated replay attacks on automatic speech recognition. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS), Virtual Event, 9–13 November 2020. [Google Scholar]
- Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 2015, 66, 130–153. [Google Scholar] [CrossRef]
- Wu, Z.; Li, H. Voice conversion versus speaker verification: An overview. In APSIPA Transactions on Signal and Information Processing; Cambridge University Press: Cambridge, UK, 2014; Volume 3. [Google Scholar]
- Chen, G.; Chen, S.; Fan, L.; Du, X.; Zhao, Z.; Song, F.; Liu, Y. Who is real Bob? Adversarial attacks on speaker recognition systems. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 24–27 May 2021. [Google Scholar]
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Yuan, X.; He, P.; Zhu, Q.; Li, X. Adversarial examples: Attacks and defenses. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2805–2824. [Google Scholar] [CrossRef] [PubMed]
- Yang, Z.; Li, B.; Chen, P.; Song, D. Characterizing audio adversarial examples using temporal dependency. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Muda, L.; Begam, M.; Elamvazuthi, I. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. J. Comput. 2010, 2, 138–143. [Google Scholar]
- Sainburg, T. timsainb/noisereduce: V1.0. Zenodo. 2019. Available online: https://github.com/timsainb/noisereduce (accessed on 12 July 2022).
- Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 2000, 10, 19–41. [Google Scholar] [CrossRef]
- Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
- Tramer, F.; Carlini, N.; Brendel, W.; Madry, A. On adaptive attacks to adversarial example defenses. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020. [Google Scholar]
- Normal Distribution, Wikipedia. Available online: https://en.wikipedia.org/wiki/Normal_distribution (accessed on 12 July 2022).
- Continuous Uniform Distribution, Wikipedia. Available online: https://en.wikipedia.org/wiki/Continuous_uniform_distribution (accessed on 12 July 2022).
- Logistic Distribution, Wikipedia. Available online: https://en.wikipedia.org/wiki/Logistic_distribution (accessed on 12 July 2022).
- Laplace Distribution, Wikipedia. Available online: https://en.wikipedia.org/wiki/Laplace_distribution (accessed on 12 July 2022).
- Rademacher Distribution, Wikipedia. Available online: https://en.wikipedia.org/wiki/Rademacher_distribution (accessed on 12 July 2022).
- Abdullah, H.; Warren, K.; Bindschaedler, V.; Papernot, N.; Traynor, P. SoK: The faults in our ASRs: An overview of attacks against automatic speech recognition and speaker identification systems. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 24–27 May 2021. [Google Scholar]
- Jati, A.; Hsu, C.; Pal, M.; Peri, R.; AbdAlmageed, W.; Narayanan, S. Adversarial attack and defense strategies for deep speaker recognition systems. Comput. Speech Lang. 2021, 68, 101199. [Google Scholar] [CrossRef]
- Villalba, J.; Joshi, S.; Zelasko, P.; Dehak, N. Representation learning to classify and detect adversarial attacks against speaker and speech recognition systems. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021. [Google Scholar]
- Li, X.; Li, N.; Zhong, J.; Wu, X.; Liu, X.; Su, D.; Yu, D.; Meng, H. Investigating robustness of adversarial samples detection for automatic speaker verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
- Wu, H.; Li, X.; Liu, A.T.; Wu, Z.; Meng, H.; Lee, H. Adversarial defense for automatic speaker verification by cascaded self-supervised learning models. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
- Joshi, S.; Villalba, J.; Zelasko, P.; Moro-Velazquez, L.; Dehak, N. Study of pre-processing defenses against adversarial attacks on state-of-the-art speaker recognition systems. IEEE Trans. Inf. Forensics Secur. 2021, 16, 4811–4826. [Google Scholar] [CrossRef]
- Wu, H.; Zhang, Y.; Wu, Z.; Wang, D.; Lee, H. Voting for the right answer: Adversarial defense for speaker verification. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021. [Google Scholar]
- Byun, J.; Go, H.; Kim, C. Small input noise is enough to defend against query-based black-box attacks. arXiv 2021, arXiv:2101.04829. [Google Scholar]
- Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014. [Google Scholar]
- Snyder, D.; Garcia-Romero, D.; Sell, G.; McCree, A.; Povey, D.; Khudanpur, S. Speaker recognition for multi-speaker conversations using X-vectors. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
- Cheng, J.M.; Wang, H.C. A method of estimating the equal error rate for automatic speaker verification. In Proceedings of the IEEE International Symposium on Chinese Spoken Language Processing, Hong Kong, China, 15–18 December 2004. [Google Scholar]
- Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Srndic, N.; Laskov, P.; Giacinto, G.; Roli, F. Evasion attacks against machine learning at test time. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Prague, Czech Republic, 23–27 September 2013; Volume 8190, pp. 387–402. [Google Scholar]
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2015, arXiv:1412.6572. [Google Scholar]
- Li, J.; Liu, Y.; Chen, T.; Xiao, Z.; Li, Z.; Wang, J. Adversarial attacks and defenses on cyber–physical systems: A survey. IEEE Internet Things J. 2020, 7, 5103–5115. [Google Scholar] [CrossRef]
- Rahman, A.; Hossain, M.S.; Alrajeh, N.A.; Alsolami, F. Adversarial examples—Security threats to COVID-19 deep learning systems in medical IoT devices. IEEE Internet Things J. 2021, 8, 9603–9610. [Google Scholar] [CrossRef]
- Kong, X.; Ge, Z. Adversarial attacks on neural-network-based soft sensors: Directly attack output. IEEE Trans. Ind. Inform. 2022, 18, 2443–2451. [Google Scholar] [CrossRef]
- Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Ilyas, A.; Engstrom, L.; Athalye, A.; Lin, J. Black-box adversarial attacks with limited queries and information. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- McClellan, J.H.; Schafer, R.W.; Yoder, M.A. DSP First, 2nd ed.; Pearson Education, Inc.: London, UK, 2016. [Google Scholar]
- Librosa, a Python Package for Music and Audio Analysis. Available online: https://librosa.org/ (accessed on 12 July 2022).
- Athalye, A.; Carlini, N.; Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 274–283. [Google Scholar]
- Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing robust adversarial examples. arXiv 2018, arXiv:1707.07397. [Google Scholar]
- Google. Google Cloud Platform. Available online: https://cloud.google.com/ (accessed on 12 July 2022).
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011. [Google Scholar]
- Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb2: Deep speaker recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015. [Google Scholar]
- defense_FakeBob. GitHub. Available online: https://github.com/zeshengchen/defense_FakeBob (accessed on 12 July 2022).
- Rustle Noise, Wikipedia. Available online: https://en.wikipedia.org/wiki/Rustle_noise (accessed on 12 July 2022).
Table 1. Performance of the defense plugins on the normal operations of the GMM SV system, where σ is the noise parameter of the plugin.

| Plugin | σ | EER (%) | Time (s) |
|---|---|---|---|
| Original | 0 | 1.05 | 18.44 |
| Denoising | 0.001 | 1.61 | 30.67 |
| Denoising | 0.002 | 2.95 | 30.41 |
| Denoising | 0.005 | 3.36 | 30.79 |
| Noise-Adding | 0.001 | 1.21 | 19.34 |
| Noise-Adding | 0.002 | 1.92 | 19.78 |
| Noise-Adding | 0.005 | 3.94 | 20.31 |
Table 2. Performance of the defense plugins on the normal operations of the i-vector SV system, where σ is the noise parameter of the plugin.

| Plugin | σ | EER (%) | Time (s) |
|---|---|---|---|
| Original | 0 | 0 | 433.45 |
| Denoising | 0.001 | 0.15 | 447.37 |
| Denoising | 0.002 | 0.05 | 447.82 |
| Denoising | 0.005 | 0.49 | 446.20 |
| Noise-Adding | 0.001 | 0.44 | 435.35 |
| Noise-Adding | 0.002 | 0.39 | 435.89 |
| Noise-Adding | 0.005 | 1.14 | 435.51 |
Table 3. Performance of the defense plugins against FakeBob attacks in the GMM SV system.

| Plugin | σ | Avg Iter | Avg Time (s) | Avg ASR (%) |
|---|---|---|---|---|
| Original | 0 | 23.00 | 158.68 | 100.00 |
| Denoising | 0.001 | 18.90 | 192.02 | 77.20 |
| Denoising | 0.002 | 22.85 | 235.96 | 56.05 |
| Denoising | 0.005 | 22.30 | 235.78 | 51.00 |
| Noise-Adding | 0.001 | 92.60 | 614.92 | 24.35 |
| Noise-Adding | 0.002 | 604.95 | 3992.88 | 5.20 |
| Noise-Adding | 0.005 | 694.95 | 4350.35 | 4.10 |
Table 4. Performance of the defense plugins against FakeBob attacks in the i-vector SV system.

| Plugin | σ | Avg Iter | Avg Time (s) | Avg ASR (%) |
|---|---|---|---|---|
| Original | 0 | 168.88 | 6080.47 | 95.00 |
| Denoising | 0.001 | 97.40 | 3702.36 | 55.68 |
| Denoising | 0.002 | 100.58 | 3825.02 | 38.63 |
| Denoising | 0.005 | 344.53 | 13,130.24 | 17.73 |
| Noise-Adding | 0.001 | 556.33 | 20,041.00 | 8.98 |
| Noise-Adding | 0.002 | 918.23 | 33,017.30 | 0.50 |
| Noise-Adding | 0.005 | 921.48 | 33,103.39 | 1.03 |
Table 5. Performance of adaptive FakeBob attacks with different values of a.

| a | 0 | 10% | 20% | 30% |
|---|---|---|---|---|
| ASR (%) | 90.24 | 88.34 | 86.00 | 84.03 |
| Total running time | 31 h 3 min | 35 h 32 min | 40 h 17 min | 44 h 47 min |