1. Introduction
Voice recognition is a complex process executed as a sequence of components: (1) Speech Preprocessing; (2) Feature Extraction; (3) Speech Classification; and (4) Recognition [1,2]. Feature extraction modules in voice diagnostics analyze spectrograms and mel-spectrograms to acquire Spectral Features, i.e., sets of static or dynamic Mel-Frequency Cepstral Coefficients (MFCCs) [3,4,5]. The input voice spectrograms are obtained with the Short-Time Fourier Transform (STFT), while the cepstral coefficients are derived with the Discrete Cosine Transform (DCT). The generated feature datasets serve as parametric input units of different dimensions for building classification models [6,7,8]. Published voice classifiers rely on Machine Learning (ML) and Deep Learning (DL) methods such as the Support Vector Machine (SVM), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and the Bidirectional LSTM (Bi-LSTM) network [9,10,11,12]. Algorithmic and neural efficiency is analyzed on heterogeneous datasets under the following conceptual tasks: (1) Gender Classification (GC); (2) Speaker Verification (SV); (3) Language Identification (LID); (4) Regional Dialect Identification (DID); and (5) Channel Classification (CC) [13,14,15].
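The first pipeline stages named above (preprocessing and spectral feature extraction) can be illustrated with a minimal sketch. The study's toolchain is MATLAB, but the following pure-Python fragment, with an illustrative toy signal, shows the framing, windowing, and short-time spectrum steps that precede MFCC computation:

```python
# Sketch of STFT-style feature extraction: framing, Hamming windowing,
# and a naive DFT power spectrum. The signal, frame size, and hop length
# are illustrative assumptions, not the study's measurement settings.
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def power_spectrum(frame):
    """Naive DFT magnitude-squared spectrum (bins 0..n/2)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(-frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spec.append(re * re + im * im)
    return spec

def stft_frames(signal, size, hop):
    """Split the signal into overlapping frames and window each one."""
    w = hamming(size)
    return [[s * h for s, h in zip(signal[i:i + size], w)]
            for i in range(0, len(signal) - size + 1, hop)]

# Toy signal: a pure tone of 8 cycles per 64-sample frame; its short-time
# spectrum should peak at DFT bin 8 in every frame.
N, f = 64, 8
tone = [math.sin(2 * math.pi * f * t / N) for t in range(2 * N)]
spec = power_spectrum(stft_frames(tone, N, N // 2)[0])
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(peak_bin)  # index of the spectral peak
```

Applying a mel filterbank and a DCT to such per-frame spectra would yield the MFCC features discussed above.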
In relation to the purpose of the present study, statistical characteristics of sound level were extracted experimentally from the voices of seven male and female individuals pronouncing selected speech commands. The procedures included two types of sound measurements: No. 1, all available acoustic and audio measurements; and No. 2, measurements of sound levels below 100 dB, taken at fixed temporal intervals of 15 s for every registered vocal profile. Both categories of sound parameters were recorded by a portable device running a dedicated sound-analyzer application. During the preliminary selection of informative features among the formed input datasets with different combinations of sound parameters, tests with artificial neural networks were performed to check the correctness of voice recognition. The best indicators were obtained with category No. 2, comprising LAE (A-weighted sound exposure level) [dBA]; LAeq (A-weighted equivalent continuous sound level) [dBA]; LAF (A-weighted sound level, fast time constant) [dBA]; LAS (A-weighted sound level, slow time constant) [dBA]; and LAI (A-weighted sound level, impulse time constant) [dBA].
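As a hedged illustration of how two of the listed descriptors relate, the following pure-Python sketch computes LAeq and LAE from a toy pressure series; the A-weighting filter applied by a real sound analyzer is omitted, and the sample values and 15 s interval are assumptions:

```python
# LAeq (equivalent continuous level) averages squared sound pressure over
# the interval; LAE (sound exposure level) normalizes the same energy to
# a 1 s reference duration. A-weighting is assumed to have been applied
# upstream and is not modeled here.
import math

P0 = 20e-6  # reference sound pressure, 20 micropascals

def laeq(pressures):
    """Equivalent continuous level in dB over the sampled interval."""
    mean_sq = sum(p * p for p in pressures) / len(pressures)
    return 10 * math.log10(mean_sq / P0 ** 2)

def lae(pressures, duration_s):
    """Sound exposure level: LAeq referred to a 1 s duration."""
    return laeq(pressures) + 10 * math.log10(duration_s / 1.0)

# A constant 1 Pa pressure corresponds to 94 dB, the usual calibrator level.
samples = [1.0] * 1000
print(round(laeq(samples), 1))       # ~94.0 dB
print(round(lae(samples, 15.0), 1))  # 94 dB plus 10*log10(15)
```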
This spectrum of sound features was selected after benchmarking performance and recognition accuracy with initial variants of neural networks, and was subsequently used as the basic input unit for training and verification of voice profile identification models. The paper synthesizes a methodology for voice profile identification for security and personalization of information access, based on probabilistic analysis tools implemented entirely in MATLAB R2014a. The following analytics tools were introduced: Discriminant Analysis; Feed-Forward Neural Networks; Probabilistic Neural Networks; CART Decision Trees; and Naive Bayes classification. Through performance evaluation of the analytical modules, final classification models with a high recognition index for the voice samples in the dataset were selected.
2. Discriminant Analysis in Voice Profile Recognition
In the initial stage of the studies, the possibility of adapting Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) classifiers and their subvariants was considered as follows:
Diagonal Linear Discriminant Analysis (DLDA);
Pseudo Linear Discriminant Analysis (PLDA);
Diagonal Quadratic Discriminant Analysis (DQDA);
Pseudo-Quadratic Discriminant Analysis (PQDA).
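The diagonal variants above can be sketched compactly: both model each feature as an independent Gaussian, DLDA pooling one variance per feature across classes and DQDA keeping a variance per class. The study used MATLAB classifiers; the following pure-Python fragment with illustrative two-class toy data shows the DQDA decision rule:

```python
# Minimal diagonal discriminant sketch: per-class means and per-class
# diagonal variances (DQDA). The toy training data are an illustrative
# assumption, not the study's voice features.
import math

def fit_diag(train):
    """train: {label: [feature vectors]} -> {label: (means, variances)}."""
    stats = {}
    for label, rows in train.items():
        d = len(rows[0])
        mu = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        var = [sum((r[j] - mu[j]) ** 2 for r in rows) / len(rows) for j in range(d)]
        stats[label] = (mu, var)
    return stats

def log_gauss(x, mu, var):
    """Log-likelihood of x under independent Gaussians per feature."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mu, var))

def predict_dqda(stats, x):
    """Assign x to the class with the highest diagonal-Gaussian score."""
    return max(stats, key=lambda c: log_gauss(x, *stats[c]))

train = {"A": [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
         "B": [[3.0, 3.1], [2.8, 2.9], [3.2, 3.0]]}
stats = fit_diag(train)
print(predict_dqda(stats, [0.1, 0.0]))  # near class "A"
print(predict_dqda(stats, [3.0, 3.0]))  # near class "B"
```

Pooling the variances across classes in `fit_diag` would turn this into the DLDA variant.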
A set of criteria was adopted to examine the quality of the created classification models: (1) Loss; (2) Accuracy; and (3) Misclassifications, determined with two basic approaches: Resubstitution and k-fold cross-validation. The results of these activities are shown in Table 1. The obtained quantitative estimates show a significant advantage of the Quadratic over the Linear models. Under Resubstitution, low accuracy thresholds of 55.36% with DLDA and 66.14% with LDA and PLDA were recorded. The indicators of the cross-validation procedures turned out to be similar: 66.30%, 55.36%, and 66.36% for LDA, DLDA, and PLDA, respectively. The results achieved with the Quadratic classifiers were better: Resubstitution Loss values of 0.1321 (equivalent to 86.79% accuracy) were observed for QDA and PQDA, against 0.2907 (70.93%) for DQDA. Regarding cross-validation Loss, accuracies of 85.86%, 70.71%, and 86.36% were obtained for QDA, DQDA, and PQDA.
Based on the analysis of the results, a classification model using Pseudo-Quadratic Discriminant Analysis was selected, with an approximately expected accuracy of 86.575% when operating with voice samples not involved in the training and test procedures. A specification of the classifier synthesis variables is given in Figure 1.
Figure 2 visualizes the results of assigning the voice samples to the defined classification groups for the models with the lowest and highest established quality indicators (DLDA and PQDA, respectively). Correctly classified data are arranged diagonally from upper left to lower right, and misclassifications are distributed along the remaining positions of each matrix row. The largest amount of incorrectly classified data was found in the sample for voice analysis object “Person No. 4”: 1, 20, 61, and 38 samples were wrongly assigned to the groups of “Person No. 3”, “Person No. 5”, “Person No. 6”, and “Person No. 7”, respectively.
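The confusion-matrix reading described above (correct counts on the diagonal, misclassifications spread along each row) can be reproduced in a few lines; the labels below are illustrative stand-ins for the seven speaker classes, not the study's data:

```python
# Build a confusion matrix: row = true class, column = predicted class.
def confusion_matrix(actual, predicted, labels):
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        m[idx[a]][idx[p]] += 1
    return m

labels = ["P1", "P2", "P3"]
actual    = ["P1", "P1", "P2", "P2", "P3", "P3"]
predicted = ["P1", "P2", "P2", "P2", "P3", "P1"]
cm = confusion_matrix(actual, predicted, labels)
correct = sum(cm[i][i] for i in range(len(labels)))  # diagonal counts
print(cm)                     # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(correct / len(actual))  # accuracy recovered from the diagonal
```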
3. Feed-Forward and Probabilistic Neural Networks in Voice Identification
In the course of the research related to the Artificial Intelligence concept, two categories of neural networks were introduced: Feed-Forward Neural Networks (FFNNs) and Probabilistic Neural Networks (PNNs). Validation performance was scored with the Mean-Squared Error (MSE) and classification accuracy indicators. The variation in these indicators was studied under a stepwise increment of the number of neurons in the hidden layer with hyperbolic tangent sigmoid activation for the FFNNs, and under variation of the spread parameter of the Radial Basis Function (RBF) layer with a kernel transfer function for the PNNs.
The results of the applied series of experiments are presented in Table 2. For the FFNNs, the MSE varied within the interval from 0.0113 to 0.0527, registered for networks with 35 and 5 hidden neurons, respectively. The corresponding accuracy equivalents of the established errors were a maximum of 97.90% (Figure 3a, for the selected FFNN model) and a minimum threshold of 73.20%.
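How the two FFNN scores above (MSE and classification accuracy) are obtained from network outputs can be sketched as follows; the tiny fixed weights form an illustrative one-hidden-layer tanh network in Python, not the trained 35-neuron MATLAB model from the study:

```python
# Forward pass of a one-hidden-layer tanh network, plus the two scoring
# indicators used above: mean-squared error against one-hot targets and
# argmax classification accuracy. All weights and data are illustrative.
import math

def forward(x, w_hid, w_out):
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hid]
    return [sum(wi * hi for wi, hi in zip(w, hidden)) for w in w_out]

def mse(targets, outputs):
    n = sum(len(t) for t in targets)
    return sum((ti - oi) ** 2 for t, o in zip(targets, outputs)
               for ti, oi in zip(t, o)) / n

def accuracy(targets, outputs):
    hits = sum(t.index(max(t)) == o.index(max(o))
               for t, o in zip(targets, outputs))
    return hits / len(targets)

w_hid = [[2.0, 0.0], [0.0, 2.0]]    # 2 hidden tanh neurons
w_out = [[1.0, 0.0], [0.0, 1.0]]    # 2 linear output neurons
inputs  = [[1.0, -1.0], [-1.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]  # one-hot class targets
outputs = [forward(x, w_hid, w_out) for x in inputs]
print(round(mse(targets, outputs), 4), accuracy(targets, outputs))
```

Both samples are classified correctly (accuracy 1.0) even though the raw MSE is nonzero, which is why the two indicators are reported separately in Table 2.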
For the Probabilistic Neural Network structures created with a fixed number of RBF neurons (Figure 3b), variations in the spread from 0.500 to 0.575 did not change the error, where a constant level of MSE = 4.0816 × 10⁻⁴ was observed. After the criterion was increased beyond 0.600, the error began a smooth and then faster exponential growth, reaching its highest value of 0.0051 at the limit spread of 0.925. The analysis shows more adequate behavior of the PNNs compared to the FFNNs, as the lowest accuracy found did not fall below 98.20%. Similar judgments can be made in a quantitative analysis of the MSE indicator, where the maximum error for the PNNs (0.0051) is roughly tenfold lower than the largest MSE reported for the feed-forward structures (0.0527).
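The PNN decision rule whose spread sensitivity is analyzed above can be sketched briefly: each training sample contributes a Gaussian RBF kernel, the kernels are summed per class, and the class with the largest summed response wins. The spread value and toy data below are illustrative assumptions in Python, not the study's MATLAB configuration:

```python
# Probabilistic Neural Network sketch: per-class sums of Gaussian RBF
# kernels centered on the training samples; the spread parameter controls
# the kernel width, mirroring the criterion varied in the experiments.
import math

def pnn_predict(train, x, spread):
    """train: {label: [feature vectors]}; returns the winning label."""
    def kernel(a, b):
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-d2 / (2 * spread ** 2))
    score = {c: sum(kernel(x, v) for v in rows) / len(rows)
             for c, rows in train.items()}
    return max(score, key=score.get)

train = {"P1": [[0.0, 0.0], [0.1, 0.2]],
         "P2": [[2.0, 2.0], [2.1, 1.9]]}
print(pnn_predict(train, [0.2, 0.1], spread=0.5))  # near class P1
print(pnn_predict(train, [1.9, 2.0], spread=0.5))  # near class P2
```

A small spread makes each kernel sharply local; as the spread grows, distant training samples start contributing to every class score, which is one way to read the error growth observed beyond spread 0.600.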
The confusion matrices and error diagrams in Figure 4 and Figure 5 confirm the advantages of the synthesized FFNN with 35 hidden neurons and of the PNN with spread values from 0.500 to 0.575. A markedly stronger minimization of the misclassifications assigned by the matrix rows to incorrect output groups was observed than with the Linear and Quadratic Discriminant classifiers. A variation range of −0.7561 to 0.6393 was determined for the network errors of the final FFNN model. The larger error fluctuations, established for the voice samples of “Person No. 3” to “Person No. 7”, fall within a relatively close range, while lower variations were observed for “Person No. 1” and “Person No. 2”, which were also subject to voice profile identification. Significantly lower are the variations found in the selected PNN classification model, with sharply limited increases only in the fourth and seventh output groups, which can be ignored.
4. Decision Tree Modeling for Voice Profile Recognition
In the next stage of the research, structures for multivariate selection of classification decisions were modeled using the CART algorithm. In accordance with the specifics of the model synthesis procedure for voice authentication, 24 classification models were generated, corresponding to a basic structure at Pruning Level “0” and structures with sequential removal of nodal branches at Pruning Levels “1” to “23” (Table 3). The tests applied with the Resubstitution and cross-validation techniques on the initially generated model with 49 nodes show high levels of accuracy: 98.5000% and 93.7857%, respectively. The minimization of the building nodes at Pruning Levels “21” and “22” was tied to a significant decrease in the efficiency of the Decision Tree (DT) models, where accuracies of around 65.0000% and 57.0000% were observed. At the final structural reduction, Pruning Level “23”, the classification accuracy dropped to only 14.2857% for both Resubstitution and cross-validation.
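The split criterion behind such CART models can be sketched in a few lines: at each node, the algorithm picks the threshold that minimizes the weighted Gini impurity of the two child nodes. The single-feature toy data below are an illustrative Python assumption, not the study's voice features:

```python
# CART-style split selection on one feature: evaluate every candidate
# threshold between adjacent sorted values and keep the one with the
# lowest weighted Gini impurity of the resulting children.
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted child impurity) of the best split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if imp < best[1]:
            best = (thr, imp)
    return best

values = [0.1, 0.3, 0.4, 2.0, 2.2, 2.5]
labels = ["P1", "P1", "P1", "P2", "P2", "P2"]
thr, imp = best_split(values, labels)
print(thr, imp)  # a threshold between 0.4 and 2.0 with impurity 0.0
```

Pruning then removes branches whose splits contribute least, trading node count against accuracy as traced across the 24 levels in Table 3.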
Following the performance analysis of the models, the Best Pruning Level “2” DT model with 41 nodes was selected, as it maintains an optimal solution structure with acceptable classification accuracy. Based on the chosen classification architecture, an approximately expected accuracy of 95.85425% when operating with new voice data was calculated. The specification of the variables in the overall processes of training, verification, and evaluation of the effectiveness of the models using the Decision Tree method is shown in Figure 6. The distribution of correctly and incorrectly classified voice samples for the found optimum (Pruning Level “2”) and for the structure with the worst performance (Pruning Level “23”) is presented in Figure 7. The unsuitability of the last-mentioned structure for multivariate decision selection is clearly confirmed by the fact that successful authentication was achieved only for the first person subject to voice analysis.
5. Naïve Bayes Algorithm in Voice Identification
The last module of the methodology for selecting models for personalizing user access via voice analysis instruments provides for the implementation of Naïve Bayes (NB) classification. In the NB approach, two variants of the probabilistic description of the input data were set, using Gaussian and Kernel distributions. For the created NB models with the specified distributions, Resubstitution and cross-validation procedures similar to those of the other approaches were performed, as shown in Figure 8. In the case of the normal distribution, accuracies of 70.93% and 70.64% were obtained on the input information set. With the introduction of the Kernel distribution instruments, an increase to 76.36%, 74.16%, and 75.26% was achieved for Resubstitution, cross-validation, and the approximately expected new-data accuracy, respectively. The specification reflects the input variables during training, the assigned labels of predictors and classification groups, the created NB models, specific evaluations in the separate phases of the tests for functional belonging, etc.
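The Kernel-distribution variant contrasted with the Gaussian one above can be sketched as follows: instead of a single Gaussian per feature, each feature's class-conditional density is a kernel density estimate over the training values. The bandwidth and toy data are illustrative Python assumptions, not the study's MATLAB setup:

```python
# Kernel-distribution Naive Bayes sketch: per-feature kernel density
# estimates, combined naively (as a product, i.e., a sum of logs) across
# features, with the highest posterior score winning.
import math

def kde(value, samples, h):
    """Gaussian kernel density estimate at one point with bandwidth h."""
    return sum(math.exp(-((value - s) / h) ** 2 / 2) for s in samples) \
           / (len(samples) * h * math.sqrt(2 * math.pi))

def predict_kernel_nb(train, x, h=0.5):
    """train: {label: [feature vectors]}; naive product over features."""
    def log_post(c):
        rows = train[c]
        return sum(math.log(kde(xj, [r[j] for r in rows], h))
                   for j, xj in enumerate(x))
    return max(train, key=log_post)

train = {"P1": [[0.0, 0.2], [0.1, 0.0], [-0.2, 0.1]],
         "P2": [[3.0, 2.9], [3.1, 3.2], [2.9, 3.0]]}
print(predict_kernel_nb(train, [0.0, 0.1]))  # near class P1
print(predict_kernel_nb(train, [3.0, 3.0]))  # near class P2
```

Replacing `kde` with a single Gaussian per feature recovers the normal-distribution NB variant whose lower accuracies are reported above.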
The confusion matrices in Figure 9 show the highest recognition rates of the voice samples for the first, second, third, and seventh persons with the Gaussian technique, and for the second, first, third, and sixth persons with the Kernel distribution, within the target group for voice identification.
6. Conclusions
The empirically established accuracy thresholds for personal authentication based on voice analysis with the proposed methodology show very good applicability of Discriminant Analysis, Decision Trees, Feed-Forward Networks, and Probabilistic Networks. In this particular aspect, the developed methodology for the synthesis of voice identification models can be implemented in systems for security management and user access authorization. Regarding the emerging need to raise the classification accuracy of the probabilistic models created with the Naïve Bayes algorithm above the threshold of 80.00%, preprocessing procedures in the frequency domain of the voice profile manipulations were planned. Similar activities would also be of interest for the Linear and Quadratic classifiers. Another important point is the search for new potential Machine Learning methods and algorithms with a high recognition success rate; examples include Support Vector Machines, Adaptive Neuro-Fuzzy Inference Systems, and k-Nearest Neighbors, among others.
Author Contributions
Conceptualization, I.B. and G.G.; methodology, I.B., K.S. and G.G.; software, I.B., K.S. and G.G.; validation, I.B., K.S. and G.G.; formal analysis, I.B; investigation, I.B., K.S. and G.G.; resources, K.S.; data curation, I.B., K.S. and G.G.; writing—original draft preparation, I.B., K.S. and G.G.; writing—review and editing, I.B., K.S. and G.G.; visualization, I.B., K.S. and G.G.; supervision, I.B.; project administration, G.G.; funding acquisition, Internal project for Technical University of Gabrovo, Bulgaria. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Detailed information about the presented article can be freely obtained by contacting the authors.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Dudhrejia, H.J.; Shah, S.A. Speech recognition using neural networks. Int. J. Eng. Res. Technol. 2018, 7, 196–202. [Google Scholar]
- Kamble, B.C. Speech recognition using artificial neural network—A review. Int. J. Comput. Commun. Instrum. Eng. 2016, 3, 61–64. [Google Scholar]
- Javanmardi, F.L.; Kadari, S.R.; Alku, P.K. A comparison of data augmentation methods in voice technology. Comput. Speech Lang. 2023, 83, 101552. [Google Scholar] [CrossRef]
- Alsobhani, A.; Albboodi, H.M.; Mahdi, H.L. Speech recognition using convolutional deep neural networks. J. Phys. Conf. Ser. 2021, 1973, 012166. [Google Scholar] [CrossRef]
- Rady, E.R.; Hassen, A.; Nassan, N.M.; Hesham, M.U. Convolutional neural network for Arabic speech recognition. Egypt. J. Lang. Eng. 2021, 8, 27–38. [Google Scholar]
- Kadiri, S.R.; Javanmadri, F.J.; Alku, P.I. Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals. Comput. Speech Lang. 2023, 83, 101550. [Google Scholar] [CrossRef]
- Isvamko, D.R.; Ryuman, D.P. Development of visual and audio speech recognition systems using deep neural networks. In Proceedings of the International Conference on Computer and Vision, Nizhny Novgorod, Russia, 27–30 September 2021. [Google Scholar]
- Graves, A.N.; Mohamed, A.R.; Hinton, G.U. Speech recognition with recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
- Song, W.S.; Cai, J. End-to-end deep neural network for automatic speech recognition. J. Comput. Sci. 2015, 1, 1–8. [Google Scholar]
- Sheikh, I.P.; Vincent, E.P.; Illina, I.F. Training RNN language models on uncertain ASR hypothesis in limited data scenarios. Comput. Speech Lang. 2023, 83, 101555. [Google Scholar] [CrossRef]
- Sridhar, C.D.; Kanhe, A.R. Performance comparison of various neural networks for speech recognition. In Proceedings of the International Conference on Communications Systems, Karaikal, India, 4–8 January 2022. [Google Scholar]
- Okay, M.O.; Akin, E.; Asian, O.; Kosunaip, S.; Iliev, T.B.; Stoyanov, I.S.; Beloev, I. A comprehensive survey: Evaluating the efficiency of artificial intelligence and machine learning techniques on cyber security solutions. IEEE Access 2024, 12, 12229–12256. [Google Scholar] [CrossRef]
- Shaughnessy, D.K. Trends and developments in automatic speech recognition research. Comput. Speech Lang. 2023, 83, 101538. [Google Scholar] [CrossRef]
- Chowdhury, S.A.; Durrani, N.M.; Ali, A.G. What do end-to-end speech models learn about speaker, language and channel information? A layer-wise and neuron-level analysis. Comput. Speech Lang. 2023, 83, 101539. [Google Scholar] [CrossRef]
- Rudregowda, S.S.; Patilkulkurni, S.H.; Ravi, V.Y.; Gururaj, H.L. Audiovisual speech recognition based on a deep convolutional neural network. Data Sci. Manag. 2024, 7, 25–34. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).