1. Introduction
Voice recognition is a complex process executed as a sequence of components: (1) Speech Preprocessing; (2) Feature Extraction; (3) Speech Classification; and (4) Recognition [1,2]. Feature extraction modules in voice diagnostics analyze spectrograms and mel-spectrograms to acquire Spectral Features, i.e., sets of static or dynamic Mel-Frequency Cepstral Coefficients (MFCCs) [3,4,5]. The input voice spectrograms are obtained with the Short-Time Fourier Transform (STFT), while the cepstral coefficients are derived with the Discrete Cosine Transform (DCT). The generated feature datasets serve as parametric input units of different dimensions for building classification models [6,7,8]. Published voice classifiers rely on Machine Learning (ML) and Deep Learning (DL) methods such as the Support Vector Machine (SVM), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and the Bidirectional LSTM (Bi-LSTM) network [9,10,11,12]. Algorithmic and neural efficiency is analyzed on heterogeneous datasets under the following conceptual tasks: (1) Gender Classification (GC); (2) Speaker Verification (SV); (3) Language Identification (LID); (4) Regional Dialect Identification (DID); and (5) Channel Classification (CC) [13,14,15].
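The first pipeline stages named above (preprocessing and spectral feature extraction) can be illustrated with a minimal sketch. The study's toolchain is MATLAB, but the following pure-Python fragment, with an illustrative toy signal, shows the framing, windowing, and short-time spectrum steps that precede MFCC computation:

```python
# Sketch of STFT-style feature extraction: framing, Hamming windowing,
# and a naive DFT power spectrum. The signal, frame size, and hop length
# are illustrative assumptions, not the study's measurement settings.
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def power_spectrum(frame):
    """Naive DFT magnitude-squared spectrum (bins 0..n/2)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(-frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spec.append(re * re + im * im)
    return spec

def stft_frames(signal, size, hop):
    """Split the signal into overlapping frames and window each one."""
    w = hamming(size)
    return [[s * h for s, h in zip(signal[i:i + size], w)]
            for i in range(0, len(signal) - size + 1, hop)]

# Toy signal: a pure tone of 8 cycles per 64-sample frame; its short-time
# spectrum should peak at DFT bin 8 in every frame.
N, f = 64, 8
tone = [math.sin(2 * math.pi * f * t / N) for t in range(2 * N)]
spec = power_spectrum(stft_frames(tone, N, N // 2)[0])
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(peak_bin)  # index of the spectral peak
```

Applying a mel filterbank and a DCT to such per-frame spectra would yield the MFCC features discussed above.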
In relation to the purpose of the present study, statistical characteristics of sound level were extracted experimentally from the voices of seven male and female individuals pronouncing selected speech commands. The procedures included two types of sound measurements: No. 1, all available acoustic and audio measurements; and No. 2, measurements of sound levels below 100 dB, taken at fixed temporal intervals of 15 s for every registered vocal profile. Both categories of sound parameters were recorded by a portable device running a dedicated sound-analyzer application. During the preliminary selection of informative features among the formed input datasets with different combinations of sound parameters, tests with artificial neural networks were performed to check the correctness of voice recognition. The best indicators were obtained with category No. 2, comprising LAE (A-weighted sound exposure level) [dBA]; LAeq (A-weighted equivalent continuous sound level) [dBA]; LAF (A-weighted sound level, fast time constant) [dBA]; LAS (A-weighted sound level, slow time constant) [dBA]; and LAI (A-weighted sound level, impulse time constant) [dBA].
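As a hedged illustration of how two of the listed descriptors relate, the following pure-Python sketch computes LAeq and LAE from a toy pressure series; the A-weighting filter applied by a real sound analyzer is omitted, and the sample values and 15 s interval are assumptions:

```python
# LAeq (equivalent continuous level) averages squared sound pressure over
# the interval; LAE (sound exposure level) normalizes the same energy to
# a 1 s reference duration. A-weighting is assumed to have been applied
# upstream and is not modeled here.
import math

P0 = 20e-6  # reference sound pressure, 20 micropascals

def laeq(pressures):
    """Equivalent continuous level in dB over the sampled interval."""
    mean_sq = sum(p * p for p in pressures) / len(pressures)
    return 10 * math.log10(mean_sq / P0 ** 2)

def lae(pressures, duration_s):
    """Sound exposure level: LAeq referred to a 1 s duration."""
    return laeq(pressures) + 10 * math.log10(duration_s / 1.0)

# A constant 1 Pa pressure corresponds to 94 dB, the usual calibrator level.
samples = [1.0] * 1000
print(round(laeq(samples), 1))       # ~94.0 dB
print(round(lae(samples, 15.0), 1))  # 94 dB plus 10*log10(15)
```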
This spectrum of sound features was selected after benchmarking performance and recognition accuracy with initial variants of neural networks, and was subsequently used as the basic input unit for training and verification of voice profile identification models. The paper synthesizes a methodology for voice profile identification for security and personalization of information access, based on probabilistic analysis tools implemented entirely in MATLAB R2014a. The following analytics tools were introduced: Discriminant Analysis; Feed-Forward Neural Networks; Probabilistic Neural Networks; CART Decision Trees; and Naive Bayes classification. Through performance evaluation of the analytical modules, final classification models with a high recognition index for the voice samples in the dataset were selected.
2. Discriminant Analysis in Voice Profile Recognition
In the initial stage of the studies, the possibility of adapting Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) classifiers and their subvariants was considered as follows:
Diagonal Linear Discriminant Analysis (DLDA);
Pseudo Linear Discriminant Analysis (PLDA);
Diagonal Quadratic Discriminant Analysis (DQDA);
Pseudo-Quadratic Discriminant Analysis (PQDA).
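The diagonal variants above can be sketched compactly: both model each feature as an independent Gaussian, DLDA pooling one variance per feature across classes and DQDA keeping a variance per class. The study used MATLAB classifiers; the following pure-Python fragment with illustrative two-class toy data shows the DQDA decision rule:

```python
# Minimal diagonal discriminant sketch: per-class means and per-class
# diagonal variances (DQDA). The toy training data are an illustrative
# assumption, not the study's voice features.
import math

def fit_diag(train):
    """train: {label: [feature vectors]} -> {label: (means, variances)}."""
    stats = {}
    for label, rows in train.items():
        d = len(rows[0])
        mu = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        var = [sum((r[j] - mu[j]) ** 2 for r in rows) / len(rows) for j in range(d)]
        stats[label] = (mu, var)
    return stats

def log_gauss(x, mu, var):
    """Log-likelihood of x under independent Gaussians per feature."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mu, var))

def predict_dqda(stats, x):
    """Assign x to the class with the highest diagonal-Gaussian score."""
    return max(stats, key=lambda c: log_gauss(x, *stats[c]))

train = {"A": [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
         "B": [[3.0, 3.1], [2.8, 2.9], [3.2, 3.0]]}
stats = fit_diag(train)
print(predict_dqda(stats, [0.1, 0.0]))  # near class "A"
print(predict_dqda(stats, [3.0, 3.0]))  # near class "B"
```

Pooling the variances across classes in `fit_diag` would turn this into the DLDA variant.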
A set of criteria was adopted to examine the quality of the created classification models: (1) Loss; (2) Accuracy; and (3) Misclassifications, determined with two basic approaches: Resubstitution and k-fold cross-validation. The results of these activities are shown in Table 1. The obtained quantitative estimates show a significant advantage of the Quadratic over the Linear models. Under Resubstitution, low accuracy thresholds of 55.36% with DLDA and 66.14% with LDA and PLDA were recorded. The indicators of the cross-validation procedures turned out to be similar: 66.30%, 55.36%, and 66.36% for LDA, DLDA, and PLDA, respectively. The results achieved with the Quadratic classifiers were better: Resubstitution Loss values of 0.1321 (equivalent to 86.79% accuracy) were observed for QDA and PQDA, against 0.2907 (70.93%) for DQDA. Regarding cross-validation Loss, accuracies of 85.86%, 70.71%, and 86.36% were obtained for QDA, DQDA, and PQDA.
Based on the analysis of the results, a classification model using Pseudo-Quadratic Discriminant Analysis was selected, with an approximately expected accuracy of 86.575% when operating with voice samples not involved in the training and test procedures. A specification of the classifier synthesis variables is given in Figure 1.
Figure 2 visualizes the results of assigning the voice samples to the defined classification groups for the models with the lowest and highest established quality indicators (DLDA and PQDA, respectively). Correctly classified data are arranged diagonally from upper left to lower right, and misclassifications are distributed along the remaining positions of each matrix row. The largest amount of incorrectly classified data was found in the sample for voice analysis object “Person No. 4”: 1, 20, 61, and 38 samples were wrongly assigned to the groups of “Person No. 3”, “Person No. 5”, “Person No. 6”, and “Person No. 7”, respectively.
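The confusion-matrix reading described above (correct counts on the diagonal, misclassifications spread along each row) can be reproduced in a few lines; the labels below are illustrative stand-ins for the seven speaker classes, not the study's data:

```python
# Build a confusion matrix: row = true class, column = predicted class.
def confusion_matrix(actual, predicted, labels):
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        m[idx[a]][idx[p]] += 1
    return m

labels = ["P1", "P2", "P3"]
actual    = ["P1", "P1", "P2", "P2", "P3", "P3"]
predicted = ["P1", "P2", "P2", "P2", "P3", "P1"]
cm = confusion_matrix(actual, predicted, labels)
correct = sum(cm[i][i] for i in range(len(labels)))  # diagonal counts
print(cm)                     # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(correct / len(actual))  # accuracy recovered from the diagonal
```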
3. Feed-Forward and Probabilistic Neural Networks in Voice Identification
In the course of the research related to the Artificial Intelligence concept, two categories of neural networks were introduced: Feed-Forward Neural Networks (FFNNs) and Probabilistic Neural Networks (PNNs). Validation performance was scored with the Mean-Squared Error (MSE) and classification accuracy indicators. The variation in these indicators was studied under a stepwise increment of the number of neurons in the hidden layer with hyperbolic tangent sigmoid activation for the FFNNs, and under variation of the spread parameter of the Radial Basis Function (RBF) layer with a kernel transfer function for the PNNs.
The results of the applied series of experiments are presented in Table 2. For the FFNNs, the MSE varied within the interval from 0.0113 to 0.0527, registered for networks with 35 and 5 hidden neurons, respectively. The corresponding accuracy equivalents of the established errors were a maximum of 97.90% (Figure 3a, for the selected FFNN model) and a minimum threshold of 73.20%.
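How the two FFNN scores above (MSE and classification accuracy) are obtained from network outputs can be sketched as follows; the tiny fixed weights form an illustrative one-hidden-layer tanh network in Python, not the trained 35-neuron MATLAB model from the study:

```python
# Forward pass of a one-hidden-layer tanh network, plus the two scoring
# indicators used above: mean-squared error against one-hot targets and
# argmax classification accuracy. All weights and data are illustrative.
import math

def forward(x, w_hid, w_out):
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hid]
    return [sum(wi * hi for wi, hi in zip(w, hidden)) for w in w_out]

def mse(targets, outputs):
    n = sum(len(t) for t in targets)
    return sum((ti - oi) ** 2 for t, o in zip(targets, outputs)
               for ti, oi in zip(t, o)) / n

def accuracy(targets, outputs):
    hits = sum(t.index(max(t)) == o.index(max(o))
               for t, o in zip(targets, outputs))
    return hits / len(targets)

w_hid = [[2.0, 0.0], [0.0, 2.0]]    # 2 hidden tanh neurons
w_out = [[1.0, 0.0], [0.0, 1.0]]    # 2 linear output neurons
inputs  = [[1.0, -1.0], [-1.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]  # one-hot class targets
outputs = [forward(x, w_hid, w_out) for x in inputs]
print(round(mse(targets, outputs), 4), accuracy(targets, outputs))
```

Both samples are classified correctly (accuracy 1.0) even though the raw MSE is nonzero, which is why the two indicators are reported separately in Table 2.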
For the Probabilistic Neural Network structures created with a fixed number of RBF neurons (Figure 3b), variations in the spread from 0.500 to 0.575 did not change the error, where a constant level of MSE = 4.0816 × 10⁻⁴ was observed. After the criterion was increased beyond 0.600, the error began a smooth and then faster exponential growth, reaching its highest value of 0.0051 at the limit spread of 0.925. The analysis shows more adequate behavior of the PNNs compared to the FFNNs, as the lowest accuracy found did not fall below 98.20%. Similar judgments can be made in a quantitative analysis of the MSE indicator, where the maximum error for the PNNs (0.0051) is roughly tenfold lower than the largest MSE reported for the feed-forward structures (0.0527).
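The PNN decision rule whose spread sensitivity is analyzed above can be sketched briefly: each training sample contributes a Gaussian RBF kernel, the kernels are summed per class, and the class with the largest summed response wins. The spread value and toy data below are illustrative assumptions in Python, not the study's MATLAB configuration:

```python
# Probabilistic Neural Network sketch: per-class sums of Gaussian RBF
# kernels centered on the training samples; the spread parameter controls
# the kernel width, mirroring the criterion varied in the experiments.
import math

def pnn_predict(train, x, spread):
    """train: {label: [feature vectors]}; returns the winning label."""
    def kernel(a, b):
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-d2 / (2 * spread ** 2))
    score = {c: sum(kernel(x, v) for v in rows) / len(rows)
             for c, rows in train.items()}
    return max(score, key=score.get)

train = {"P1": [[0.0, 0.0], [0.1, 0.2]],
         "P2": [[2.0, 2.0], [2.1, 1.9]]}
print(pnn_predict(train, [0.2, 0.1], spread=0.5))  # near class P1
print(pnn_predict(train, [1.9, 2.0], spread=0.5))  # near class P2
```

A small spread makes each kernel sharply local; as the spread grows, distant training samples start contributing to every class score, which is one way to read the error growth observed beyond spread 0.600.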
The confusion matrices and error diagrams in Figure 4 and Figure 5 confirm the advantages of the synthesized FFNN with 35 hidden neurons and of the PNN with spread values from 0.500 to 0.575. A markedly stronger minimization of the misclassifications assigned by the matrix rows to incorrect output groups was observed than with the Linear and Quadratic Discriminant classifiers. A variation range of −0.7561 to 0.6393 was determined for the network errors of the final FFNN model. The larger error fluctuations, established for the voice samples of “Person No. 3” to “Person No. 7”, fall within a relatively close range, while lower variations were observed for “Person No. 1” and “Person No. 2”, which were also subject to voice profile identification. Significantly lower are the variations found in the selected PNN classification model, with sharply limited increases only in the fourth and seventh output groups, which can be ignored.
4. Decision Tree Modeling for Voice Profile Recognition
In the next stage of the research, structures for multivariate selection of classification decisions were modeled using the CART algorithm. In accordance with the specifics of the model synthesis procedure for voice authentication, 24 classification models were generated, corresponding to a basic structure at Pruning Level “0” and structures with sequential removal of nodal branches at Pruning Levels “1” to “23” (Table 3). The tests applied with the Resubstitution and cross-validation techniques on the initially generated model with 49 nodes show high levels of accuracy: 98.5000% and 93.7857%, respectively. The minimization of the building nodes at Pruning Levels “21” and “22” was tied to a significant decrease in the efficiency of the Decision Tree (DT) models, where accuracies of around 65.0000% and 57.0000% were observed. At the final structural reduction, Pruning Level “23”, the classification accuracy dropped to only 14.2857% for both Resubstitution and cross-validation.
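The split criterion behind such CART models can be sketched in a few lines: at each node, the algorithm picks the threshold that minimizes the weighted Gini impurity of the two child nodes. The single-feature toy data below are an illustrative Python assumption, not the study's voice features:

```python
# CART-style split selection on one feature: evaluate every candidate
# threshold between adjacent sorted values and keep the one with the
# lowest weighted Gini impurity of the resulting children.
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted child impurity) of the best split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if imp < best[1]:
            best = (thr, imp)
    return best

values = [0.1, 0.3, 0.4, 2.0, 2.2, 2.5]
labels = ["P1", "P1", "P1", "P2", "P2", "P2"]
thr, imp = best_split(values, labels)
print(thr, imp)  # a threshold between 0.4 and 2.0 with impurity 0.0
```

Pruning then removes branches whose splits contribute least, trading node count against accuracy as traced across the 24 levels in Table 3.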
Following the performance analysis of the models, the Best Pruning Level “2” DT model with 41 nodes was selected, as it maintains an optimal solution structure with acceptable classification accuracy. Based on the chosen classification architecture, an approximately expected accuracy of 95.85425% when operating with new voice data was calculated. The specification of the variables in the overall processes of training, verification, and evaluation of the effectiveness of the models using the Decision Tree method is shown in Figure 6. The distribution of correctly and incorrectly classified voice samples for the found optimum (Pruning Level “2”) and for the structure with the worst performance (Pruning Level “23”) is presented in Figure 7. The unsuitability of the last-mentioned structure for multivariate decision selection is clearly confirmed by the fact that successful authentication was achieved only for the first person subject to voice analysis.
5. Naïve Bayes Algorithm in Voice Identification
The last module of the methodology for selecting models for personalizing user access via voice analysis instruments provides for the implementation of Naïve Bayes (NB) classification. In the NB approach, two variants of the probabilistic description of the input data were set, using Gaussian and Kernel distributions. For the created NB models with the specified distributions, Resubstitution and cross-validation procedures similar to those of the other approaches were performed, as shown in Figure 8. In the case of the normal distribution, accuracies of 70.93% and 70.64% were obtained on the input information set. With the introduction of the Kernel distribution instruments, an increase to 76.36%, 74.16%, and 75.26% was achieved for Resubstitution, cross-validation, and the approximately expected new-data accuracy, respectively. The specification reflects the input variables during training, the assigned labels of predictors and classification groups, the created NB models, specific evaluations in the separate phases of the tests for functional belonging, etc.
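The Kernel-distribution variant contrasted with the Gaussian one above can be sketched as follows: instead of a single Gaussian per feature, each feature's class-conditional density is a kernel density estimate over the training values. The bandwidth and toy data are illustrative Python assumptions, not the study's MATLAB setup:

```python
# Kernel-distribution Naive Bayes sketch: per-feature kernel density
# estimates, combined naively (as a product, i.e., a sum of logs) across
# features, with the highest posterior score winning.
import math

def kde(value, samples, h):
    """Gaussian kernel density estimate at one point with bandwidth h."""
    return sum(math.exp(-((value - s) / h) ** 2 / 2) for s in samples) \
           / (len(samples) * h * math.sqrt(2 * math.pi))

def predict_kernel_nb(train, x, h=0.5):
    """train: {label: [feature vectors]}; naive product over features."""
    def log_post(c):
        rows = train[c]
        return sum(math.log(kde(xj, [r[j] for r in rows], h))
                   for j, xj in enumerate(x))
    return max(train, key=log_post)

train = {"P1": [[0.0, 0.2], [0.1, 0.0], [-0.2, 0.1]],
         "P2": [[3.0, 2.9], [3.1, 3.2], [2.9, 3.0]]}
print(predict_kernel_nb(train, [0.0, 0.1]))  # near class P1
print(predict_kernel_nb(train, [3.0, 3.0]))  # near class P2
```

Replacing `kde` with a single Gaussian per feature recovers the normal-distribution NB variant whose lower accuracies are reported above.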
The confusion matrices in Figure 9 show the highest recognition rates of the voice samples for the first, second, third, and seventh persons with the Gaussian technique, and for the second, first, third, and sixth persons with the Kernel distribution, within the target group for voice identification.
6. Conclusions
The empirically established accuracy thresholds for personal authentication based on voice analysis with the proposed methodology show very good applicability of Discriminant Analysis, Decision Trees, Feed-Forward Networks, and Probabilistic Networks. In this particular aspect, the developed methodology for the synthesis of voice identification models can be implemented in systems for security management and user access authorization. Regarding the emerging need to raise the classification accuracy of the probabilistic models created with the Naïve Bayes algorithm above the threshold of 80.00%, preprocessing procedures in the frequency domain of the voice profile manipulations were planned. Similar activities would also be of interest for the Linear and Quadratic classifiers. Another important point is the search for new potential Machine Learning methods and algorithms with a high recognition success rate; examples include Support Vector Machines, Adaptive Neuro-Fuzzy Inference Systems, and k-Nearest Neighbors, among others.
Author Contributions
Conceptualization, I.B. and G.G.; methodology, I.B., K.S. and G.G.; software, I.B., K.S. and G.G.; validation, I.B., K.S. and G.G.; formal analysis, I.B; investigation, I.B., K.S. and G.G.; resources, K.S.; data curation, I.B., K.S. and G.G.; writing—original draft preparation, I.B., K.S. and G.G.; writing—review and editing, I.B., K.S. and G.G.; visualization, I.B., K.S. and G.G.; supervision, I.B.; project administration, G.G.; funding acquisition, Internal project for Technical University of Gabrovo, Bulgaria. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Detailed information about the presented article can be freely obtained by contacting the authors.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Dudhrejia, H.J.; Shah, S.A. Speech recognition using neural networks. Int. J. Eng. Res. Technol. 2018, 7, 196–202. [Google Scholar]
- Kamble, B.C. Speech recognition using artificial neural network—A review. Int. J. Comput. Commun. Instrum. Eng. 2016, 3, 61–64. [Google Scholar]
- Javanmardi, F.L.; Kadari, S.R.; Alku, P.K. A comparison of data augmentation methods in voice technology. Comput. Speech Lang. 2023, 83, 101552. [Google Scholar] [CrossRef]
- Alsobhani, A.; Albboodi, H.M.; Mahdi, H.L. Speech recognition using convolutional deep neural networks. J. Phys. Conf. Ser. 2021, 1973, 012166. [Google Scholar] [CrossRef]
- Rady, E.R.; Hassen, A.; Nassan, N.M.; Hesham, M.U. Convolutional neural network for Arabic speech recognition. Egypt. J. Lang. Eng. 2021, 8, 27–38. [Google Scholar]
- Kadiri, S.R.; Javanmadri, F.J.; Alku, P.I. Investigation of self-supervised pre-trained models for classification of voice quality from speech and neck surface accelerometer signals. Comput. Speech Lang. 2023, 83, 101550. [Google Scholar] [CrossRef]
- Isvamko, D.R.; Ryuman, D.P. Development of visual and audio speech recognition systems using deep neural networks. In Proceedings of the International Conference on Computer and Vision, Nizhny Novgorod, Russia, 27–30 September 2021. [Google Scholar]
- Graves, A.N.; Mohamed, A.R.; Hinton, G.U. Speech recognition with recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
- Song, W.S.; Cai, J. End-to-end deep neural network for automatic speech recognition. J. Comput. Sci. 2015, 1, 1–8. [Google Scholar]
- Sheikh, I.P.; Vincent, E.P.; Illina, I.F. Training RNN language models on uncertain ASR hypothesis in limited data scenarios. Comput. Speech Lang. 2023, 83, 101555. [Google Scholar] [CrossRef]
- Sridhar, C.D.; Kanhe, A.R. Performance comparison of various neural networks for speech recognition. In Proceedings of the International Conference on Communications Systems, Karaikal, India, 4–8 January 2022. [Google Scholar]
- Okay, M.O.; Akin, E.; Asian, O.; Kosunaip, S.; Iliev, T.B.; Stoyanov, I.S.; Beloev, I. A comprehensive survey: Evaluating the efficiency of artificial intelligence and machine learning techniques on cyber security solutions. IEEE Access 2024, 12, 12229–12256. [Google Scholar] [CrossRef]
- Shaughnessy, D.K. Trends and developments in automatic speech recognition research. Comput. Speech Lang. 2023, 83, 101538. [Google Scholar] [CrossRef]
- Chowdhury, S.A.; Durrani, N.M.; Ali, A.G. What do end-to-end speech models learn about speaker, language and channel information? A layer-wise and neuron-level analysis. Comput. Speech Lang. 2023, 83, 101539. [Google Scholar] [CrossRef]
- Rudregowda, S.S.; Patilkulkurni, S.H.; Ravi, V.Y.; Gururaj, H.L. Audiovisual speech recognition based on a deep convolutional neural network. Data Sci. Manag. 2024, 7, 25–34. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).