Speech Recognition Using Machine Learning
Received December 31, 2020; Revised February 14, 2021; Accepted March 7, 2021; Published June 30, 2021
Abstract: Speech recognition is one of the fastest-growing engineering technologies. It has applications in many different areas and offers many potential benefits. Many people are unable to communicate because of language barriers, and we aim to reduce this barrier through our project, which was designed and developed to help people share information by operating a computer using voice input. With that goal in mind, an effort was made to ensure our project can recognize speech and convert input audio into text; it also enables a user to perform file operations such as Save, Open, or Exit from voice-only input. We designed a system that can recognize the human voice as well as audio clips and translate between English and Hindi. The output is in text form, and we provide options to convert audio from one language to the other. Going forward, we expect to add functionality that provides dictionary meanings for Hindi and English words. Neural machine translation is the primary algorithm used in industry to perform machine translation, and its architecture is an encoder–decoder structure built from two recurrent neural networks used in tandem. This work on speech recognition starts with an introduction to the technology and its applications in different sectors; part of the report is based on software developments in speech recognition.
Keywords: Speech recognition, Speech emotion recognition, Statistical classifiers, Dimensionality reduction techniques, Emotional speech databases, Vision processing, Computational intelligence, Machine learning, Computer vision
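To make the system described above concrete, the following minimal sketch shows a speech-to-text step followed by English-to-Hindi translation. It is illustrative only: the third-party SpeechRecognition package and the unofficial googletrans library are assumed stand-ins, not the components actually used in this project.

import speech_recognition as sr          # assumed: pip install SpeechRecognition
from googletrans import Translator       # assumed: unofficial googletrans package

recognizer = sr.Recognizer()
with sr.AudioFile("input.wav") as source:            # hypothetical audio clip
    audio = recognizer.record(source)                # read the entire file

# Convert the audio to English text via a web speech API.
english_text = recognizer.recognize_google(audio, language="en-IN")

# Translate the recognized text from English to Hindi.
hindi_text = Translator().translate(english_text, src="en", dest="hi").text
print(english_text)
print(hindi_text)

The same recognized text could then be matched against keywords such as "save", "open", or "exit" to trigger the voice-driven file operations mentioned above.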
2. Literature Review
Mehmet Berkehan Akçay et al. [1] explained that neural networks were once limited mainly to industrial control and robotics applications. However, recent advances in neural networks, applied to intelligent travel, intelligent diagnosis and health monitoring for precision medicine, robotics and home appliance automation, virtual online support, e-marketing, and weather forecasting and natural disaster management, among others, have contributed to successful intelligent system implementations in almost every aspect of human life.

Fig. 2. Recognition Process.
G. Tsontzos et al. [2] clarified how feelings allow us to understand each other better, and a natural consequence is to extend this understanding to computers. Thanks to smart mobile devices capable of accepting voice commands and responding with synthesized speech, speech recognition is now part of our daily lives. Speech emotion recognition (SER) could be used to allow such devices to detect our emotions as well.

T. Taleb et al. [7] said they were motivated by the understanding that these assumptions place upper bounds on the improvement that can be achieved when using HMMs in speech recognition. In an attempt to improve robustness, particularly under noisy conditions, new modeling schemes that can explicitly model time are being explored; this work was partially funded by the EU-IST FP6 HIWIRE research project. State-space models, including linear dynamic models (LDMs), were initially proposed for use in speech recognition.

Vinícius Maran et al. [6] explained that learning speech is a dynamic mechanism in which the processing of phonemes is marked by continuities and discontinuities along the infant's path toward advanced production of the segments and structures of the ambient language.

Y. Wu et al. [3] noted that discriminative training has been used in speech recognition for many years. The few organizations with the resources to implement discriminative training for large-scale speech recognition tasks have mostly used the maximum mutual information (MMI) criterion in recent years. Instead, in an extension of the studies presented first, they consider the minimum classification error (MCE) paradigm for discriminative training.
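For context, the MMI criterion mentioned above is usually written as follows; this is the standard textbook form for acoustic model parameters $\lambda$, not an equation taken from [3]:

$$\mathcal{F}_{\mathrm{MMI}}(\lambda)=\sum_{u}\log\frac{p_{\lambda}(X_{u}\mid H_{w_{u}})\,P(w_{u})}{\sum_{w}p_{\lambda}(X_{u}\mid H_{w})\,P(w)}$$

where $X_u$ is the acoustic observation sequence of utterance $u$, $w_u$ its reference word sequence, $H_w$ the composite HMM for word sequence $w$, and $P(w)$ the language-model probability. MCE training instead directly minimizes a smoothed count of sentence-level classification errors.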
Peng et al. [4] stated that speaker identification refers to identifying people by their voice. This technology is increasingly adopted as a kind of biometrics for its ease of use and non-intrusive nature, and it soon became a research hotspot in the field of biometrics.

Shahnawazuddin and Sinha [10] discussed how the work presented was an extension of existing fast adaptation approaches based on acoustic model interpolation, in which the basis (model) weights are calculated during adaptation.

In machine translation, typical failures include loss of context, incorrect word translation, and text that was present in the source going missing from the output. Neural machine translation is the use of a neural network to learn a mathematical model for machine translation. The key benefit of the methodology is that a single framework can be trained directly on the source and target text, which no longer requires the pipeline of complex systems used in statistical machine translation [5].

· Connected Speech: Connected words or connected speech is similar to isolated speech, but it allows separate utterances to be run together with only brief pauses between them.
· Continuous Speech: Continuous speech allows the user to speak almost naturally; it is also called computer dictation.
· Spontaneous Speech: At a basic level, this can be viewed as speech that is natural-sounding and not rehearsed. An ASR device with spontaneous speech abilities should be able to accommodate a variety of natural speech features, such as sentences that run together, "ums" and "ahs", and even slight stutters.

Machine Translation: Machine translation typically models whole sentences in a single integrated model, using an artificial neural network to predict the sequence of words. Initially, word sequence modeling was usually carried out with a recurrent neural network (RNN). Unlike the traditional phrase-based translation method, which consists of many small subcomponents that are tuned separately, neural machine translation builds and trains a single, broad neural network that reads a sentence and outputs the correct translation. Such a system is called end-to-end neural machine translation because only one model is needed for translation. The transfer of scientific, metaphysical, literary, commercial, political, and artistic knowledge across linguistic barriers is an integral and essential component of human endeavor [4]. Translation is more prevalent and available today than ever before.
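As an illustration of the encoder–decoder structure described above, here is a minimal sketch in TensorFlow/Keras. The class names, layer sizes, and use of GRU cells are our own assumptions for illustration, not the paper's actual implementation.

import tensorflow as tf

class Encoder(tf.keras.Model):
    """Reads the source sentence and summarizes it into hidden states."""
    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)

    def call(self, token_ids):
        seq, state = self.gru(self.embedding(token_ids))
        return seq, state        # per-token annotations and final hidden state

class Decoder(tf.keras.Model):
    """Predicts target tokens one step at a time from its previous output."""
    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)
        self.out = tf.keras.layers.Dense(vocab_size)

    def call(self, token_ids, state):
        seq, state = self.gru(self.embedding(token_ids), initial_state=state)
        return self.out(seq), state   # logits over the target vocabulary

During training, the decoder is initialized with the encoder's final state and fed the ground-truth target tokens (teacher forcing); at inference time, it is fed its own previous predictions instead.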
4.2 Translate

The evaluation function is similar to the training loop, except that we do not use teacher forcing here. At each time step, the input to the decoder is its previous prediction, along with its hidden state and the encoder output. Prediction stops once the model predicts the end token, and the attention weights are stored for every time step, as in the sketch below.

Fig. 7. Third Translation from English to Hindi.
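A minimal sketch of such an evaluation step follows, assuming an encoder/decoder pair like the one sketched earlier, extended so the decoder also consumes the encoder output and returns its attention weights. The tokenizer objects and the <start>/<end> tokens are likewise assumptions, not the project's actual code.

import tensorflow as tf

def evaluate(sentence, encoder, decoder, src_tok, tgt_tok, max_len=40):
    """Greedy decoding without teacher forcing (all names illustrative)."""
    src_ids = tf.constant(src_tok.texts_to_sequences([sentence]))  # fitted Keras tokenizer assumed
    enc_out, state = encoder(src_ids)

    dec_input = tf.constant([[tgt_tok.word_index["<start>"]]])
    words, attention = [], []
    for _ in range(max_len):
        # Assumed decoder signature: returns logits, new state, attention weights.
        logits, state, attn = decoder(dec_input, state, enc_out)
        attention.append(attn)                          # store attention for this step
        next_id = int(tf.argmax(logits[0, -1]))
        if tgt_tok.index_word.get(next_id) == "<end>":  # stop at the end token
            break
        words.append(tgt_tok.index_word.get(next_id, ""))
        dec_input = tf.constant([[next_id]])            # feed back the previous prediction
    return " ".join(words), attention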
A translator is a programming-language processor that converts a computer program from one language to another. It takes a program written in source code and converts it into machine code; that is, it translates a high-level language program into a machine-language program that the central processing unit (CPU) can understand. It also discovers and identifies errors during translation.

5. Conclusion
In the past few years, the complexity and precision of speech recognition applications have evolved exponentially. This paper extensively explores recent advancements in intelligent vision and speech algorithms, their applications on the most popular smartphones and embedded platforms, and their application limitations. In spite of immense advances in the success and efficacy of deep learning algorithms, training the machine with other knowledge sources, which form the framework, also contributes significantly to the subject.

6. Future Scopes

This work can be explored in greater depth in order to improve it and incorporate new functionality into the project, and it can be developed further. The current software does not accommodate a broad vocabulary, so work remains to accumulate a larger number of samples and maximize productivity [10]. Only a few parts of the notepad are covered by the current edition of the app, but more areas can be covered, and efforts will be made in this respect.

Acknowledgement

The authors would like to express their sincere thanks to the editor-in-chief for his valuable suggestions to improve this article.

References

[1] Mehmet Berkehan Akçay, Kaya Oğuz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Communication, vol. 116, pp. 56-76, 2020, ISSN 0167-6393. Article (CrossRefLink).
[2] G. Tsontzos, V. Diakoloukas, C. Koniaris, and V. Digalakis, "Estimation of general identifiable linear dynamic models with an application in speech characteristics vectors," Computer Standards & Interfaces, vol. 35, no. 5, pp. 490-506, 2013, ISSN 0920-5489.
[3] Y. Wu et al., "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," arXiv preprint arXiv:1609.08144, pp. 1-23, 2016.
[4] Shuping Peng, Tao Lv, Xiyu Han, Shisong Wu, Chunhui Yan, Heyong Zhang, "Remote speaker recognition based on the enhanced LDV-captured speech," Applied Acoustics, vol. 143, pp. 165-170, 2019, ISSN 0003-682X. Article (CrossRefLink).
[5] A. A. Varghese, J. P. Cherian, and J. J. Kizhakkethottam, "Overview on emotion recognition system," 2015 International Conference on Soft-Computing and Networks Security (ICSNS), Coimbatore, 2015, pp. 1-5. Article (CrossRefLink).
[6] Vinícius Maran, Marcia Keske-Soares, "Towards a speech therapy support system based on phonological processes early detection," Computer Speech & Language, vol. 65, 101130, 2021, ISSN 0885-2308. Article (CrossRefLink).
[7] T. Taleb, K. Samdanis, B. Mada, H. Flinck, S. Dutta, and D. Sabella, "On Multi-Access Edge Computing: A Survey of the Emerging 5G Network Edge Cloud Architecture and Orchestration," IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1657-1681, 2017. Article (CrossRefLink).
[8] J. C. Amengual, A. Castaño, A. Castellanos, V. M. Jiménez, D. Llorens, A. Marzal, F. Prat, J. M. Vilar, J. M. Benedí, F. Casacuberta, M. Pastor, E. Vidal, "The EuTrans spoken language translation system," Machine Translation, vol. 15, pp. 75-103, 2000.
[9] D. J. Atha, M. R. Jahanshahi, "Evaluation of deep learning approaches based on convolutional neural networks for corrosion detection," Structural Health Monitoring, vol. 17, no. 5, pp. 1110-1128, 2018. Article (CrossRefLink).
[10] S. Shahnawazuddin, Rohit Sinha, "Sparse coding over redundant dictionaries for fast adaptation of speech recognition system," Computer Speech & Language, vol. 43, pp. 1-17, 2017, ISSN 0885-2308. Article (CrossRefLink).

Satya Prakash Yadav is currently on the faculty of the Information Technology Department, ABES Institute of Technology (ABESIT), Ghaziabad (India). A seasoned academician with more than 13 years of experience, he has published three books (Programming in C, Programming in C++, and Blockchain and Cryptocurrency) under I.K. International Publishing House Pvt. Ltd. He has undergone industrial training programs during which he was involved in live projects with companies in the areas of SAP, Railway Traffic Management Systems, and Visual Vehicles Counter and Classification (used in Metro rail network design). He is an alumnus of Netaji Subhas Institute of Technology (NSIT), Delhi University. A prolific writer, Mr. Yadav has filed two patents and authored many research papers in Web of Science indexed journals. Additionally, he has presented research papers at many conferences in the areas of image processing and programming, including feature extraction and information retrieval. He is also a lead editor with CRC Press, Taylor & Francis Group (USA), Science Publishing Group (USA), and Eureka Journals, Pune (India).

Vineet Vashisht is currently a research scholar in the Information Technology Department at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. He is supervised by Asst. Prof. Satya Prakash Yadav of the Information Technology Department, ABES Institute of Technology (ABESIT).