


IEIE Transactions on Smart Processing and Computing, vol. 10, no. 3, June 2021
https://doi.org/10.5573/IEIESPC.2021.10.3.233

Speech Recognition using Machine Learning


Vineet Vashisht*, Aditya Kumar Pandey, and Satya Prakash Yadav

Department of Information Technology, ABES Institute of Technology (ABESIT), Ghaziabad-201009, India

* Corresponding Author: Vineet Vashisht, vashishtvineet01@gmail.com

Received December 31, 2020; Revised February 14, 2021; Accepted March 7, 2021; Published June 30, 2021

* Regular Paper

* Review Paper: This paper reviews the recent progress, possibly including previous works in a particular research topic, and
has been accepted by the editorial board through the regular reviewing process.

Abstract: Speech recognition is one of the fastest-growing engineering technologies. It has several applications in different areas and provides many potential benefits. Many people are unable to communicate due to language barriers, and we aim to reduce this barrier via our project, which was designed and developed so that people can share information by operating a computer using voice input. With that factor in mind, an effort was made to ensure our project is able to recognize speech and convert input audio into text; it also enables a user to perform file operations like Save, Open, or Exit from voice-only input. We designed a system that can recognize the human voice as well as audio clips, and translate between English and Hindi. The output is in text form, and we provide options to convert audio from one language to the other. Going forward, we expect to add functionality that provides dictionary meanings for Hindi and English words. Neural machine translation is the primary algorithm used in industry to perform machine translation; the architecture behind it is two recurrent neural networks used in tandem to construct an encoder–decoder structure. This work on speech recognition starts with an introduction to the technology and the applications used in different sectors. Part of the report is based on software developments in speech recognition.

Keywords: Speech recognition, Speech emotion recognition, Statistical classifiers, Dimensionality reduction techniques, Emotional speech databases, Vision processing, Computational intelligence, Machine learning, Computer vision

1. Introduction

In this project we are trying to reduce the language barriers among people with a communication technique from amongst speech-trained systems that achieves better performance than those trained with normal speech. Speech emotion recognition is also used in call center applications and mobile wireless communications. This encouraged us to think of speech as a fast and powerful means of communicating with machines. The method of converting an acoustic signal, captured by a microphone or other instrument, into a set of words is speech recognition [1]. We use linguistic analysis to achieve speech comprehension. Everybody needs to engage with others in society, and we need to understand one another. It is also natural for individuals to expect computers to have a speech interface. In the present era, humans also need complex languages for interactions with machines that are hard to understand and use. A speech synthesizer converts written text into spoken language. Speech synthesis is also referred to as text-to-speech (TTS) conversion, as shown in Fig. 1.

Fig. 1. Speech Synthesis [2].

Speech synthesis is the artificial production of human speech. A computer used for this purpose is called a speech computer, or speech synthesizer, and it can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usages, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create completely synthetic voice output.

Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud. It is sometimes called read-aloud technology. With the click of a button or the touch of a finger, TTS can take words on a computer or other digital device and convert them into audio. A simple solution may be to ease this communication barrier with spoken language that can be understood by a computer. In this area, great progress has been made, but such systems still face the problem of limited vocabulary or complex grammar, along with the problem of retraining the system under different circumstances for different speakers. For applications that require normal human–machine interaction, such as web movies and computer demonstration applications, detection of emotion in speech is particularly useful, where the reaction of the system to the user depends on sensed emotion. Speech recognition–interface implementations include voice dialing (e.g., "call home"), call routing (e.g., "I would like to make a collect call"), home appliance control, keyword search (e.g., locating a podcast where particular words are spoken), basic data entry (e.g., entering a credit card number), formal document preparation (e.g., creating a radiology report), and direct voice input, where particular spoken words act as commands [8].

Visual processing is a term used to refer to the brain's ability to use and interpret visual information. The process of converting light energy into a meaningful image is a complex one, facilitated by numerous brain structures and higher-level cognitive processes. Advancements in speech- and visual-processing systems have facilitated considerable research and growth in the areas of human–computer interaction, biometric applications, protection and surveillance, and, most recently, computational behavioral analysis. Although IS has been enriched for several decades by conventional machine learning and evolutionary computation to solve complicated pattern recognition issues, these methods are limited in their ability to handle natural data or images in raw formats. A variety of computational steps are used before implementing machine learning models to derive representative features from raw data or images.

Speech Recognition Terminology: Recognition of speech is a technology that enables a device to catch the words spoken by a human into a microphone. These words are later processed through speech recognition and, ultimately, the system outputs recognized words. The speech recognition process consists of different steps that are discussed one by one in the following sections [6]. Speech translation is important because it allows speakers from around the world to communicate in their own languages, erasing the language gap in global business and cross-cultural exchanges. It would be of immense scientific, cultural, and economic importance for humanity to achieve universal speech translation. Our project breaks down the language barrier so that individuals can interact with each other in their preferred language. Speech recognition systems can be categorized into a number of groups according to their ability to understand the terms and the lists of words they hold. A desirable condition in the speech recognition process is that every spoken word is heard; ideally the recognition engine respects all words spoken by a person, but in practice the speech recognition engine's efficiency depends on a variety of factors. The key variables that count as dependent variables for a speech recognition engine are terminology, concurrent users, and noisy settings.
Speech Recognition Process: Translation is the communication of meaning from one language (the source) to another language (the target). Basically, speech synthesis is used for two main reasons. First and foremost is dictation, the conversion of spoken words into text as a form of speech processing; second is control of devices, which requires software that enables a person to run various voice applications [3]. The PC sound card generates the corresponding digital representation of audio received through microphone input. The method of translating the analog signal into digital form is digitization: sampling transforms a continuous signal into a discrete signal, while quantization is the method of approximating a continuous set of values with a discrete set of levels.
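To make the digitization step concrete, the short sketch below samples and quantizes a synthetic signal. The 16-kHz rate, 16-bit depth, and the sine tone are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative assumptions: 16 kHz sampling and 16-bit quantization are
# common choices for speech audio; the paper does not prescribe them.
sample_rate = 16000              # samples per second
bit_depth = 16                   # bits per sample
duration = 0.01                  # seconds of signal

# Sampling: evaluate the continuous signal at discrete time points.
t = np.arange(0, duration, 1.0 / sample_rate)
analog = 0.8 * np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz tone standing in for speech

# Quantization: approximate each continuous amplitude by one of
# 2**bit_depth discrete integer levels.
scale = 2 ** (bit_depth - 1) - 1
digital = np.round(analog * scale).astype(np.int16)

print(digital[:8])               # the discrete representation the sound card produces
```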
Attention models are input processing techniques for neural networks that allow the network to focus on specific aspects of complex input, one at a time, until the entire dataset is categorized. The goal is to break down complicated tasks into smaller areas of attention that are processed sequentially. In broad strokes, an attention model works as follows: attention is expressed as a function that maps a query and a set of key–value pairs to an output, in which the query, keys, values, and final output are all vectors. The output is then calculated as a weighted sum of the values, with the weight assigned to each value expressed by a compatibility function of the query with the corresponding key.
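As a minimal NumPy sketch of the mapping just described (our illustration, not code from the paper), the function below uses a dot product as the compatibility function and returns the weighted sum of the values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Map a query and a set of key-value pairs to a weighted sum of the values."""
    scores = keys @ query       # compatibility of the query with each key
    weights = softmax(scores)   # normalized attention weights
    return weights @ values     # output: weighted sum of the values

rng = np.random.default_rng(0)
query = rng.normal(size=4)            # query, keys, and values are all vectors
keys = rng.normal(size=(5, 4))        # five key-value pairs
values = rng.normal(size=(5, 4))
print(attend(query, keys, values))
```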
Neural Machine Translation: This machine translation technique uses artificial neural networks to predict the probability of a sequence of terms, typically modeling entire sentences in a single integrated model. In recent years, technology using neural networks has been used to solve problems in a variety of ways.

In the natural speech processing area, the use of neural machine translation (NMT) is an example of this. Missing translation is the phenomenon in which text that was present in the source is missing from the output, in terms of context or word translation. Neural machine translation is the use of a neural network to learn a mathematical model for machine translation. The key benefit of the methodology is that a single framework can be trained directly on the source and target text, which no longer requires the pipeline of complex systems used in statistical machine learning [5].
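A minimal sketch of such a single trainable encoder–decoder framework is shown below, using Keras GRU layers (the recurrent unit the paper adopts in Section 4.1). The vocabulary sizes and dimensions are placeholder assumptions, and this is an illustration rather than the authors' implementation.

```python
import tensorflow as tf

vocab_in, vocab_out, dim, units = 8000, 8000, 256, 512  # placeholder sizes

class Encoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_in, dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, tokens):
        # Returns per-step outputs (usable by attention) and the final hidden state.
        return self.gru(self.embed(tokens))

class Decoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_out, dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.out = tf.keras.layers.Dense(vocab_out)

    def call(self, tokens, state):
        # The decoder starts from the encoder's final state and predicts
        # a distribution over target words at each step.
        x, state = self.gru(self.embed(tokens), initial_state=state)
        return self.out(x), state

enc_out, enc_state = Encoder()(tf.zeros((1, 10), tf.int32))
logits, _ = Decoder()(tf.zeros((1, 1), tf.int32), enc_state)
```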
The main categories of recognizable speech are as follows:

· Connected Speech: Connected words (or connected speech) are similar to isolated utterances, except that they allow separate utterances to run together with only brief pauses between them.

· Continuous Speech: Continuous speech allows the user to speak almost naturally; this is also called computer dictation.

· Spontaneous Speech: At a simple level, this can be viewed as speech that is natural-sounding and not rehearsed. An ASR device with spontaneous speech abilities should be able to accommodate a variety of natural speech features, such as sentences that run together and that include "ums" and "ahs," and even slight stutters.
Machine Translation: Machine translation typically models entire sentences in a single integrated model through the use of an artificial neural network that predicts the sequence of words. Initially, word sequence modeling was usually carried out using a recurrent neural network (RNN). Unlike the traditional phrase-based translation method, which consists of many small subcomponents that are tuned separately, neural machine translation builds and trains a single, broad neural network that reads a phrase and outputs the correct translation. Because only one model is needed for translation, such end-to-end systems are called neural machine translation systems. The transfer of scientific, metaphysical, literary, commercial, political, and artistic knowledge across linguistic barriers is an integral and essential component of human endeavor [4]. Translation is more prevalent and available today than ever before. Organizations with larger budgets may choose to hire a translation company or independent professional translators to manage all their translation needs; organizations with smaller budgets, or that deal in subjects that are unknown to many translators, may choose to combine the services of professional translators.

Fig. 2. Recognition Process.

2. Literature Review

Mehmet Berkehan Akçay et al. [1] explained that neural networks were once mainly limited to industrial control and robotics applications. However, recent advances in neural networks, through the introduction of intelligent travel, intelligent diagnosis and health monitoring for precision medicine, robotics and home appliance automation, virtual online support, e-marketing, and weather forecasting and natural disaster management, among others, have contributed to successful IS implementations in almost every aspect of human life.

G. Tsontzos et al. [2] clarified how feelings allow us to better understand each other, and a natural consequence is to expand this understanding to computers. Thanks to smart mobile devices capable of accepting and responding to voice commands with synthesized speech, speech recognition is now part of our daily lives. To allow devices to detect our emotions, speech emotion recognition (SER) could be used.

T. Taleb et al. [7] said they were motivated by the understanding that these standards place upper bounds on the improvement that can be achieved when using HMMs in speech recognition. In an attempt to improve robustness, particularly under noisy conditions, new modeling schemes that can explicitly model time are being explored; that work was partially funded by the EU-IST FP6 HIWIRE research project. Spatial similarities, including linear dynamic models (LDM), were initially proposed for use in speech recognition.

Vinícius Maran et al. [6] explained that learning speech is a dynamic mechanism in which the processing of phonemes is marked by continuities and discontinuities in the path of the infant towards the advanced production of ambient language segments and structures.

Y. Wu et al. [3] noted that discriminative training has been used for speech recognition for many years now. The few organizations that have had the resources to implement discriminative training for large-scale speech recognition assignments have mostly used the maximum mutual information (MMI) criterion in recent years. Instead, in an extension of the studies first presented, they reflect on the minimum classification error (MCE) paradigm for discriminative training.

Peng et al. [4] stated that speaker identification refers to identifying people by their voice. This technology is increasingly adopted and used as a kind of biometrics for its ease of use and non-interactivity, and it soon became a research hotspot in the field of biometrics.

Shahnawazuddin and Sinha [10] discussed how the work presented was an extension of current quick adaptation approaches based on acoustic model interpolation. The basis (model) weights are calculated in these methods using an iterative process based on the maximum likelihood (ML) criterion.

Varghese et al. [5] stated there are many ways to understand feelings from expression. Many attempts have been made to identify emotional states from vocal information. To understand feelings, some essential voice feature vectors have been picked out, over which utterance-level statistics are measured.

D. J. Atha et al. [9] pointed out that a long-term target is the creation of an automated real-time translation device where the voice is the source. Recent developments in the area of computational translation science, however, boost the possibility of widespread adoption in the near future.

3. Proposed Method

Fig. 3. Working Model of Speech Recognition.

In this research, the work is based on the flowchart in Fig. 3, the working model of speech recognition. The models illustrated previously are made up of millions of parameters, which need to be learned from the training corpus. We make use of additional information where appropriate, such as text that is closely linked to the speech we are about to translate [7]. It is possible to write this text in the source language, the target language, or both.

Future development will reach billions of smartphone users with the most complex intelligent systems focused on deep learning. There is a lengthy list of vision and voice technologies that can increasingly simplify and assist the visual and auditory processing of humans at greater scale and consistency, from sensation and emotion detection to the development of self-driving autonomous transport systems. This paper serves scholars, clinicians, technology creators, and consumers as an exemplary analysis of emerging technologies in many fields, such as behavioral science, psychology, transportation, and medicine.

4. Results

Voice detection with a real-time predictive voice translation device, optimized using multimodal vector sources of information and functionality, was presented. The key contribution of this work is the manner in which external information input is used to increase the system's accuracy, thereby allowing a notable improvement compared to natural processes. In addition, a new initiative was launched from an analytical standpoint, while remaining a realistic one, and was discussed. As per our discussion and planning, the system we want converts Hindi to English and vice versa.

4.1 Initial Test

Fig. 4. The Attention Model.

Attention is proposed to address the constraint of the encoder–decoder model, which encodes the input sequence into one fixed-length vector from which each output time step is decoded. This problem is believed to be more of a concern with long sequences, for which attention is proposed as a strategy to both align and translate. Instead of encoding the input sequence into a single fixed-context vector, the attention model produces a context vector that is filtered independently for each output time step. As with the encoder–decoder approach, the method is applied to a machine translation problem, and uses GRU units rather than LSTM memory cells [9]. In this case, bidirectional input is used, where both forward and backward input sequences are given and then concatenated before being passed to the decoder. The input is fed into an encoder model that gives us the encoder output and the encoder hidden state.

Facilitated communication (FC), or supported typing, is a scientifically discredited technique that attempts to aid communication by people with autism or other communication disabilities who are non-verbal. The facilitator guides the disabled person's arm or hand, and attempts to help them type on a keyboard or other device.

The calculations applied are:

FC = fully connected (Dense) layer
EO = encoder output
H = hidden state
X = input to the decoder

And with the pseudo-code:

score = FC(tanh(FC(EO) + FC(H)))
attention weights = softmax(score, axis = 1)
context vector = sum(attention weights * EO, axis = 1)
embedding output = the decoder input passed through an embedding layer
merged vector = concat(embedding output, context vector)

SoftMax is applied on the last axis by default, but we want to apply it on axis 1 here, because the shape of the score is (batch size, max length, hidden size), and max length is the length of our input. Since we are attempting to assign a weight to each input time step, it is important to apply SoftMax on that axis. The same explanation applies to the axis selection of 1 in the context-vector sum.
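One plausible realization of this pseudo-code is the additive attention layer below, written with the FC/EO/H notation from above. It assumes EO has shape (batch size, max length, hidden size) and H has shape (batch size, hidden size), and is a sketch in the spirit of the text rather than the authors' exact code.

```python
import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # FC applied to EO
        self.W2 = tf.keras.layers.Dense(units)  # FC applied to H
        self.V = tf.keras.layers.Dense(1)       # final FC producing the score

    def call(self, H, EO):
        # H: (batch, hidden) -> (batch, 1, hidden) so it broadcasts over time.
        H_time = tf.expand_dims(H, 1)
        # score = FC(tanh(FC(EO) + FC(H))): shape (batch, max length, 1).
        score = self.V(tf.nn.tanh(self.W1(EO) + self.W2(H_time)))
        # SoftMax over axis 1 (max length), as argued in the text.
        attention_weights = tf.nn.softmax(score, axis=1)
        # context vector = sum(attention weights * EO, axis = 1).
        context = tf.reduce_sum(attention_weights * EO, axis=1)
        return context, attention_weights

context, weights = AdditiveAttention(10)(tf.zeros((2, 16)), tf.zeros((2, 6, 16)))
```

The merged vector above would then be formed by concatenating this context vector with the embedded decoder input before it enters the decoder GRU.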

Fig. 5. First Translation from English to Hindi.

Fig. 6. Second Translation from English to Hindi.

4.2 Translate

The evaluation function is similar to the training loop, except that we do not use teacher forcing here. At each time step, the inputs to the decoder are its previous predictions, along with the hidden state and the encoder output. We stop predicting when the model predicts the end token, and we store the attention weights for every time step, as sketched below.

Fig. 7. Third Translation from English to Hindi.
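A sketch of such an evaluation loop is given below. The encoder, decoder, start_id, and end_id names are our assumptions (the decoder here is assumed to also consume the encoder output and return its attention weights, as in the attention sketch above); this is an illustration, not the authors' released code.

```python
import tensorflow as tf

def translate(sentence_ids, encoder, decoder, start_id, end_id, max_len=40):
    """Greedy decoding without teacher forcing (hypothetical objects)."""
    enc_out, state = encoder(tf.constant([sentence_ids]))
    token = tf.constant([[start_id]])
    result, attentions = [], []
    for _ in range(max_len):
        # The decoder consumes its previous prediction, the hidden state,
        # and the encoder output, and returns attention weights as well.
        logits, state, weights = decoder(token, state, enc_out)
        next_id = int(tf.argmax(logits[0, -1]))
        if next_id == end_id:              # stop once the end token is predicted
            break
        result.append(next_id)
        attentions.append(weights)         # store attention weights per time step
        token = tf.constant([[next_id]])   # feed the prediction back in
    return result, attentions
```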
A translator is a programming language processor that converts a computer program from one language to another. It takes a program written in source code and converts it into machine code, discovering and identifying errors during translation. It translates a high-level language program into a machine language program that the central processing unit (CPU) can understand, and it also detects errors in the program.

5. Conclusion

In the past few years, the complexity and precision of speech recognition applications have evolved exponentially. This paper extensively explores recent advancements in intelligent vision and speech algorithms, their applications on the most popular smartphones and embedded platforms, and the limitations of those applications. In spite of immense advances in the success and efficacy of deep learning algorithms, training the machine with other knowledge sources, beyond the framework itself, also contributes significantly to the subject class.

6. Future Scopes

This work can be explored in greater depth in order to improve it and incorporate new functionality into the project, and it can be developed further. The current software does not accommodate a broad vocabulary, which limits its ability to accumulate a larger number of samples and maximize productivity [10]. Only a few parts of the notepad are covered by the current edition of the app, but more areas can be covered, and efforts will be made in this respect.

Acknowledgement

The authors would like to express their sincere thanks to the editor-in-chief for his valuable suggestions to improve this article.

References

[1] Mehmet Berkehan Akçay and Kaya Oğuz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Communication, vol. 116, 2020, pp. 56-76, ISSN 0167-6393.
[2] G. Tsontzos, V. Diakoloukas, C. Koniaris, and V. Digalakis, "Estimation of general identifiable linear dynamic models with an application in speech characteristics vectors," Computer Standards & Interfaces, vol. 35, no. 5, 2013, pp. 490-506, ISSN 0920-5489.
[3] Y. Wu et al., "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," arXiv preprint arXiv:1609.08144, pp. 1-23, 2016.
[4] Shuping Peng, Tao Lv, Xiyu Han, Shisong Wu, Chunhui Yan, and Heyong Zhang, "Remote speaker recognition based on the enhanced LDV-captured speech," Applied Acoustics, vol. 143, 2019, pp. 165-170, ISSN 0003-682X.
[5] A. A. Varghese, J. P. Cherian, and J. J. Kizhakkethottam, "Overview on emotion recognition system," 2015 International Conference on Soft-Computing and Networks Security (ICSNS), Coimbatore, 2015, pp. 1-5.
[6] Vinícius Maran and Marcia Keske-Soares, "Towards a speech therapy support system based on phonological processes early detection," Computer Speech & Language, vol. 65, 2021, 101130, ISSN 0885-2308.
[7] T. Taleb, K. Samdanis, B. Mada, H. Flinck, S. Dutta, and D. Sabella, "On Multi-Access Edge Computing: A Survey of the Emerging 5G Network Edge Cloud Architecture and Orchestration," IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1657-1681, 2017.
[8] J. C. Amengual, A. Castaño, A. Castellanos, V. M. Jiménez, D. Llorens, A. Marzal, F. Prat, J. M. Vilar, J. M. Benedí, F. Casacuberta, M. Pastor, and E. Vidal, "The EuTrans spoken language translation system," Machine Translation, vol. 15, 2000, pp. 75-103.
[9] D. J. Atha and M. R. Jahanshahi, "Evaluation of deep learning approaches based on convolutional neural networks for corrosion detection," Structural Health Monitoring, vol. 17, no. 5, 2018, pp. 1110-1128.
[10] S. Shahnawazuddin and Rohit Sinha, "Sparse coding over redundant dictionaries for fast adaptation of speech recognition system," Computer Speech & Language, vol. 43, 2017, pp. 1-17, ISSN 0885-2308.

Satya Prakash Yadav is currently on the faculty of the Information Technology Department, ABES Institute of Technology (ABESIT), Ghaziabad (India). A seasoned academician with more than 13 years of experience, he has published three books (Programming in C, Programming in C++, and Blockchain and Cryptocurrency) under I.K. International Publishing House Pvt. Ltd. He has undergone industrial training programs during which he was involved in live projects with companies in the areas of SAP, railway traffic management systems, and visual vehicle counting and classification (used in the Metro rail network design). He is an alumnus of Netaji Subhas Institute of Technology (NSIT), Delhi University. A prolific writer, Mr. Yadav has filed two patents and authored many research papers in Web of Science indexed journals. Additionally, he has presented research papers at many conferences in areas of image processing and programming, such as image processing, feature extraction, and information retrieval. He is also a lead editor with CRC Press, Taylor and Francis Group (U.S.A.), Science Publishing Group (U.S.A.), and Eureka Journals, Pune (India).

Vineet Vashisht is currently a research scholar in the Information Technology Department at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. Vineet Vashisht is supervised by Asst. Prof. Satya Prakash Yadav of the Information Technology Department, ABES Institute of Technology (ABESIT).

Aditya Kumar Pandey is currently a research scholar in the Information Technology Department at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. Aditya Kumar Pandey is supervised by Asst. Prof. Satya Prakash Yadav of the Information Technology Department, ABES Institute of Technology (ABESIT).

Copyrights © 2021 The Institute of Electronics and Information Engineers
