Speech Recognition and Natural Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 31 May 2025 | Viewed by 7229

Special Issue Editors


Dr. Asad Abdi
Guest Editor
Department of Computing and Mathematics, Faculty of Science and Engineering, University of Derby, Derby DE22 1GB, UK
Interests: artificial intelligence (AI); natural language processing (NLP)

Prof. Dr. Farid Meziane
Guest Editor
Department of Computing and Mathematics, Faculty of Science and Engineering, University of Derby, Derby DE22 1GB, UK
Interests: artificial intelligence; natural language processing (NLP)

Special Issue Information

Dear Colleagues,

Speech Recognition (SR) and Natural Language Processing (NLP) have emerged as two of the most transformative fields in artificial intelligence. This Special Issue explores the latest advances and open challenges at their intersection. As the demand for intelligent systems capable of understanding and processing human language continues to rise, researchers are increasingly focused on developing innovative algorithms, models, and applications in these domains. This Special Issue provides a platform for scholars and practitioners to disseminate cutting-edge research findings, methodologies, and insights, fostering collaboration and driving progress in this rapidly evolving field.

Topics of interest include, but are not limited to, the following:

  • Automatic Speech Recognition (ASR) systems;
  • Natural Language Understanding (NLU) and interpretation;
  • Speech synthesis and generation;
  • Sentiment analysis and opinion mining;
  • Dialogue systems and conversational interfaces;
  • Machine translation and cross-lingual NLP;
  • Voice user interfaces (VUIs) and intelligent assistants;
  • Language modeling and representation learning;
  • End-to-end speech-to-text and text-to-speech systems;
  • Speech and language applications.

We invite original research contributions, review articles, case studies, and surveys that advance the state of the art in Speech Recognition and Natural Language Processing. Submissions should present novel methodologies, experimental results, theoretical insights, or practical applications that contribute to the development and understanding of these critical areas.

Dr. Asad Abdi
Prof. Dr. Farid Meziane
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • automatic speech recognition
  • natural language understanding
  • sentiment analysis
  • machine translation
  • voice user interfaces
  • speech-to-text
  • text-to-speech
  • dialogue systems
  • conversational AI
  • spoken language understanding
  • language modeling

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)


Research

21 pages, 1536 KiB  
Article
Deep Learning Classification of Traffic-Related Tweets: An Advanced Framework Using Deep Learning for Contextual Understanding and Traffic-Related Short Text Classification
by Wasen Yahya Melhem, Asad Abdi and Farid Meziane
Appl. Sci. 2024, 14(23), 11009; https://doi.org/10.3390/app142311009 - 27 Nov 2024
Viewed by 770
Abstract
Classifying social media (SM) messages into relevant or irrelevant categories is challenging due to data sparsity, imbalance, and ambiguity. This study aims to improve Intelligent Transport Systems (ITS) by enhancing short text classification of traffic-related SM data. Deep learning methods such as RNNs, CNNs, and BERT are effective at capturing context, but they can be computationally expensive, struggle with very short texts, and perform poorly with rare words. On the other hand, transfer learning leverages pre-trained knowledge but may be biased towards the pre-training domain. To address these challenges, we propose DLCTC, a novel system combining character-level, word-level, and context features with BiLSTM and TextCNN-based attention. By utilizing external knowledge, DLCTC ensures an accurate understanding of concepts and abbreviations in traffic-related short texts. BiLSTM captures context and term correlations; TextCNN captures local patterns. Multi-level attention focuses on important features across character, word, and concept levels. Experimental studies demonstrate DLCTC’s effectiveness over well-known short-text classification approaches based on CNN, RNN, and BERT.
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
Figures:
  • Figure 1: The overall framework of the proposed method.
  • Figure 2: Character embedding framework.
  • Figure 3: Word embedding framework.
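
For readers who want to experiment with the general idea, the sketch below shows a minimal PyTorch model in the spirit of DLCTC: a word-level BiLSTM branch and a character-level CNN branch fused by a simple attention pooling layer for binary short-text classification. The hyperparameters, vocabulary sizes, and fusion strategy are illustrative assumptions, not the authors' exact configuration, which additionally uses concept-level features from external knowledge.

```python
# Minimal sketch (assumed architecture, not the published DLCTC model):
# word-level BiLSTM + character-level CNN + attention pooling.
import torch
import torch.nn as nn


class ShortTextClassifier(nn.Module):
    def __init__(self, word_vocab=20000, char_vocab=100,
                 word_dim=128, char_dim=32, hidden=64, n_classes=2):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # Word branch: BiLSTM captures context and term correlations.
        self.bilstm = nn.LSTM(word_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Character branch: 1-D convolutions capture local sub-word patterns.
        self.char_cnn = nn.Conv1d(char_dim, 2 * hidden, kernel_size=3, padding=1)
        # Attention pooling over the concatenated sequence of features.
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, word_ids, char_ids):
        w = self.bilstm(self.word_emb(word_ids))[0]                      # (B, Tw, 2H)
        c = self.char_cnn(self.char_emb(char_ids).transpose(1, 2)).transpose(1, 2)
        feats = torch.cat([w, torch.relu(c)], dim=1)                     # (B, Tw+Tc, 2H)
        weights = torch.softmax(self.attn(feats), dim=1)                 # attention weights
        pooled = (weights * feats).sum(dim=1)                            # weighted sum
        return self.out(pooled)


if __name__ == "__main__":
    model = ShortTextClassifier()
    words = torch.randint(1, 20000, (4, 20))   # batch of 4 tweets, 20 tokens each
    chars = torch.randint(1, 100, (4, 80))     # same tweets, 80 characters each
    print(model(words, chars).shape)           # torch.Size([4, 2])
```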
20 pages, 5718 KiB  
Article
Whispered Speech Recognition Based on Audio Data Augmentation and Inverse Filtering
by Jovan Galić, Branko Marković, Đorđe Grozdić, Branislav Popović and Slavko Šajić
Appl. Sci. 2024, 14(18), 8223; https://doi.org/10.3390/app14188223 - 12 Sep 2024
Cited by 1 | Viewed by 1990
Abstract
Modern Automatic Speech Recognition (ASR) systems are primarily designed to recognize normal speech. Due to a considerable acoustic mismatch between normal speech and whisper, ASR systems suffer from a significant loss of performance in whisper recognition. Creating large databases of whispered speech is expensive and time-consuming, so research studies explore the synthetic generation using pre-existing normal or whispered speech databases. The impact of standard audio data augmentation techniques on the accuracy of isolated-word recognizers based on Hidden Markov Models (HMM) and Convolutional Neural Networks (CNN) is examined in this research study. Furthermore, the study explores the potential of inverse filtering as an augmentation strategy for producing pseudo-whisper speech. The Whi-Spe speech database, containing recordings in normal and whisper phonation, is utilized for data augmentation, while the internally recorded speech database, developed specifically for this study, is employed for testing purposes. Experimental results demonstrate statistically significant improvement in performance when employing data augmentation strategies and inverse filtering.
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
Figures:
  • Figure 1: The waveform (a) and spectrogram (b) of the phrase “Govor šapata” spoken in Serbian (normal in capital letters; whisper in small letters). The horizontal axis represents time in seconds.
  • Figure 2: The architecture of sequential audio augmentation for one (a) and N (b) augmentations of the input audio signal.
  • Figure 3: Example of inverse filtering on a word in normal speech: (a) FFT spectrum of the word, (b) LPC spectral envelope, (c) FFT spectrum after inverse filtering, and (d) frequency response of the inverse filter IF(z).
  • Figure 4: The flow diagram for the generation of augmented datasets using various augmentation techniques and their combinations.
  • Figure 5: The flow diagram for the generation of augmented datasets by varying the number of augmentations using a single augmentation technique.
  • Figure 6: The topology of HMM models.
  • Figure 7: Training history of the model (blue: training loss; orange: validation loss).
  • Figure 8: The segmentation of utterances using the fixed overlap factor.
  • Figure 9: The flow of the recognition process in (a) baseline experiments and (b) experiments testing the impact of data augmentation techniques and the number of augmentations.
  • Figure 10: The average recognition accuracy (in %) for the Whi-Spe (closed set) and DBtest (open set) databases in the HMM framework; the horizontal axis denotes the percentage of the Whi-Spe subset employed in training.
  • Figure 11: The average recognition accuracy (in %) for the Whi-Spe (closed set) and DBtest (open set) databases in the CNN framework; the horizontal axis denotes the percentage of the Whi-Spe subset employed in training.
  • Figure 12: Average training time (a) and Real Time Factor (b) for HMM and CNN recognizers, with the corresponding accuracies given in parentheses.
  • Figure 13: Average recognition accuracy vs. the number of augmentations for HMM and CNN recognizers.
  • Figure 14: Relative WER improvement compared to the baseline.
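
The inverse-filtering idea can be prototyped with standard signal-processing tools. The sketch below is a rough approximation rather than the authors' pipeline: it estimates an LPC envelope with librosa and applies the corresponding analysis (inverse) filter IF(z) to a normally phonated signal, flattening the spectral envelope and removing much of the voicing structure. The LPC order, whole-signal (rather than frame-wise) processing, and energy normalization are assumptions.

```python
# Rough pseudo-whisper augmentation via LPC inverse filtering (assumptions:
# LPC order, whole-signal processing, simple energy rescaling).
import librosa
import numpy as np
from scipy.signal import lfilter


def pseudo_whisper(y: np.ndarray, lpc_order: int = 16) -> np.ndarray:
    """Apply LPC inverse filtering to a mono speech signal."""
    a = librosa.lpc(y, order=lpc_order)          # analysis filter coefficients
    residual = lfilter(a, [1.0], y)              # inverse filter IF(z) = A(z)
    # Rescale so the augmented signal has roughly the original energy.
    return residual * (np.std(y) / (np.std(residual) + 1e-8))


if __name__ == "__main__":
    # Synthetic stand-in for a recorded word (a windowed low-frequency tone).
    sr = 16000
    t = np.arange(sr) / sr
    y = (np.sin(2 * np.pi * 120 * t) * np.hanning(sr)).astype(np.float32)
    print(pseudo_whisper(y).shape)
```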
17 pages, 543 KiB  
Article
Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding
by Minsoo Kim and Gil-Jin Jang
Appl. Sci. 2024, 14(18), 8138; https://doi.org/10.3390/app14188138 - 10 Sep 2024
Viewed by 1614
Abstract
Automatic speech recognition (ASR) aims at understanding naturally spoken human speech to be used as text inputs to machines. In multi-speaker environments, where multiple speakers are talking simultaneously with a large amount of overlap, a significant performance degradation may occur with conventional ASR systems if they are trained by recordings of single talkers. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. The embedding information for each of the speakers in the training set was extracted as numeric vectors, and all of the embedding vectors were stacked to construct a total speaker profile matrix. The speaker profile matrix from the training dataset enables finding embedding vectors that are close to the speakers of the input recordings in the test conditions, and it helps to recognize the individual speakers’ voices mixed in the input. Furthermore, the proposed method efficiently reuses the decoder from the existing speaker-independent ASR model, eliminating the need for retraining the entire system. Various speaker embedding methods such as i-vector, d-vector, and x-vector were adopted, and the experimental results show 0.33% and 0.95% absolute (3.9% and 11.5% relative) improvements without and with the speaker profile in the word error rate (WER).
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
Figures:
  • Figure 1: Multi-speaker speech recognition problem illustration. The voices of two independent speakers are recorded by a single microphone, denoted by x. A multi-speaker speech recognition system generates two or more word sequences, denoted by y(1) and y(2), where the parenthesized superscripts are speaker indices from the given audio recordings of overlapped speakers.
  • Figure 2: Four types of conventional multi-speaker automatic speech recognition methods: (a) a combination of acoustic source separation and single-input mixed speech, single-output text (SISO) ASR; (b) single-input mixed speech, multiple-output text (SIMO) ASR; (c) the addition of speaker embedding vectors as an additional input to a SIMO ASR; and (d) the addition of an encoder that splits multiple speakers into multiple representations, with encoder outputs as speaker and text embedding vectors suited to SISO ASR decoders.
  • Figure 3: Overview of the proposed speaker-attributed training ASR system. The gray blocks were trained on single-speaker recordings and kept fixed while the white blocks were trained on multi-speaker recordings. The same Enc_rec and Dec_rec are used with different inputs (marked *shared); the boxed parts (**) require fine-tuning with multi-speaker utterances.
  • Figure 4: Overview of the SAT-ASR system when using speaker profiles. The speaker embedding vector q_mix is passed through an additional Attention_speaker block and then sent to Enc_mix; P is a profile matrix composed of the speaker embedding vectors obtained from the training dataset, and β is the set of computed attention weights.
  • Figure 5: Comparison of the number of profile utterances on the LibriMix dataset by WER.
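
The attention over a speaker profile matrix can be illustrated in a few lines. In the sketch below, a mixture embedding q_mix attends over the rows of a profile matrix P via dot-product similarity and returns the weighted embedding together with the attention weights β. The dimensions and the parameter-free dot-product scoring are assumptions; in the paper the Attention_speaker block is trained as part of the full system.

```python
# Minimal attention-weighted speaker embedding over a profile matrix
# (assumed dimensions and dot-product scoring, not the trained block).
import torch
import torch.nn.functional as F


def attention_weighted_embedding(q_mix, profile):
    """q_mix: (B, D) embeddings of mixtures; profile: (N, D) speaker profile matrix P."""
    scores = q_mix @ profile.T / profile.shape[-1] ** 0.5   # (B, N) similarity
    beta = F.softmax(scores, dim=-1)                        # attention weights
    return beta @ profile, beta                             # (B, D), (B, N)


if __name__ == "__main__":
    profile = F.normalize(torch.randn(200, 192), dim=-1)    # e.g. 200 x-vector-sized rows
    q_mix = F.normalize(torch.randn(2, 192), dim=-1)        # two test mixtures
    emb, beta = attention_weighted_embedding(q_mix, profile)
    print(emb.shape, beta.shape)                            # (2, 192) (2, 200)
```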
13 pages, 4133 KiB  
Article
Gender Recognition Based on the Stacking of Different Acoustic Features
by Ergün Yücesoy
Appl. Sci. 2024, 14(15), 6564; https://doi.org/10.3390/app14156564 - 27 Jul 2024
Viewed by 1268
Abstract
A speech signal can provide various information about a speaker, such as their gender, age, accent, and emotional state. The gender of the speaker is the most salient piece of information contained in the speech signal and is directly or indirectly used in many applications. In this study, a new approach is proposed for recognizing the gender of the speaker based on the use of hybrid features created by stacking different types of features. For this purpose, five different features, namely Mel frequency cepstral coefficients (MFCC), Mel scaled power spectrogram (Mel Spectrogram), Chroma, Spectral contrast (Contrast), and Tonal Centroid (Tonnetz), and twelve hybrid features created by stacking these features were used. These features were applied to four different classifiers, two of which were based on traditional machine learning (KNN and LDA) while two were based on the deep learning approach (CNN and MLP), and the performance of each was evaluated separately. In the experiments conducted on the Turkish subset of the Common Voice dataset, it was observed that hybrid features, created by stacking different acoustic features, led to improvements in gender recognition accuracy ranging from 0.3% to 1.73%.
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
Figures:
  • Figure 1: Mel Spectrogram features extracted from female (left) and male (right) speech.
  • Figure 2: MFCC features extracted from female (left) and male (right) speech.
  • Figure 3: Chroma features extracted from a female voice (left) and a male voice (right).
  • Figure 4: Contrast features extracted from a female voice (left) and a male voice (right).
  • Figure 5: Tonnetz features extracted from a female voice (left) and a male voice (right).
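
The feature-stacking idea can be reproduced with librosa and scikit-learn. The sketch below extracts the five acoustic features named in the abstract, averages each over time, concatenates them into one hybrid vector, and feeds the result to a KNN classifier on dummy data. The feature parameters, the time-averaging, and the classifier settings are illustrative assumptions rather than the paper's configuration.

```python
# Hybrid feature stacking for gender recognition (assumed parameters and
# pooling; dummy labels stand in for the Common Voice Turkish subset).
import librosa
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def hybrid_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Extract MFCC, Mel Spectrogram, Chroma, Contrast, and Tonnetz, then stack."""
    stft = np.abs(librosa.stft(y))
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
        librosa.feature.melspectrogram(y=y, sr=sr),
        librosa.feature.chroma_stft(S=stft, sr=sr),
        librosa.feature.spectral_contrast(S=stft, sr=sr),
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
    ]
    # Average each feature over time and concatenate into one hybrid vector.
    return np.concatenate([f.mean(axis=1) for f in feats])


if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    X = np.stack([hybrid_features(rng.standard_normal(sr).astype(np.float32), sr)
                  for _ in range(8)])            # 8 one-second dummy utterances
    y = np.array([0, 1] * 4)                     # dummy female/male labels
    print(KNeighborsClassifier(n_neighbors=3).fit(X, y).score(X, y))
```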