Computational Linguistics: From Text to Speech Technologies

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 April 2025 | Viewed by 4749

Special Issue Editors


Guest Editor: Prof. Dr. Gloria Corpas Pastor
Research Institute on Multilingual Language Technologies, Department of Translation and Interpreting, University of Malaga, 29016 Málaga, Spain
Interests: corpus linguistics; machine interpreting; speech-to-text; translation and interpreting technologies; computational phraseology

Guest Editor: Dr. Tharindu Ranasinghe
UCREL, Lancaster University, Lancaster LA1 4WA, UK
Interests: computational linguistics; natural language processing; machine translation; quality estimation

Special Issue Information

Dear Colleagues,

In recent years, advancements in machine learning, natural language processing, artificial intelligence, and speech synthesis have revolutionized how we communicate with other humans and language-based systems. From virtual assistants to language translation tools, the capabilities of these technologies continue to expand, offering new possibilities for communication, accessibility, and innovation.

This Special Issue serves as a platform to explore the latest research, methodologies, and applications driving the development of text and speech technologies, such as automatic speech recognition, machine interpreting, speech translation, and speech-to-text software, among others. It is intended for researchers, practitioners, and enthusiasts in the fields of computational linguistics, corpus linguistics, natural language processing, and machine learning. We invite research studies based on neural network architectures, large language models, linguistic modeling, AI-driven systems, and the intersection of linguistics and computer science (including multilingual communication). We also invite authors to address the challenges of applying these technologies in practical settings, in low-resource languages, and in specific domains.

Prof. Dr. Gloria Corpas Pastor
Dr. Tharindu Ranasinghe
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • artificial intelligence (AI)
  • automatic speech recognition (ASR)
  • machine interpreting (MI)
  • cascaded models
  • end-to-end models
  • speech-to-text (STT) modelling
  • speech translation
  • quality estimation
  • large language models (LLMs)

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on the MDPI website.

Published Papers (4 papers)


Research

17 pages, 1448 KiB  
Article
Fit for What Purpose? NER Certification of Automatic Captions in English and Spanish
by Pablo Romero-Fresco and Yanou Van Gauwbergen
Appl. Sci. 2025, 15(3), 1387; https://doi.org/10.3390/app15031387 - 29 Jan 2025
Viewed by 611
Abstract
As human and fully automatic live captioning methods coexist and compete against one another, quality analyses and certification become essential. A case in point is LiRICS, the Live Respeaking International Certification Standard created by the Galician Observatory for Media Accessibility (GALMA) to help maintain high international standards in the live captioning profession. Until now, this certification had only been used to assess human captioners. In this paper, it is applied for the first time to automatic captioning (more specifically to Lexi, the automatic software used by the leading captioning company AI-Media) in order to ascertain whether automatic captions have reached an accuracy level that can match that of human captions. After presenting the materials and the methods (NER model), the paper reports on the results of the analysis of Lexi’s English and Spanish automatic captions. With average accuracy rates of 98.56% in English and 98.26% in Spanish, these captions often manage to reach human levels of quality, except when applied to colloquial content featuring several speakers. A final discussion is devoted to a reflection on how automatic and human live captions can coexist as long as the different purposes they serve are considered, namely the access in bulk provided by automatic captions and the curated access offered by human captions.
Figures
  • Figure 1: Formula used by the NER model to calculate accuracy.
  • Figure 2: Bar graph of NER accuracy rate scores, with and without assessment of speaker IDs, for English (red) and Spanish (blue) subtitles.
  • Figure 3: Caption to Transcription Continuum for live captioning.
  • Figure 4: Classification of live captions along the Caption to Transcription Continuum.
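Figure 1 refers to the formula at the heart of the NER model: accuracy = (N − E − R) / N × 100, where N is the number of words in the captions, E the edition errors, and R the recognition errors, each error weighted by severity. The snippet below is a minimal sketch of that calculation, assuming the commonly cited severity weights (minor 0.25, standard 0.5, serious 1); the function name and the sample counts are illustrative, not taken from the paper.

```python
# Minimal sketch of the NER accuracy formula: (N - E - R) / N * 100.
# The severity weights (minor 0.25, standard 0.5, serious 1.0) follow the
# commonly cited NER literature; the sample counts below are invented.

SEVERITY_WEIGHTS = {"minor": 0.25, "standard": 0.5, "serious": 1.0}

def ner_accuracy(n_words, edition_errors, recognition_errors):
    """Return the NER accuracy rate in percent.

    n_words            -- number of words in the captions (N)
    edition_errors     -- severity labels of edition errors (E)
    recognition_errors -- severity labels of recognition errors (R)
    """
    e = sum(SEVERITY_WEIGHTS[s] for s in edition_errors)
    r = sum(SEVERITY_WEIGHTS[s] for s in recognition_errors)
    return (n_words - e - r) / n_words * 100

# Hypothetical caption file: 1500 words, a handful of weighted errors.
score = ner_accuracy(
    n_words=1500,
    edition_errors=["standard", "minor", "minor"],
    recognition_errors=["serious", "standard", "minor", "minor"],
)
print(f"NER accuracy: {score:.2f}%")  # 99.80% with these toy counts
```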
24 pages, 432 KiB  
Article
Sequence-to-Sequence Models and Their Evaluation for Spoken Language Normalization of Slovenian
by Mirjam Sepesy Maučec, Darinka Verdonik and Gregor Donaj
Appl. Sci. 2024, 14(20), 9515; https://doi.org/10.3390/app14209515 - 18 Oct 2024
Viewed by 785
Abstract
Sequence-to-sequence models have been applied to many challenging problems, including those in text and speech technologies. Normalization is one of them. It refers to transforming non-standard language forms into their standard counterparts. Non-standard language forms come from different written and spoken sources. This paper deals with one such source, namely speech from the less-resourced highly inflected Slovenian language. The paper explores speech corpora recently collected in public and private environments. We analyze the efficiencies of three sequence-to-sequence models for automatic normalization from literal transcriptions to standard forms. Experiments were performed using words, subwords, and characters as basic units for normalization. In the article, we demonstrate that the superiority of the approach is linked to the choice of the basic modeling unit. Statistical models prefer words, while neural network-based models prefer characters. The experimental results show that the best results are obtained with neural architectures based on characters. Long short-term memory and transformer architectures gave comparable results. We also present a novel analysis tool, which we use for in-depth error analysis of results obtained by character-based models. This analysis showed that systems with similar overall results can differ in the performance for different types of errors. Errors obtained with the transformer architecture are easier to correct in the post-editing process. This is an important insight, as creating speech corpora is a time-consuming and costly process. The analysis tool also incorporates two statistical significance tests: approximate randomization and bootstrap resampling. Both statistical tests confirm the improved results of neural network-based models compared to statistical ones.
Figures
  • Figure 1: Example HTML output with the most common errors (missing conversion marked in red, wrong conversion in green, and unwarranted conversion in blue).
  • Figure 2: Example HTML output with a deleted word error (W-del) due to a deleted space (missing conversion marked in red, unwarranted conversion in blue, and missing word in bright red).
  • Figure 3: Example HTML output with an inserted word error (W-ins) and a deleted word error (W-del).
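The analysis tool described in the abstract incorporates two significance tests, approximate randomization and bootstrap resampling, to compare the statistical and neural normalizers. As a rough illustration of the second test, the sketch below runs a generic paired bootstrap over per-sentence error counts for two systems; the function, data layout, and numbers are assumptions for illustration, not the authors' tool.

```python
import random

def paired_bootstrap(errors_a, errors_b, n_resamples=10_000, seed=0):
    """Paired bootstrap test: estimate how often system A has fewer total
    errors than system B when test sentences are resampled with replacement.

    errors_a, errors_b -- per-sentence error counts for the two systems,
                          aligned on the same test sentences.
    Returns the fraction of resamples in which A is strictly better.
    """
    assert len(errors_a) == len(errors_b)
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentence indices
        if sum(errors_a[i] for i in idx) < sum(errors_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Toy example: per-sentence errors for a neural (A) and a statistical (B)
# normalizer; the numbers are invented purely for illustration.
neural = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0]
statistical = [1, 2, 0, 3, 1, 1, 0, 4, 2, 1]
print(f"A better in {paired_bootstrap(neural, statistical):.1%} of resamples")
```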
14 pages, 433 KiB  
Article
Automatic Speech Recognition Advancements for Indigenous Languages of the Americas
by Monica Romero, Sandra Gómez-Canaval and Ivan G. Torre
Appl. Sci. 2024, 14(15), 6497; https://doi.org/10.3390/app14156497 - 25 Jul 2024
Cited by 1 | Viewed by 1381
Abstract
Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities in America. The Second AmericasNLP Competition Track 1 of NeurIPS 2022 proposed the task of training automatic speech recognition (ASR) systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa’ikhana. In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. We systematically investigate, using a Bayesian search, the impact of the different hyperparameters on the Wav2vec2.0 XLS-R variants of 300 M and 1 B parameters. Our findings indicate that data and detailed hyperparameter tuning significantly affect ASR accuracy, but language complexity determines the final result. The Quechua model achieved the lowest character error rate (CER) (12.14), while the Kotiria model, despite having the most extensive dataset during the fine-tuning phase, showed the highest CER (36.59). Conversely, with the smallest dataset, the Guarani model achieved a CER of 15.59, while Bribri and Wa’ikhana obtained, respectively, CERs of 34.70 and 35.23. Additionally, Sobol’ sensitivity analysis highlighted the crucial roles of freeze fine-tuning updates and dropout rates. We release our best models for each language, marking the first open ASR models for Wa’ikhana and Kotiria. This work opens avenues for future research to advance ASR techniques in preserving minority Indigenous languages.
Figures
  • Figure 1: Sketch of the dataset used for fine-tuning the ASR system, the CNN- and transformer-based wav2vec2.0 architecture, the fine-tuning process, the Bayesian hyperparameter search, and the Sobol’ sensitivity analysis.
  • Figure 2: Character error rates (CERs) for the five Indigenous language models (Kotiria, Wa’ikhana, Bribri, Guarani, and Quechua); lower bars indicate better model performance. The inner panel shows a Sobol’ sensitivity analysis of the hyperparameters tuned during training: orange bars give the total sensitivity (ST) index, green bars the first-order sensitivity (S1) index, and higher bars mark hyperparameters whose correct setting matters more during fine-tuning.
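The evaluation metric throughout this paper is the character error rate (CER), reported per language in Figure 2. For reference, the sketch below computes CER in the standard way, as the character-level edit distance divided by the length of the reference transcription; this is a generic implementation, not code released with the paper, and the example strings are invented.

```python
def levenshtein(ref, hyp):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 cost if match)
            ))
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate in percent: edit distance / reference length."""
    return 100 * levenshtein(reference, hypothesis) / max(len(reference), 1)

# Invented reference/hypothesis pair, for illustration only.
ref = "allinmi kachkani"
hyp = "alinmi kachkanii"
print(f"CER = {cer(ref, hyp):.2f}%")  # 12.50% for this toy pair
```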
17 pages, 791 KiB  
Article
Using Transfer Learning to Realize Low Resource Dungan Language Speech Synthesis
by Mengrui Liu, Rui Jiang and Hongwu Yang
Appl. Sci. 2024, 14(14), 6336; https://doi.org/10.3390/app14146336 - 20 Jul 2024
Viewed by 1193
Abstract
This article presents a transfer-learning-based method to improve the synthesized speech quality of the low-resource Dungan language. This improvement is accomplished by fine-tuning a pre-trained Mandarin acoustic model to a Dungan language acoustic model using a limited Dungan corpus within the Tacotron2+WaveRNN framework. Our method begins with developing a transformer-based Dungan text analyzer capable of generating unit sequences with embedded prosodic information from Dungan sentences. These unit sequences, along with the speech features, provide <unit sequence with prosodic labels, Mel spectrograms> pairs as the input of Tacotron2 to train the acoustic model. Concurrently, we pre-trained a Tacotron2-based Mandarin acoustic model using a large-scale Mandarin corpus. The model is then fine-tuned with a small-scale Dungan speech corpus to derive a Dungan acoustic model that autonomously learns the alignment and mapping of the units to the spectrograms. The resulting spectrograms are converted into waveforms via the WaveRNN vocoder, facilitating the synthesis of high-quality Mandarin or Dungan speech. Both subjective and objective experiments suggest that the proposed transfer learning-based Dungan speech synthesis achieves superior scores compared to models trained only with the Dungan corpus and other methods. Consequently, our method offers a strategy to achieve speech synthesis for low-resource languages by adding prosodic information and leveraging a similar, high-resource language corpus through transfer learning.
Figures
  • Figure 1: The framework of Tacotron2+WaveRNN-based Dungan speech synthesis.
  • Figure 2: Procedure of Dungan text analysis.
  • Figure 3: Structure of a Dungan character.
  • Figure 4: The framework of BLSTM_CRF-based Dungan prosodic boundary prediction; the input is a Dungan sentence with prosodic information.
  • Figure 5: The framework of Transformer-based Dungan character-to-unit conversion; the input is a Dungan sentence with prosodic information (left) and its corresponding Pinyin sequence (right), and the output is the Pinyin sequence with prosodic information.
  • Figure 6: Procedure of training the Dungan language acoustic model with transfer learning.
  • Figure 7: Average MOS scores of synthesized Dungan speech with 95% confidence intervals.
  • Figure 8: Average MOS scores of synthesized Mandarin speech with 95% confidence intervals.
  • Figure 9: Average DMOS scores of synthesized Dungan speech with 95% confidence intervals.
  • Figure 10: Average DMOS scores of synthesized Mandarin speech with 95% confidence intervals.
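Figures 7–10 report average MOS and DMOS scores with 95% confidence intervals. The sketch below shows one standard way such intervals are obtained, using a Student's t interval over individual listener ratings; the helper function and the ratings are illustrative assumptions, not the paper's evaluation code.

```python
import math
from statistics import mean, stdev
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Return (mean MOS, half-width of the t-based confidence interval)."""
    n = len(ratings)
    m = mean(ratings)
    sem = stdev(ratings) / math.sqrt(n)              # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return m, t_crit * sem

# Invented 5-point ratings from 20 listeners for one synthesized utterance.
ratings = [4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4, 4, 5, 4, 4, 3, 4, 5, 4, 4]
m, half = mos_with_ci(ratings)
print(f"MOS = {m:.2f} ± {half:.2f} (95% CI)")
```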