JLMS25 and Jiao-Liao Mandarin Speech Recognition Based on Multi-Dialect Knowledge Transfer
Figure 1. Dataset Construction Workflow.
Figure 2. Data statistics. (a) Percentage distribution of volunteers’ ages. (b) Percentage distribution of transcription topics in JLMS25.
Figure 3. The structure of MDKT.
Figure 4. The structure of the proposed WFAdapter (a) and AttAdapter (b).
Abstract
1. Introduction
- Dataset Compilation—We construct a comprehensive 25-hour speech recognition dataset tailored to Jiao-Liao Mandarin. The dataset includes a substantial number of idiomatic expressions, enriching the corpus with the dialect’s distinctive linguistic nuances.
- Framework Development—We design and implement an advanced speech recognition framework that augments the backbone model with the WFAdapter and AttAdapter modules. A three-phase training strategy addresses the unique challenges of Jiao-Liao Mandarin speech recognition, significantly improving the system’s accuracy and robustness on this dialect.
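The WFAdapter and AttAdapter details appear in Section 4.2; as a general illustration of the adapter-tuning idea the framework builds on, a residual bottleneck adapter (in the style of Houlsby et al. [15]) can be sketched in plain Python. The dimensions and weight matrices below are hypothetical toy values, not the paper’s configuration:

```python
import math

def elu(v):
    """Exponential linear unit, a common bottleneck nonlinearity."""
    return v if v > 0 else math.exp(v) - 1

def adapter(x, w_down, w_up):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project, skip-add.

    x: feature vector of length d
    w_down: d x r down-projection matrix (r << d)
    w_up:   r x d up-projection matrix
    Only w_down and w_up are trained; the backbone producing x stays frozen.
    """
    # down-projection into the r-dimensional bottleneck, then ELU
    h = [elu(sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_down)]
    # up-projection back to d dimensions
    y = [sum(hi * w for hi, w in zip(h, col)) for col in zip(*w_up)]
    # residual connection: the adapter learns a small correction to x
    return [xi + yi for xi, yi in zip(x, y)]

# Toy example: d = 2 features, bottleneck r = 1
x = [1.0, 2.0]
w_down = [[1.0], [0.0]]  # 2x1: bottleneck reads only the first feature
w_up = [[0.5, 0.0]]      # 1x2: writes half of it back to the first feature
print(adapter(x, w_down, w_up))  # → [1.5, 2.0]
```

Because only the small projection matrices are updated, each dialect can get its own adapter while the shared backbone remains untouched.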
2. Related Work
3. JLMS25 Dataset
- a collection of commonly used expressions and folk proverbs from the Jiao-Liao Mandarin-speaking region, reflecting the unique linguistic features and acoustic variations of the area. Examples of folk proverbs and their meanings are shown in Table 1.
- transcriptions from a subset of the AISHELL-1 [21] speech corpus, significantly broadening the thematic scope of the corpus to include topics such as ‘finance’, ‘technology’, ‘sports’, ‘entertainment’, and ‘news’.
4. Approach
4.1. Feature Extraction
4.2. Adapter Tuning
4.3. Three-Phase Training Strategy
5. Experiment
5.1. Datasets and Metrics
5.2. Experimental Setup
5.3. Main Results
5.4. Ablation Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Li, B.; Sainath, T.N.; Sim, K.C.; Bacchiani, M.; Weinstein, E.; Nguyen, P.; Chen, Z.; Wu, Y.; Rao, K. Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4749–4753. [Google Scholar]
- Yang, Y.; Xu, H.; Huang, H.; Chng, E.S.; Li, S. Speech-text based multi-modal training with bidirectional attention for improved speech recognition. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5. [Google Scholar]
- Yang, X.; Audhkhasi, K.; Rosenberg, A.; Thomas, S.; Ramabhadran, B.; Hasegawa-Johnson, M. Joint modeling of accents and acoustics for multi-accent speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1–5. [Google Scholar]
- Tang, Z.; Wang, D.; Xu, Y.; Sun, J.; Lei, X.; Zhao, S.; Wen, C.; Tan, X.; Xie, C.; Zhou, C.; et al. KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, Online, 6–14 December 2021. [Google Scholar]
- Liu, X.; Ye, S.; Fiumara, G.; De Meo, P. Influence Nodes Identifying Method via Community-based Backward Generating Network Framework. IEEE Trans. Netw. Sci. Eng. 2024, 11, 236–253. [Google Scholar] [CrossRef]
- Liu, X.; Miao, C.; Fiumara, G.; De Meo, P. Information Propagation Prediction Based on Spatial–Temporal Attention and Heterogeneous Graph Convolutional Networks. IEEE Trans. Comput. Soc. Syst. 2024, 11, 945–958. [Google Scholar] [CrossRef]
- Liu, X.; Feng, H.; Zhang, X.; Zhou, X.; Bouyer, A. Graph contrast learning for recommendation based on relational graph convolutional neural network. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102168. [Google Scholar] [CrossRef]
- Neubig, G.; Hu, J. Rapid Adaptation of Neural Machine Translation to New Languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 875–880. [Google Scholar]
- Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viegas, F.; Wattenberg, M.; Corrado, G. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [Google Scholar] [CrossRef]
- Abdel-Hamid, O.; Jiang, H. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7942–7946. [Google Scholar]
- Michel, P.; Neubig, G. Extreme Adaptation for Personalized Neural Machine Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 2, pp. 312–318. [Google Scholar]
- Bapna, A.; Arivazhagan, N.; Firat, O. Simple, Scalable Adaptation for Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1538–1548. [Google Scholar]
- Yoo, S.; Song, I.; Bengio, Y. A highly adaptive acoustic model for accurate multi-dialect speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5716–5720. [Google Scholar]
- Huang, J.T.; Li, J.; Yu, D.; Deng, L.; Gong, Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7304–7308. [Google Scholar]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
- Radhakrishnan, S.; Yang, C.-H.H.; Khan, S.A.; Kiani, N.A.; Gomez-Cabrero, D.; Tegner, J.N. A parameter-efficient learning approach to Arabic dialect identification with pre-trained general-purpose speech model. arXiv 2023, arXiv:2305.11244. [Google Scholar]
- Gu, Y.; Du, Z.; Zhang, S.; He, Y. Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech Recognition. In Proceedings of the Interspeech, Kos Island, Greece, 1–5 September 2024; pp. 2870–2874. [Google Scholar]
- Pham, N.Q.; Nguyen, T.N.; Stüker, S.; Waibel, A. Efficient Weight Factorization for Multilingual Speech Recognition. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 2421–2425. [Google Scholar]
- Leng, J.; Liu, W.; Guo, Q. Stock movement prediction model based on gated orthogonal recurrent units. Intell. Syst. Appl. 2022, 16, 200156. [Google Scholar] [CrossRef]
- Pham, N.Q.; Waibel, A.; Niehues, J. Adaptive multilingual speech recognition with pretrained models. arXiv 2022, arXiv:2205.12304. [Google Scholar]
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. arXiv 2017, arXiv:1709.05522. [Google Scholar]
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020. [Google Scholar]
- Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 328–339. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Fixing weight decay regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
| Text Content | Meaning |
|---|---|
| 草包打擂台风雨无阻 | A jest at the punctuality of a machine. |
| 亲侄儿不糙起棵小树儿 | The nephew has a deep affection for his aunt and uncle. |
| 强扭的瓜不甜 | A task done reluctantly will not be fulfilling. |
| Model | KeSpeech CER | KeSpeech WER | JLMS25 CER | JLMS25 WER | Mean CER | Mean WER |
|---|---|---|---|---|---|---|
| SHL-MDNN | 0.363 | 0.552 | 0.271 | 0.444 | 0.317 | 0.498 |
| JMAA | 0.682 | 0.895 | 0.619 | 0.852 | 0.651 | 0.879 |
| Full fine-tuning | 0.364 | 0.533 | 0.281 | 0.467 | 0.323 | 0.500 |
| Adapters | 0.314 | 0.473 | 0.258 | 0.438 | 0.286 | 0.455 |
| MDKT | 0.310 | 0.472 | 0.204 | 0.359 | 0.257 | 0.411 |
| Model | KeSpeech CER | KeSpeech WER | JLMS25 CER | JLMS25 WER | Mean CER | Mean WER |
|---|---|---|---|---|---|---|
| Wav2vec2.0 | 0.364 | 0.553 | 0.281 | 0.467 | 0.323 | 0.500 |
| MDKT-bi | 0.319 | 0.481 | 0.275 | 0.471 | 0.297 | 0.476 |
| Wav2vec2.0 + AttA | 0.332 | 0.496 | 0.239 | 0.396 | 0.286 | 0.446 |
| Wav2vec2.0 + WFA | 0.337 | 0.507 | 0.227 | 0.392 | 0.282 | 0.449 |
| MDKT-tri | 0.310 | 0.472 | 0.204 | 0.359 | 0.257 | 0.411 |
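The CER and WER columns in the results tables are standard edit-distance metrics: the Levenshtein distance between hypothesis and reference, normalized by reference length, over characters and words respectively. A minimal reference sketch (not the authors’ evaluation code) for a single utterance:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, via dynamic programming."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion from hyp
                           cur[j - 1] + 1,           # insertion into hyp
                           prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: token-level edit distance over reference token count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("abcd", "abed"))  # → 0.25 (one substitution over four characters)
```

Corpus-level scores are typically the total edit distance divided by the total reference length, rather than an average of per-utterance rates.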
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, X.; Wang, Y.; Liu, X.; Su, K.; Li, Z.; Wang, Y.; Jiang, B.; Xie, K.; Liu, J. JLMS25 and Jiao-Liao Mandarin Speech Recognition Based on Multi-Dialect Knowledge Transfer. Appl. Sci. 2025, 15, 1670. https://doi.org/10.3390/app15031670