Efficient Conformer for Agglutinative Language ASR Model Using Low-Rank Approximation and Balanced Softmax
Figure 1. Low-rank encoder structure diagram.
Figure 2. Low-rank decoder structure diagram.
Figure 3. Illustration of our end-to-end Conformer framework using low-rank approximate decomposition and balanced softmax.
Figure 4. Training loss of different rank values of the Conformer model and the baseline model.
Figure 5. Training loss with and without CTC as an auxiliary task.
Figure 6. Training loss of the proposed efficient Conformer model and the baseline Conformer model.
Abstract
1. Introduction
2. Related Work
3. Methods
3.1. Low-Rank Approximation Decomposition
3.1.1. Low-Rank Encoder
3.1.2. Low-Rank Decoder
3.2. Balanced Softmax
3.3. Connectionist Temporal Classification
3.4. Efficient Conformer ASR
4. Experimental Environment
4.1. Dataset
4.2. Experimental Configuration
5. Results and Discussion
5.1. Low-Rank Decomposition Experiment
5.2. Experiment with Balanced Softmax
5.3. Efficient Conformer Model Using Low-Rank Decomposition and Balanced Softmax
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv 2019, arXiv:1901.02860. [Google Scholar]
- Winata, G.I.; Cahyawijaya, S.; Lin, Z.; Liu, Z.; Fung, P. Lightweight and efficient end-to-end speech recognition using low-rank transformer. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6144–6148. [Google Scholar]
- Wang, X.; Sun, S.; Xie, L.; Ma, L. Efficient conformer with prob-sparse attention mechanism for end-to-end speech recognition. arXiv 2021, arXiv:2106.09236. [Google Scholar]
- Burchi, M.; Vielzeuf, V. Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 8–15. [Google Scholar]
- Li, S.; Xu, M.; Zhang, X.-L. Efficient conformer-based speech recognition with linear attention. In Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; pp. 448–453. [Google Scholar]
- Xue, J.; Li, J.; Yu, D.; Seltzer, M.; Gong, Y. Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 6359–6363. [Google Scholar]
- Jamal, M.A.; Brown, M.; Yang, M.-H.; Wang, L.; Gong, B. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7610–7619. [Google Scholar]
- Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11662–11671. [Google Scholar]
- Tang, K.; Huang, J.; Zhang, H. Long-tailed classification by keeping the good and removing the bad momentum causal effect. Adv. Neural Inf. Process. Syst. 2020, 33, 1513–1524. [Google Scholar]
- Khassanov, Y.; Mussakhojayeva, S.; Mirzakhmetov, A.; Adiyev, A.; Nurpeiissov, M.; Varol, H.A. A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline. arXiv 2020, arXiv:2009.10334. [Google Scholar]
- Goldwater, S.; Jurafsky, D.; Manning, C.D. Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase ASR error rates. In Proceedings of ACL-08: HLT, Columbus, OH, USA, June 2008; pp. 380–388. [Google Scholar]
- Lukeš, D.; Kopřivová, M.; Komrsková, Z.; Poukarová, P. Pronunciation variants and ASR of colloquial speech: A case study on Czech. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Das, A.; Li, J.; Zhao, R.; Gong, Y. Advancing connectionist temporal classification with attention modeling. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4769–4773. [Google Scholar]
- Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar]
- Sainath, T.N.; He, Y.; Li, B.; Narayanan, A.; Pang, R.; Bruguier, A.; Chang, S.-y.; Li, W.; Alvarez, R.; Chen, Z. A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6059–6063. [Google Scholar]
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964. [Google Scholar]
- Chiu, C.-C.; Sainath, T.N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R.J.; Rao, K.; Gonina, E. State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4774–4778. [Google Scholar]
- Krishna, D. A Dual-Decoder Conformer for Multilingual Speech Recognition. arXiv 2021, arXiv:2109.03277. [Google Scholar]
- Zeineldeen, M.; Xu, J.; Lüscher, C.; Michel, W.; Gerstenberger, A.; Schlüter, R.; Ney, H. Conformer-based hybrid ASR system for Switchboard dataset. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7437–7441. [Google Scholar]
- Watanabe, S.; Hori, T.; Kim, S.; Hershey, J.R.; Hayashi, T. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1240–1253. [Google Scholar] [CrossRef]
- Kim, S.; Hori, T.; Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4835–4839. [Google Scholar]
- Winata, G.I.; Madotto, A.; Shin, J.; Barezi, E.J.; Fung, P. On the effectiveness of low-rank matrix factorization for lstm model compression. arXiv 2019, arXiv:1908.09982. [Google Scholar]
- Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6124–6128. [Google Scholar]
- Mehrotra, A.; Dudziak, Ł.; Yeo, J.; Lee, Y.-y.; Vipperla, R.; Abdelfattah, M.S.; Bhattacharya, S.; Ishtiaq, S.; Ramos, A.G.C.; Lee, S. Iterative compression of end-to-end asr model using automl. arXiv 2020, arXiv:2008.02897. [Google Scholar]
- Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
- Chang, H.-J.; Yang, S.-W.; Lee, H.-Y. DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7087–7091. [Google Scholar]
- Lv, Y.; Wang, L.; Ge, M.; Li, S.; Ding, C.; Pan, L.; Wang, Y.; Dang, J.; Honda, K. Compressing Transformer-Based ASR Model by Task-Driven Loss and Attention-Based Multi-Level Feature Distillation. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7992–7996. [Google Scholar]
- Lin, Z.; Liu, J.Z.; Yang, Z.; Hua, N.; Roth, D. Pruning redundant mappings in transformer models via spectral-normalized identity prior. arXiv 2020, arXiv:2010.01791. [Google Scholar]
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 1–3 November 2017; pp. 1–5. [Google Scholar]
- Du, J.; Na, X.; Liu, X.; Bu, H. AISHELL-2: Transforming Mandarin ASR research into industrial scale. arXiv 2018, arXiv:1808.10583. [Google Scholar]
- Godfrey, J.J.; Holliman, E.C.; McDaniel, J. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), San Francisco, CA, USA, 23–26 March 1992; pp. 517–520. [Google Scholar]
- Maekawa, K. Corpus of Spontaneous Japanese: Its design and evaluation. In Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, 13–16 April 2003. [Google Scholar]
- Bang, J.-U.; Yun, S.; Kim, S.-H.; Choi, M.-Y.; Lee, M.-K.; Kim, Y.-J.; Kim, D.-H.; Park, J.; Lee, Y.-J.; Kim, S.-H. Ksponspeech: Korean spontaneous speech corpus for automatic speech recognition. Appl. Sci. 2020, 10, 6936. [Google Scholar] [CrossRef]
- Mamyrbayev, O.; Oralbekova, D.; Kydyrbekova, A.; Turdalykyzy, T.; Bekarystankyzy, A. End-to-end model based on RNN-T for Kazakh speech recognition. In Proceedings of the 2021 3rd International Conference on Computer Communication and the Internet (ICCCI), Nagoya, Japan, 25–27 June 2021; pp. 163–167. [Google Scholar]
- Orken, M.; Dina, O.; Keylan, A.; Tolganay, T.; Mohamed, O. A study of transformer-based end-to-end speech recognition system for Kazakh language. Sci. Rep. 2022, 12, 8337. [Google Scholar] [CrossRef] [PubMed]
- Mamyrbayev, O.; Alimhan, K.; Zhumazhanov, B.; Turdalykyzy, T.; Gusmanova, F. End-to-end speech recognition in agglutinative languages. In Proceedings of the Intelligent Information and Database Systems: 12th Asian Conference, ACIIDS 2020, Phuket, Thailand, 23–26 March 2020; pp. 391–401. [Google Scholar]
- Mamyrbayev, O.Z.; Oralbekova, D.O.; Alimhan, K.; Nuranbayeva, B.M. Hybrid end-to-end model for Kazakh speech recognition. Int. J. Speech Technol. 2022, 1–10. [Google Scholar] [CrossRef]
- Toshniwal, S.; Kannan, A.; Chiu, C.-C.; Wu, Y.; Sainath, T.N.; Livescu, K. A comparison of techniques for language model integration in encoder-decoder speech recognition. In Proceedings of the 2018 IEEE spoken language technology workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 369–375. [Google Scholar]
- Sriram, A.; Jun, H.; Satheesh, S.; Coates, A. Cold fusion: Training seq2seq models together with language models. arXiv 2017, arXiv:1708.06426. [Google Scholar]
- Huang, W.R.; Sainath, T.N.; Peyser, C.; Kumar, S.; Rybach, D.; Strohman, T. Lookup-table recurrent language models for long tail speech recognition. arXiv 2021, arXiv:2104.04552. [Google Scholar]
- Winata, G.I.; Wang, G.; Xiong, C.; Hoi, S. Adapt-and-adjust: Overcoming the long-tail problem of multilingual speech recognition. arXiv 2020, arXiv:2012.01687. [Google Scholar]
- Deng, K.; Cheng, G.; Yang, R.; Yan, Y. Alleviating asr long-tailed problem by decoupling the learning of representation and classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 340–354. [Google Scholar] [CrossRef]
- Ren, J.; Yu, C.; Ma, X.; Zhao, H.; Yi, S. Balanced meta-softmax for long-tailed visual recognition. Adv. Neural Inf. Process. Syst. 2020, 33, 4175–4186. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
| KSC | |
|---|---|
| Training set | KSC-train |
| Utterances | 147,236 |
| Tail word count (types) | 152,566 |
| Head word count (types) | 4625 |
| Tail word frequency (tokens) | 558,919 |
| Head word frequency (tokens) | 1,044,443 |
| Frequency ratio of head to tail | 65.2:34.8 |
| Development set | KSC-dev |
| Utterances | 3283 |
| Tail word frequency (tokens) | 11,288 |
| Head word frequency (tokens) | 21,730 |
| Frequency ratio of head to tail | 65.8:34.2 |
| Testing set | KSC-test |
| Utterances | 3334 |
| Tail word frequency (tokens) | 13,505 |
| Head word frequency (tokens) | 21,830 |
| Frequency ratio of head to tail | 61.8:38.2 |
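As a sanity check on the corpus statistics above, the split ratios can be recomputed from the raw token counts; the arithmetic shows that the first number in each ratio row is the head-word share of total token frequency (e.g. 21,730/(11,288 + 21,730) ≈ 65.8% for KSC-dev). A minimal sketch:

```python
# Head/tail token frequencies per KSC split, copied from the table above.
SPLITS = {
    "KSC-dev":  {"tail": 11_288, "head": 21_730},
    "KSC-test": {"tail": 13_505, "head": 21_830},
}

def freq_ratio(tail: int, head: int) -> tuple:
    """Return (head share, tail share) of total token frequency, in percent."""
    total = tail + head
    return round(100 * head / total, 1), round(100 * tail / total, 1)

for name, c in SPLITS.items():
    head_pct, tail_pct = freq_ratio(c["tail"], c["head"])
    print(f"{name}: head {head_pct}% / tail {tail_pct}%")
```

KSC-dev yields 65.8:34.2 and KSC-test 61.8:38.2, matching the reported ratios up to rounding.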
| Model | Number of Parameters | WER |
|---|---|---|
| Traditional hybrid structure | | |
| DNN-HMM [13] | - | 13.7% |
| End-to-end architecture | | |
| BLSTM-CTC [13] | - | 28.8% |
| Transformer-CTC | 31.7 M | 12.85% |
| Baseline (Conformer-CTC) | 47.6 M | 10.36% |
| Baseline + low rank (r = 128) | 47.6 M | 10.33% |
| Baseline + low rank (r = 64) | 44.1 M | 10.85% |
| Baseline + low rank (r = 32) | 42.3 M | 11.00% |
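The parameter savings in the table follow directly from replacing a dense weight matrix with two rank-r factors: an m × n matrix costs m·n parameters, while the factor pair costs only r·(m + n). The sketch below uses a truncated-SVD factorization with an illustrative square dimension d = 512 (an assumption for the example, not a value from the paper), and shows the compression and the reconstruction error both depending on r:

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, r: int):
    """Approximate W (m x n) as A @ B with A (m x r), B (r x n) via
    truncated SVD -- the optimal rank-r factorization in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # fold the singular values into the left factor
    B = Vt[:r, :]
    return A, B

d = 512  # illustrative layer dimension (assumption)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))

for r in (32, 64, 128):
    A, B = low_rank_factorize(W, r)
    params = A.size + B.size                        # r * (d + d)
    err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
    print(f"r={r:3d}: {params:6d} params ({params / W.size:.1%} of dense), rel. error {err:.2f}")
```

Note that a factorized square layer saves parameters only when 2r < d, which is consistent with the r = 128 row above showing no parameter reduction while r = 64 and r = 32 do.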
| Model | Size | ΔSize | ΔWER |
|---|---|---|---|
| Baseline (Conformer-CTC) | 182 MB | 0 | 0 |
| Baseline + low rank (r = 128) | 182 MB | 0 | +0.03% |
| Baseline + low rank (r = 64) | 168.5 MB | −13.5 MB | −0.49% |
| Baseline + low rank (r = 32) | 161.8 MB | −20.2 MB | −0.64% |
| Model | WER |
|---|---|
| DNN-HMM [13] | 13.7% |
| BLSTM-CTC [13] | 28.8% |
| Transformer-CTC | 12.85% |
| Conformer (no CTC) | 15.38% |
| Baseline (Conformer-CTC) | 10.36% |
| Baseline + balanced softmax | 10.21% |
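Balanced softmax compensates for the long-tailed label distribution by shifting each logit by the log of its class frequency before normalization, so head classes must win by a larger margin and tail classes are not crowded out. A minimal numpy sketch; using an exponent `tau` as the stand-in for the paper's size penalty factor is our assumption about where that factor enters:

```python
import numpy as np

def balanced_softmax_loss(logits, label, class_counts, tau=1.0):
    """Cross-entropy over softmax(logits + tau * log(class_counts)).
    tau = 1 recovers the balanced softmax of Ren et al.; treating tau
    as the size penalty factor is an assumption of this sketch."""
    shifted = logits + tau * np.log(class_counts)
    shifted = shifted - shifted.max()           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

counts = np.array([1000.0, 10.0])   # head class vs. tail class frequencies
logits = np.array([2.0, 2.0])       # the model is undecided between the two

# An undecided model is penalized much harder on the tail class, which
# pushes training to raise tail-class logits.
print(balanced_softmax_loss(logits, 1, counts))   # tail-class loss (large)
print(balanced_softmax_loss(logits, 0, counts))   # head-class loss (small)
```

With equal class counts the shift cancels and the loss reduces to ordinary softmax cross-entropy.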
| Word in KSC | Translation * | Number of Appearances | Baseline | With Balanced Softmax |
|---|---|---|---|---|
| eгдe | older | one | eкiдe | eгдe |
| пaзл | puzzle | one | пaзыл | пaзл |
| инeгe | to the needle | one | eнeгe | инeгe |
| Күптi | a lot | one | Күттi | Күптi |
| жұлын | spinal cord | one | жoлын | жұлын |
| Кeceктepдi | pieces | one | Кeзeктepдi | Кeceктepдi |
| қызy | hot | one | қызыл | қызy |
| Size Penalty Factor | WER |
|---|---|
| 1.0 | 10.36% |
| 1.1 | 10.21% |
| 1.2 | 10.25% |
| Position | WER |
|---|---|
| Training + inference phases | 10.21% |
| Training only | 10.34% |
| Inference only | 10.34% |
| Model | Parameters | Size | WER |
|---|---|---|---|
| Baseline | 47.6 M | 182 MB | 10.36% |
| Baseline + low-rank compression | 44.1 M | 168.5 MB | 10.85% |
| Baseline + balanced softmax | 47.6 M | 182 MB | 10.21% |
| Baseline + low-rank compression + balanced softmax | 44.1 M | 168.5 MB | 10.54% |
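The combined model's trade-off in the summary table above can be stated in relative terms with a line of arithmetic:

```python
# Baseline vs. combined model (low rank + balanced softmax), values from the table above.
baseline = {"params_m": 47.6, "size_mb": 182.0, "wer": 10.36}
combined = {"params_m": 44.1, "size_mb": 168.5, "wer": 10.54}

size_saving = 1 - combined["size_mb"] / baseline["size_mb"]
param_saving = 1 - combined["params_m"] / baseline["params_m"]
wer_delta = combined["wer"] - baseline["wer"]

print(f"model size: -{size_saving:.1%}")    # about 7.4% smaller
print(f"parameters: -{param_saving:.1%}")   # about 7.4% fewer
print(f"WER: +{wer_delta:.2f} absolute")    # 0.18 points above baseline
```

In other words, the joint model keeps the full size reduction of the r = 64 low-rank variant while staying within 0.18 absolute WER of the uncompressed baseline.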
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Guo, T.; Yolwas, N.; Slamu, W. Efficient Conformer for Agglutinative Language ASR Model Using Low-Rank Approximation and Balanced Softmax. Appl. Sci. 2023, 13, 4642. https://doi.org/10.3390/app13074642