Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language

Abstract

Rapid advances in AI in recent years have led to many Data Augmentation (DA) approaches being investigated to increase data efficiency in Natural Language Processing (NLP). NLP models depend on large amounts of data, and building such data, for example by labelling enormous amounts of text, requires substantial time, money, and human resources; a better model therefore comes at the cost of more data. Text DA techniques address this by extending the existing data, which improves the model's accuracy and robustness. A novel lexical-based matching approach is the cornerstone of this work; it is used to improve the quality of a Machine Translation (MT) system. The study includes resource-rich Indic languages (i.e., from the Indo-Aryan and Dravidian language families) to examine the proposed technique. Extensive experiments on a range of language pairs show that the proposed method significantly improves BLEU, METEOR and ROUGE scores on the augmented dataset compared with the baseline system.
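
As background for the evaluation scores named above, the following is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on. It is an illustration only, not the authors' evaluation pipeline: the function names `ngrams` and `modified_precision` are hypothetical, tokenization is assumed to be plain whitespace splitting, and a single reference sentence is assumed. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch omits.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference.

    Assumption: both inputs are plain strings tokenized on whitespace.
    Each hypothesis n-gram is credited at most as many times as it
    occurs in the reference, so repeating a correct word is not rewarded.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        # The hypothesis is shorter than n tokens, so it has no n-grams.
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / total
```

For example, `modified_precision("the the the the", "the cat is on the mat", 1)` credits only two of the four hypothesis tokens, because "the" occurs twice in the reference.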

Data Availability

The datasets generated and/or analysed during the current study are available in the AI4Bharat/indicnlp_catalog repository, https://github.com/AI4Bharat/indicnlp_catalog

Funding

This research received no grant or funding from public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology, Hamirpur, India
Shefali Saxena, Ayush Gupta & Philemon Daniel

Corresponding author

Correspondence to Shefali Saxena.

Ethics declarations

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Saxena, S., Gupta, A. & Daniel, P. Efficient Data Augmentation via lexical matching for boosting performance on Statistical Machine Translation for Indic and a Low-resource language. Multimed Tools Appl 83, 64255–64269 (2024). https://doi.org/10.1007/s11042-023-18086-8

Received: 27 November 2022
Revised: 31 October 2023
Accepted: 29 December 2023
Published: 15 January 2024
Issue Date: July 2024
DOI: https://doi.org/10.1007/s11042-023-18086-8
