Abstract
In this chapter, we define a collaboration framework that enables Wikipedia editors to generate new articles while helping to develop Machine Translation (MT) systems by providing post-editing logs. The framework was tested with editors of Basque Wikipedia, whose post-editing of Computer Science articles was used to improve the output of Matxin, a Spanish-to-Basque MT system. For the collaboration between editors and researchers, we selected a set of 100 articles from the Spanish Wikipedia. These articles were used as the source texts to be translated into Basque with the MT engine, and a group of volunteers from Basque Wikipedia reviewed and corrected the raw MT translations. The collaboration produced two main benefits: (i) change logs that can help improve the MT engine through an automated statistical post-editing system, and (ii) the growth of Basque Wikipedia. The results show that this process can improve the accuracy of a rule-based MT (RBMT) system by nearly 10 %, drawing on the post-editing of 50,000 words in the Computer Science domain. We believe that our conclusions can be extended to MT engines for other less-resourced languages that lack large parallel corpora or frequently updated lexical knowledge, as well as to other domains.
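The workflow described in the abstract can be illustrated with a minimal sketch, under stated assumptions. The record layout (`PostEditRecord`), the function names, and the toy data are all hypothetical and are not taken from the chapter. The sketch shows how a post-editing log could be turned into (raw MT output, post-edited text) pairs of the kind a statistical post-editing system is trained on, and how the amount of correction could be estimated with a plain word-level edit distance. This measure is simpler than Translation Edit Rate, since it does not model word shifts.

```python
from dataclasses import dataclass


@dataclass
class PostEditRecord:
    """One segment of a hypothetical post-editing log (layout is an assumption)."""
    source: str       # source-language sentence
    mt_output: str    # raw machine translation
    post_edited: str  # translation after a volunteer corrected it


def word_edit_distance(hyp: str, ref: str) -> int:
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))  # distances from the empty hypothesis
    for i, hw in enumerate(h, start=1):
        curr = [i]
        for j, rw in enumerate(r, start=1):
            cost = 0 if hw == rw else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def training_pairs(log):
    """(MT output, post-edited) pairs, the input a post-editing model learns from."""
    return [(rec.mt_output, rec.post_edited) for rec in log]


def edit_rate(log) -> float:
    """Total word edits divided by total post-edited words over the whole log."""
    edits = sum(word_edit_distance(r.mt_output, r.post_edited) for r in log)
    ref_words = sum(len(r.post_edited.split()) for r in log)
    return edits / ref_words if ref_words else 0.0


if __name__ == "__main__":
    # Toy placeholder tokens, not real translations.
    toy_log = [
        PostEditRecord("s1", "a b c", "a x c"),
        PostEditRecord("s2", "a b", "a b c"),
    ]
    print(training_pairs(toy_log))
    print(round(edit_rate(toy_log), 3))
```

Unchanged segments are kept in the training pairs on purpose, so that a model trained on them can also learn which MT output should be left alone.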
Notes
- 1.
There are around 700,000 speakers, around 25 % of the total population of the Basque Country.
- 18.
http://siuc01.si.ehu.es/~jipsagak/OpenMT_Wiki/Eskuliburua_Euwikipedia+Omegat+Matxin.pdf
Acknowledgements
This research was supported in part by the Spanish Ministry of Education and Science (OpenMT2, TIN2009-14675-C03-01) and by the Basque Government (Berbatek project, IE09–262). We are indebted to all the collaborators in the project and especially to the editors of the Basque Wikipedia. Elhuyar and Julen Ruiz helped us to collect resources for the customization of the RBMT engine to the domain of Computer Science.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Alegria, I. et al. (2013). Reciprocal Enrichment Between Basque Wikipedia and Machine Translation. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_4
DOI: https://doi.org/10.1007/978-3-642-35085-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science (R0)