Abstract
We present the first publicly available machine translation (MT) system for Basque. Because Basque is both morphologically rich and less-resourced, statistical approaches are difficult to apply, which motivates a rule-based architecture that can be combined with statistical techniques in the future. The proposed MT architecture reuses several open-source tools and relies on a single XML format for the data passed between modules, which eases collaboration among the developers of the different tools and resources. The result is Matxin, an open-source rule-based MT toolkit whose first implementation translates from Spanish to Basque. We have performed innovative work on the following tasks: construction of a dependency analyser for Spanish; use of rich linguistic information to translate prepositions and syntactic functions (such as subject and object markers); construction of an efficient module for verbal chunk transfer; and design and implementation of modules for ordering words and phrases independently of the source language.
Cite this article
Mayor, A., Alegria, I., Díaz de Ilarraza, A. et al. Matxin, an open-source rule-based machine translation system for Basque. Machine Translation 25, 53–82 (2011). https://doi.org/10.1007/s10590-011-9092-y
DOI: https://doi.org/10.1007/s10590-011-9092-y