Abstract
In this paper, we introduce an ongoing project for the development of a parallel treebank for Italian, English and French. The treebank is annotated in a dependency format, namely the one designed for the Turin University Treebank (TUT), hence the name of the new resource, Par(allel)TUT. The project aims to create a resource that is useful in particular for translation research. Therefore, besides continually enriching the treebank with new and heterogeneous data, so as to build a dynamic and balanced multilingual treebank, the current stage of the project is devoted to the design of a tool for aligning the data, one that takes into account the syntactic knowledge annotated in this kind of resource. The paper focuses in particular on the study of translational divergences and their implications for the development of the alignment tool. It provides an overview of the treebank, with its current content and the peculiarities of its annotation format, and describes the classes of translational divergences that can be encountered in the treebank, together with a proposal for their alignment.
Notes
- 1.
- 2.
- 3.
Unlike in work on statistical machine translation, phrase alignment in this work is intended as an alignment between linguistically motivated phrases.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
http://www.statmt.org/europarl/; the section used is ep_00_01_17.
- 15.
Namely the “Help” section, at https://www.facebook.com/help/345121355559712/.
- 16.
The section used is jrc52006DC243.
- 17.
- 18.
https://wit3.fbk.eu/; we retrieved the texts used for training of MT systems, downloaded from https://wit3.fbk.eu/mt.php?release=2012-02.
- 19.
As for the sentence count, we would like to clarify that some sub-corpora, especially the UDHR, contain short headings (e.g. ‘Article 1’) that we did not consider when calculating the average sentence length, even though they were treated as separate sentences according to the parser segmentation criteria.
- 20.
In general, considering the sources from which the texts of ParTUT have been retrieved, it can be assumed that not all of them are originals: they were drafted in one or more languages and then translated into the others.
- 21.
In this paper, we report examples of sentences (or fragments of sentences) in all the languages involved. Glosses are provided for the non-English examples; they are intended to be literal and do not necessarily correspond to correct English expressions.
- 22.
In the Italian TUT there is a third component (omitted here and in the current ParTUT annotation) concerning the semantic role of the dependent with respect to its governor.
- 23.
The TUTtoPenn converter can be downloaded at http://www.di.unito.it/~tutreeb/TUTtoPENNconverter/.
- 24.
- 25.
A semi-automatic alignment has also been performed with LF Aligner (http://sourceforge.net/projects/aligner/).
- 26.
These labels are used to identify the treebank fragment we refer to in the examples: they indicate section_language#sentencenumber.
- 27.
Since the translation direction of the ParTUT texts is unknown, we consider the two transformation strategies as counterparts of each other and put them in the same subclass, whereas other works treat them as separate categories [8]. We applied the same principle to the cases of addition/deletion mentioned below.
- 28.
In this example, in particular, we observe both additions and deletions while comparing the English sentence to the French version.
- 29.
References
Bosco, C., Mazzei, A.: The EVALITA dependency parsing task: from 2007 to 2011. In: Proceedings of Evalita 2011, Evaluation of Natural Language and Speech Tools for Italian. LNCS/LNAI, Springer (2012)
Bosco, C., Mazzei, A., Lavelli, A.: Looking back to the EVALITA constituency parsing task: 2007–2011. In: Proceedings of Evalita 2011, Evaluation of Natural Language and Speech Tools for Italian. LNCS/LNAI, Springer (2012)
Bosco, C., Simi, M., Montemagni, S.: Converting Italian Treebanks: towards an Italian Stanford dependency treebank. In: Proceedings of the ACL’13 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW&ID), Sofia, Bulgaria (2013)
Buchholz, S., Marsi, E.: CoNLL-X shared task on multilingual dependency parsing. In: Proceedings of CoNLL (2006)
Catford, J.C.: A Linguistic Theory of Translation: An Essay on Applied Linguistics. Oxford University Press, Oxford (1965)
Cettolo, M., Ghirardi, F., Federico, M.: WIT3: a web inventory of transcribed talks. In: Proceedings of the 16th EAMT Conference, Trento, Italy (2012)
Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal recursion semantics: an introduction. Res. Lang. Comput. 3(4), 281–332 (2005)
Cyrus, L.: Building a resource for studying translation shifts. In: Proceedings of Language Resources and Evaluation Conference (LREC’06), Genova, Italy (2006)
de Marneffe, M.-C., Manning, C.D.: The Stanford typed dependencies representation. In: Proceedings of the COLING’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation (CrossParser’08), Manchester, United Kingdom (2008)
Ding, Y., Palmer, M.: Automatic learning of parallel dependency treelet pairs. In: Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04) (2004)
Ding, Y., Gildea, D., Palmer, M.: An algorithm for word-level alignment of parallel dependency trees. In: The 9th Machine Translation Summit of the International Association for Machine Translation (2003)
Dyvik, H., Meurer, P., Rosén, V., De Smedt, K.: Linguistically motivated parallel parsebanks. In: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT8) (2009)
Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., Carvalheiro, C., Costa, F., Castro, S.: ParDeepBank: multiple parallel deep treebanking. In: Proceedings of the Eleventh International Workshop on Treebanks and Linguistic Theories (2012)
Fox, H.J.: Phrasal cohesion and statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP’02) (2002)
Hajič, J., Zemánek, P.: Prague Arabic dependency treebank: development in data and tools. In: Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools (2003)
Hearne, M., Tinsley, J., Zhechev, V., Way, A.: Capturing translational divergences with a statistical tree-to-tree aligner. In: Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07) (2007)
Hudson, R.: Word Grammar. Blackwell, Oxford (1984)
Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In: Machine Translation Summit X, Phuket, Thailand (2005)
Lavie, A., Parlikar, A., Ambati, V.: Syntax-driven learning of sub-sentential translation equivalents and translation rules from parsed parallel corpora. In: Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation (SSST’08) (2008)
Lesmo, L.: The Turin University Parser at Evalita 2009. In: Proceedings of Evalita’09, Reggio Emilia, Italy (2009)
Ma, Y., Ozdowska, S., Sun, Y., Way, A.: Improving word alignment using syntactic dependencies. In: Proceedings of the Second ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2) (2008)
Mareček, D., Žabokrtský, Z., Novák, V.: Automatic alignment of Czech and English deep syntactic dependency trees. In: Proceedings of the 12th EAMT Conference (2008)
Menezes, A., Richardson, S.D.: A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora. In: Proceedings of the Workshop on Data-driven Methods in Machine Translation at ACL-2001 (2001)
Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: From Research to Real Users, Tiburon, California (2002)
Nakazawa, T., Kurohashi, S.: Bayesian subtree alignment model based on dependency trees. In: Proceedings of 5th Joint Conference on Natural Language Processing, Chiang Mai, Thailand (2011)
Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D.: The CoNLL 2007 shared task on dependency parsing. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007 (2007)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. In: Computational Linguistics, vol. 29(1). MIT Press, Cambridge (2003)
Osborne, T., Putnam, M., Gross, T.: Catenae: introducing a novel unit of syntactic analysis. In: Syntax, 15(4) (2012)
Ozdowska, S.: Using bilingual dependencies to align words in English/French parallel corpora. In: Proceedings of the ACL Student Research Workshop (2005)
Sanguinetti, M., Bosco, C., Cupi, L.: Exploiting catenae in a parallel treebank alignment. In: Proceedings of the 9th Language Resources and Evaluation Conference (LREC’14). Reykjavik, Iceland (2014)
Simov, K., Osenova, P., Laskova, L., Savkov, A., Kancheva, S.: Bulgarian-English parallel treebank: word and semantic level alignment. In: Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria (2011)
Simov, K., Osenova, P.: Bulgarian-English treebank: design and implementation. In: Linguist. Issues Lang. Technol. - LiLT 7(14) (2012)
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D.: The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of Language Resources and Evaluation Conference (LREC’06), Genova (2006)
Tiedemann, J., Kotzé, G.: Building a large machine-aligned parallel treebank. In: Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT8) (2009)
Vinay, J.P., Darbelnet, J.: Comparative Stylistics of French and English. John Benjamins, Amsterdam and Philadelphia (1958)
Zhechev, V., Way, A.: Automatic generation of parallel treebanks. In: 22nd International Conference on Computational Linguistics (COLING 2008) (2008)
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Sanguinetti, M., Bosco, C. (2015). ParTUT: The Turin University Parallel Treebank. In: Basili, R., Bosco, C., Delmonte, R., Moschitti, A., Simi, M. (eds) Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project. Studies in Computational Intelligence, vol 589. Springer, Cham. https://doi.org/10.1007/978-3-319-14206-7_3
DOI: https://doi.org/10.1007/978-3-319-14206-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14205-0
Online ISBN: 978-3-319-14206-7
eBook Packages: Engineering, Engineering (R0)