Abstract
In this paper we propose a computational method for determining the syntactic similarity between languages. We investigate multiple approaches and metrics, showing that the results are consistent across methods. We report results on 16 languages belonging to various language families. The analysis that we conduct is adaptable to any language, as long as resources are available.
Notes
- 1.
The authors also present a brief history of the syntactic approaches and acknowledge the work of [30], a pioneer in this field and a forerunner of their approach.
- 2.
We use dependency parsing [21], so we rely on dependency trees to compute the syntactic similarity.
- 3.
Tagging and parsing accuracy for each language: Bg: 0.96,0.87; Cs: 0.98,0.82; Da: 0.95,0.82; De: 0.92,0.67; El: 0.96,0.82; En: 0.94,0.85; Es: 0.95,0.59; Et: 0.94,0.80; Fi: 0.93,0.75; Fr: 0.96,0.61; Hu: 0.92,0.79; It: 0.97,0.47; Nl: 0.89,0.77; Pt: 0.96,0.83; Ro: 0.95,0.82; Sv: 0.95,0.84.
- 4.
We believe that our investigation is not negatively influenced by the choice of corpus because we are consistent across all experiments in terms of text genre and we report results obtained solely by comparison between languages on the same dataset. In future work, we intend to apply the proposed methods to other datasets as well (for example, the EUR-Lex Corpus [2]).
- 5.
While the effect of translation cannot be denied, we rely on the fact that the interpreters/translators of Europarl are native speakers of the target language, which reduces the impact of the source language on translations significantly (as opposed to the translations performed by language learners, for example).
- 6.
We repeated Exp. #1 and #2 using the rank distance [7] instead of the edit distance, and there were no significant differences in the results.
- 7.
The development of Romanian far from the main Romance core made contact between Romanian and the other Romance languages difficult until the 18th century. Instead, from the 9th to the 17th century there was a significant cultural influence of the South Slavic languages (especially Old Slavic), due in part to the exclusive use of Old Church Slavonic for religious purposes, which led to giving South Slavic “the status of a cultural superstrate language” [36].
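Note 6 contrasts the edit distance with the rank distance [7]. As a rough illustration only, the sketch below computes both on flat token sequences, for example sequences of part-of-speech tags. It is not the authors' implementation: the function names are hypothetical, the rank distance follows one common formulation for strings (each symbol is indexed by its occurrence number and positions are counted from 1), and the experiments in the paper may use different inputs or a different variant.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x from a
                curr[j - 1] + 1,          # insert y into a
                prev[j - 1] + (x != y),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def _indexed_positions(seq: Sequence[Hashable]) -> Dict[Tuple[Hashable, int], int]:
    """Map (symbol, k-th occurrence) to its 1-based position in seq."""
    seen: Dict[Hashable, int] = defaultdict(int)
    positions: Dict[Tuple[Hashable, int], int] = {}
    for pos, sym in enumerate(seq, start=1):
        seen[sym] += 1
        positions[(sym, seen[sym])] = pos
    return positions


def rank_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Rank distance (assumed formulation): shared indexed symbols contribute
    the absolute difference of their positions; symbols present in only one
    sequence contribute their position in that sequence."""
    pa, pb = _indexed_positions(a), _indexed_positions(b)
    total = 0
    for key in pa.keys() | pb.keys():
        if key in pa and key in pb:
            total += abs(pa[key] - pb[key])
        else:
            total += pa.get(key, 0) + pb.get(key, 0)
    return total


if __name__ == "__main__":
    # Hypothetical tag sequences, chosen only to show the two measures differ.
    s1 = ["DET", "NOUN", "VERB"]
    s2 = ["NOUN", "VERB", "DET"]
    print(edit_distance(s1, s2), rank_distance(s1, s2))
```

The two measures weigh reordering differently: the edit distance counts a moved symbol as one deletion plus one insertion, whereas the rank distance grows with how far the symbol moved.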
References
Augsten, N., Böhlen, M., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: Proceedings of VLDB 2005, pp. 301–312 (2005)
Baisa, V., Michelfeit, J., Medveď, M., Jakubíček, M.: European Union language resources in Sketch Engine. In: Proceedings of LREC 2016, pp. 2799–2803 (2016)
Bortolussi, L., Sgarro, A., Dinu, L.P.: Measures of fuzzy disarray in linguistic typology. In: Proceedings of IPMU 2008, pp. 167–172 (2008)
Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., Specia, L.: Findings of the 2012 workshop on statistical machine translation. In: Proceedings of WMT 2012, pp. 10–51 (2012)
Charniak, E., Knight, K., Yamada, K.: Syntax-based language models for statistical machine translation. In: Proceedings of the 9th Machine Translation Summit (2003)
Ciobanu, A.M., Dinu, L.P.: An etymological approach to cross-language orthographic similarity. Application on Romanian. In: Proceedings of EMNLP 2014, pp. 1047–1058 (2014)
Dinu, A., Dinu, L.P.: On the syllabic similarities of romance languages. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 785–788. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30586-6_88
Dryer, M.S.: 81 order of subject, object, and verb. In: The World Atlas of Language Structures, pp. 330–333 (2005)
Duma, M., Vertan, C., Menzel, W.: A new syntactic metric for evaluation of machine translation. In: Proceedings of the ACL Student Research Workshop, pp. 130–135 (2013)
Dunn, M., Greenhill, S., Levinson, S., Gray, R.: Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473(7345), 79–82 (2011)
Dunn, M., Terrill, A., Reesink, G., Foley, R., Levinson, S.: Structural phylogenetics and the reconstruction of ancient language history. Science 309(5743), 2072–2075 (2005)
Eger, S., Hoenen, A., Mehler, A.: Language classification from bilingual word embedding graphs. In: Proceedings of COLING 2016, Technical Papers, pp. 3507–3518 (2016)
Eger, S., Schenk, N., Mehler, A.: Towards semantic language classification: inducing and clustering semantic association networks from Europarl. In: Proceedings of *SEM 2015, pp. 127–136 (2015)
Futrell, R., Mahowald, K., Gibson, E.: Quantifying word order freedom in dependency corpora. In: Proceedings of Depling 2015, pp. 91–100 (2015)
Ganitkevitch, J., Cao, Y., Weese, J., Post, M., Callison-Burch, C.: Joshua 4.0: packing, PRO, and paraphrases. In: Proceedings of WMT 2012, pp. 283–291 (2012)
Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Proceedings of SETQA-NLP 2008, pp. 49–57 (2008)
Gray, R., Atkinson, Q.: Language tree divergences support the Anatolian theory of Indo-European origin. Nature 426, 435–439 (2003)
Greenberg, J.H.: Language in the Americas. Stanford University Press, Stanford (1987)
Johannsen, A., Hovy, D., Søgaard, A.: Cross-lingual syntactic variation over age and gender. In: Proceedings of CoNLL 2015, pp. 103–112 (2015)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit, pp. 79–86 (2005)
Kübler, S., McDonald, R., Nivre, J.: Dependency parsing. Synth. Lect. Hum. Lang. Technol. 1(1), 1–127 (2009)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710 (1965)
Liu, D., Gildea, D.: Syntactic features for evaluation of machine translation. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 25–32 (2005)
Longobardi, G., et al.: Across language families: genome diversity mirrors linguistic variation within Europe. Am. J. Phys. Anthropol. 157(4), 630–640 (2015)
Longobardi, G., Guardiano, C.: Evidence for syntax as a signal of historical relatedness. Lingua 119(11), 1679–1706 (2009)
Martins, A.F.T., Almeida, M.B., Smith, N.A.: Turning on the turbo: fast third-order non-projective turbo parsers. In: Proceedings of ACL 2013, Short Papers, vol. 2, pp. 617–622 (2013)
McMahon, A., McMahon, R.: Finding families: quantitative methods in language classification. Trans. Philol. Soc. 101(1), 7–55 (2003)
Nagata, R., Whittaker, E.: Reconstructing an Indo-European family tree from non-native English texts. In: Proceedings of ACL 2013, Long Papers, vol. 1, pp. 1137–1147 (2013)
Nerbonne, J., Wiersma, W.: A measure of aggregate syntactic distance. In: Proceedings of the Workshop on Linguistic Distances, pp. 82–90 (2006)
Nichols, J.: Linguistic Diversity in Space and Time. University of Chicago Press, Chicago (1992)
Nichols, J., Warnow, T.: Tutorial on computational linguistic phylogeny. Lang. Linguist. Compass 2(5), 760–820 (2008)
Niehues, J., Zhang, Y., Mediani, M., Herrmann, T., Cho, E., Waibel, A.: The Karlsruhe Institute of Technology translation systems for the WMT 2012. In: Proceedings of WMT 2012, pp. 349–355 (2012)
Nivre, J.: Towards a universal grammar for natural language processing. In: Gelbukh, A. (ed.) CICLing 2015. LNCS, vol. 9041, pp. 3–16. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18111-0_1
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Petrov, S., Das, D., McDonald, R.T.: A universal part-of-speech tagset. In: Proceedings of LREC 2012, pp. 2089–2096 (2012)
Schulte, K.: Loanwords in Romanian. In: Loanwords in the World’s Languages: A Comparative Handbook, pp. 230–259 (2009)
Vilar, D.: DFKI’s SMT system for WMT 2012. In: Proceedings of WMT 2012, pp. 382–387 (2012)
Zeman, D.: Data issues of the multilingual translation matrix. In: Proceedings of WMT 2012, pp. 395–400 (2012)
Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)
Acknowledgments
We thank the anonymous reviewers for their helpful and constructive comments. This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS/CCCDI UEFISCDI, project number PN-III-P2-2.1-53BG/2016, within PNCDI III.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ciobanu, A.M., Dinu, L.P., Sgarro, A. (2018). Towards a Map of the Syntactic Similarity of Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_44
DOI: https://doi.org/10.1007/978-3-319-77113-7_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77112-0
Online ISBN: 978-3-319-77113-7
eBook Packages: Computer Science (R0)