Towards a Map of the Syntactic Similarity of Languages

Alina Maria Ciobanu,
Liviu P. Dinu &
Andrea Sgarro

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10761)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

In this paper we propose a computational method for determining the syntactic similarity between languages. We investigate multiple approaches and metrics, showing that the results are consistent across methods. We report results on 16 languages belonging to various language families. The analysis can be applied to any language for which the necessary resources are available.

Notes

1. The authors also present a brief history of the syntactic approaches and acknowledge the work of [30], a pioneer in this field and a forerunner of their approach.
2. We use dependency parsing [21], so we rely on dependency trees to compute the syntactic similarity.
3. Tagging and parsing accuracy for each language (tagging, parsing): Bg: 0.96, 0.87; Cs: 0.98, 0.82; Da: 0.95, 0.82; De: 0.92, 0.67; El: 0.96, 0.82; En: 0.94, 0.85; Es: 0.95, 0.59; Et: 0.94, 0.80; Fi: 0.93, 0.75; Fr: 0.96, 0.61; Hu: 0.92, 0.79; It: 0.97, 0.47; Nl: 0.89, 0.77; Pt: 0.96, 0.83; Ro: 0.95, 0.82; Sv: 0.95, 0.84.
4. We believe that our investigation is not negatively influenced by the choice of corpus because we are consistent across all experiments in terms of text genre and we report results obtained solely by comparison between languages on the same dataset. In future work, we intend to apply the proposed methods to other datasets as well (for example, the EUR-Lex Corpus [2]).
5. While the effect of translation cannot be denied, we rely on the fact that the interpreters/translators of Europarl are native speakers of the target language, which significantly reduces the impact of the source language on translations (as opposed to translations performed by language learners, for example).
6. We repeated Exp. #1 and #2 using the rank distance [7] instead of the edit distance, and there were no significant differences in the results.
7. The development of Romanian far from the large Romance core made contact between Romanian and the other Romance languages difficult until the 18th century. Instead, from the 9th to the 17th century there was a significant cultural influence of the South Slavic languages (especially Old Slavic), due in part to the exclusive use of Old Church Slavonic for religious purposes, which led to giving South Slavic “the status of a cultural superstrate language” [36].
8. www.bing.com/translator.
9. www.translate.google.com.
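Note 6 contrasts the edit distance with the rank distance. As a rough illustration of how the two differ, the sketch below computes both on sequences of symbols such as part-of-speech tags. It is only a minimal sketch under stated assumptions: how the paper applies these distances in Exp. #1 and #2 is not described on this page, the function names are hypothetical, and the rank distance shown is one common formulation for strings (symbols indexed by occurrence, positions counted from the left), which may differ from the variant used by the authors.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def _positions(seq: Sequence[Hashable]) -> Dict[Tuple[Hashable, int], int]:
    """Map each (symbol, occurrence number) pair to its 1-based position."""
    seen: Counter = Counter()
    pos: Dict[Tuple[Hashable, int], int] = {}
    for i, s in enumerate(seq, start=1):
        seen[s] += 1
        pos[(s, seen[s])] = i
    return pos


def rank_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Assumed formulation: sum of position differences for shared indexed
    symbols, plus the position of every symbol present in only one sequence."""
    pa, pb = _positions(a), _positions(b)
    # A missing symbol contributes its full position (default 0 on the other side).
    return sum(abs(pa.get(k, 0) - pb.get(k, 0)) for k in pa.keys() | pb.keys())


if __name__ == "__main__":
    s1 = ["DET", "NOUN", "VERB"]          # hypothetical tag sequences
    s2 = ["DET", "ADJ", "NOUN", "VERB"]
    print(edit_distance(s1, s2), rank_distance(s1, s2))
```

The edit distance counts the single inserted tag once, whereas this rank distance also charges the shift in position of every tag that follows the insertion, so the two measures weigh the same difference differently.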

References

1. Augsten, N., Böhlen, M., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: Proceedings of VLDB 2005, pp. 301–312 (2005)
2. Baisa, V., Michelfeit, J., Medved, M., Jakubícek, M.: European union language resources in sketch engine. In: Proceedings of LREC 2016, pp. 2799–2803 (2016)
3. Bortolussi, L., Sgarro, A., Dinu, L.P.: Measures of fuzzy disarray in linguistic typology. In: Proceedings of IPMU 2008, pp. 167–172 (2008)
4. Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., Specia, L.: Findings of the 2012 workshop on statistical machine translation. In: Proceedings of WMT 2012, pp. 10–51 (2012)
5. Charniak, E., Knight, K., Yamada, K.: Syntax-based language models for statistical machine translation. In: Proceedings of the 9th Machine Translation Summit (2003)
6. Ciobanu, A.M., Dinu, L.P.: An etymological approach to cross-language orthographic similarity. Application on Romanian. In: Proceedings of EMNLP 2014, pp. 1047–1058 (2014)
7. Dinu, A., Dinu, L.P.: On the syllabic similarities of romance languages. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 785–788. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30586-6_88
8. Dryer, M.S.: Chapter 81: Order of subject, object, and verb. In: The World Atlas of Language Structures, pp. 330–333 (2005)
9. Duma, M., Vertan, C., Menzel, W.: A new syntactic metric for evaluation of machine translation. In: Proceedings of the ACL Student Research Workshop, pp. 130–135 (2013)
10. Dunn, M., Greenhill, S., Levinson, S., Gray, R.: Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473(7345), 79–82 (2011)
11. Dunn, M., Terrill, A., Reesink, G., Foley, R., Levinson, S.: Structural phylogenetics and the reconstruction of ancient language history. Science 309(5743), 2072–2075 (2005)
12. Eger, S., Hoenen, A., Mehler, A.: Language classification from bilingual word embedding graphs. In: Proceedings of COLING 2016, Technical Papers, pp. 3507–3518 (2016)
13. Eger, S., Schenk, N., Mehler, A.: Towards semantic language classification: inducing and clustering semantic association networks from Europarl. In: Proceedings of *SEM 2015, pp. 127–136 (2015)
14. Futrell, R., Mahowald, K., Gibson, E.: Quantifying word order freedom in dependency corpora. In: Proceedings of Depling 2015, pp. 91–100 (2015)
15. Ganitkevitch, J., Cao, Y., Weese, J., Post, M., Callison-Burch, C.: Joshua 4.0: packing, PRO, and paraphrases. In: Proceedings of WMT 2012, pp. 283–291 (2012)
16. Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Proceedings of SETQA-NLP 2008, pp. 49–57 (2008)
17. Gray, R., Atkinson, Q.: Language tree divergences support the Anatolian theory of Indo-European origin. Nature 426, 435–439 (2003)
18. Greenberg, J.H.: Language in the Americas. Stanford University Press, Stanford (1987)
19. Johannsen, A., Hovy, D., Søgaard, A.: Cross-lingual syntactic variation over age and gender. In: Proceedings of CoNLL 2015, pp. 103–112 (2015)
20. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit, pp. 79–86 (2005)
21. Kübler, S., McDonald, R., Nivre, J.: Dependency parsing. Synth. Lect. Hum. Lang. Technol. 1(1), 1–127 (2009)
22. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710 (1965)
23. Liu, D., Gildea, D.: Syntactic features for evaluation of machine translation. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 25–32 (2005)
24. Longobardi, G., et al.: Across language families: genome diversity mirrors linguistic variation within Europe. Am. J. Phys. Anthropol. 157(4), 630–640 (2015)
25. Longobardi, G., Guardiano, C.: Evidence for syntax as a signal of historical relatedness. Lingua 119(11), 1679–1706 (2009)
26. Martins, A.F.T., Almeida, M.B., Smith, N.A.: Turning on the turbo: fast third-order non-projective turbo parsers. In: Proceedings of ACL 2013, Short Papers, vol. 2, pp. 617–622 (2013)
27. McMahon, A., McMahon, R.: Finding families: quantitative methods in language classification. Trans. Philol. Soc. 101(1), 7–55 (2003)
28. Nagata, R., Whittaker, E.: Reconstructing an Indo-European family tree from non-native English texts. In: Proceedings of ACL 2013, Long Papers, vol. 1, pp. 1137–1147 (2013)
29. Nerbonne, J., Wiersma, W.: A measure of aggregate syntactic distance. In: Proceedings of the Workshop on Linguistic Distances, pp. 82–90 (2006)
30. Nichols, J.: Linguistic Diversity in Space and Time. University of Chicago Press, Chicago (1992)
31. Nichols, J., Warnow, T.: Tutorial on computational linguistic phylogeny. Lang. Linguist. Compass 2(5), 760–820 (2008)
32. Niehues, J., Zhang, Y., Mediani, M., Herrmann, T., Cho, E., Waibel, A.: The Karlsruhe Institute of Technology translation systems for the WMT 2012. In: Proceedings of WMT 2012, pp. 349–355 (2012)
33. Nivre, J.: Towards a universal grammar for natural language processing. In: Gelbukh, A. (ed.) CICLing 2015. LNCS, vol. 9041, pp. 3–16. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18111-0_1
34. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
35. Petrov, S., Das, D., McDonald, R.T.: A universal part-of-speech tagset. In: Proceedings of LREC 2012, pp. 2089–2096 (2012)
36. Schulte, K.: Loanwords in Romanian. In: Loanwords in the World’s Languages: A Comparative Handbook, pp. 230–259 (2009)
37. Vilar, D.: DFKI’s SMT system for WMT 2012. In: Proceedings of WMT 2012, pp. 382–387 (2012)
38. Zeman, D.: Data issues of the multilingual translation matrix. In: Proceedings of WMT 2012, pp. 395–400 (2012)
39. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)

Acknowledgments

We thank the anonymous reviewers for their helpful and constructive comments. This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS/CCCDI UEFISCDI, project number PN-III-P2-2.1-53BG/2016, within PNCDI III.

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
Alina Maria Ciobanu & Liviu P. Dinu
Human Language Technologies Research Center, University of Bucharest, Bucharest, Romania
Alina Maria Ciobanu, Liviu P. Dinu & Andrea Sgarro
Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy
Andrea Sgarro

Corresponding author

Correspondence to Liviu P. Dinu.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Ciobanu, A.M., Dinu, L.P., Sgarro, A. (2018). Towards a Map of the Syntactic Similarity of Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_44

DOI: https://doi.org/10.1007/978-3-319-77113-7_44
Published: 10 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77112-0
Online ISBN: 978-3-319-77113-7
eBook Packages: Computer Science, Computer Science (R0)
