While global patterns of human genetic diversity are increasingly well characterized, the diversity of human languages remains less systematically described. Here, we outline the Grambank database. With over 400,000 data points and 2400 languages, Grambank is the largest comparative grammatical database available. The comprehensiveness of Grambank allows us to quantify the relative effects of genealogical inheritance and geographic proximity on the structural diversity of the world's languages, evaluate constraints on linguistic diversity, and identify the world's most unusual languages. An analysis of the consequences of language loss reveals that the reduction in diversity will be strikingly uneven across the major linguistic regions of the world. Without sustained efforts to document and revitalize endangered languages, our linguistic window into human history, cognition, and culture will be seriously fragmented.
Cite the source of the dataset as: Greenhill, Simon J. (2015): TransNewGuinea.org: An Online Database of New Guinea Languages. PLoS ONE 10.10: e0141563.
Cite the dataset as: List, Johann-Mattis; Forkel, Robert; Greenhill, Simon J.; Rzymski, Christoph; Englisch, Johannes; and Russell D. Gray (2021): Lexibank: A publicly available repository of standardized lexical datasets with automatically computed phonological and lexical features for more than 2000 language varieties [Dataset, Version 0.1]. Geneva: Zenodo. https://github.com/lexibank/lexibank-analysed
Codes used in ESM1-6 to define the diversity of ways in which plants are used and plant parts used. Glottolog codes used to identify language of vernacular names.
Viking-Age archaeobotanical evidence per plant species, including sites where samples were found, interpretation, and references to source material.
This repository (see Tutorial.md for getting started) contains the code and data for reproducing the analyses described in the chapter "Managing Historical Linguistic Data for Computational Phylogenetics and Computer-Assisted Language Comparison".
List, Johann Mattis & Rzymski, Christoph & Greenhill, Simon & Schweikhard, Nathanael & Pianykh, Kristina & Tjuka, Annika & Hundt, Carolin & Forkel, Robert (eds.) 2021. Concepticon v2.5.0. A Resource for the Linking of Concept Lists. Leipzig: Max Planck Institute for Evolutionary Anthropology. Available online at https://concepticon.clld.org
Cite as Johann-Mattis List, Cormac Anderson, Tiago Tresoldi, Christoph Rzymski, Simon Greenhill, and Robert Forkel (2018). Cross-Linguistic Transcription Systems (Version 1.1.1). Max Planck Institute for the Science of Human History: Jena. DOI: 10.5281/zenodo.1623511
This is the first release of our LingPy tutorial, accompanying the paper "Sequence Comparison in Computational Historical Linguistics" (List et al. 2018, Journal of Language Evolution, DOI: http://dx.doi.org/10.1093/jole/lzy006).
The Dravidian language family consists of about 80 varieties (Hammarström H. 2016 Glottolog 2.7) spoken by 220 million people across southern and central India and surrounding countries (Steever SB. 1998 In The Dravidian languages (ed. SB Steever), pp. 1–39: 1). Neither the geographical origin of the Dravidian language homeland nor its exact dispersal through time is known. The history of these languages is crucial for understanding prehistory in Eurasia, because despite their current restricted range, these languages played a significant role in influencing other language groups including Indo-Aryan (Indo-European) and Munda (Austroasiatic) speakers. Here, we report the results of a Bayesian phylogenetic analysis of cognate-coded lexical data, elicited first hand from native speakers, to investigate the subgrouping of the Dravidian language family, and provide dates for the major points of diversification. Our results indicate that the Dravidi...
List, Johann-Mattis & Cysouw, Michael & Greenhill, Simon & Forkel, Robert (eds.) 2018. Concepticon. A Resource for the Linking of Concept Lists. Jena: Max Planck Institute for the Science of Human History. Available online at http://concepticon.clld.org
Scholars have debated naturalistic theories of religion for thousands of years, but only recently have scientists begun to test predictions empirically. Existing databases contain few variables on religion, and are subject to Galton’s Problem because they do not sufficiently account for the non-independence of cultures or systematically differentiate the traditional states of cultures from their contemporary states. Here we present Pulotu: the first quantitative cross-cultural database purpose-built to test evolutionary hypotheses of supernatural beliefs and practices. The Pulotu database documents the remarkable diversity of the Austronesian family of cultures, which originated in Taiwan and spread west to Madagascar and east to Easter Island, a region covering over half the world’s longitude. The focus of Austronesian beliefs ranges from localised ancestral spirits to powerful creator gods. A wide range of practices also exists, such as headhunting, elaborate tattooing, and the const...
Cite the source dataset as Carling, Gerd (ed.) 2017. Diachronic Atlas of Comparative Linguistics Online. Lund: Lund University. (DOI/URL: https://diacl.ht.lu.se/). Accessed on: 2019-02-07.
Cite the source dataset as Tryon, D.T. and Hackman, B.D. 1983. Solomon Islands Languages: An internal classification. Canberra: Pacific Linguistics.
The amount of data from languages spoken all over the world is rapidly increasing. Traditional manual methods in historical linguistics need to face the challenges brought by this influx of data. Automatic approaches to word comparison could provide invaluable help in pre-analyzing data, which can later be enhanced by experts. In this way, computational approaches can take care of the repetitive and schematic tasks, leaving experts to concentrate on answering interesting questions. Here we test the potential of automatic methods to detect etymologically related words (cognates) in cross-linguistic data. Using a newly compiled database of expert cognate judgments across five different language families, we compare how well different automatic approaches distinguish related from unrelated words. Our results show that automatic methods can identify cognates with a very high degree of accuracy, reaching 89% for the best-performing method, Infomap. We identify the specific strengths and weaknesses of these different methods and point to major challenges for future approaches. Current automatic approaches for cognate detection, although not perfect, could become an important component of future research in historical linguistics.
This is the documentation for LingPy-2.6.1, the most recent release of the LingPy library for quantitative tasks in historical linguistics. This release adds few new features and instead aims to provide a very stable version of the algorithms.
The past decades have seen substantial growth in digital data on the world's languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, the majority of published datasets lack standardization, which makes their comparison difficult. Here, we present the first step to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that increase the FAIRness of linguistic data. We test the Lexibank workflow on a collection of 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.
The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and therefore not amenable for comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts, and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.
The Database of Cross-Linguistic Colexifications (CLICS) has established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. In its current form, it has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings, ranging from the underlying dataset, which still contains many errors, up to the limits of cross-linguistic colexification studies in general. Building on recent standardization efforts reflected in the Cross-Linguistic Data Formats initiative (CLDF) and novel approaches for fast, efficient, and reliable data aggregation, we have created a new database for cross-linguistic colexifications, which not only supersedes the original CLICS database in terms of coverage but also offers a much more principled procedure for the creation, curation and aggregation of datasets. The paper presents the new database and discusses its major features.
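As a rough illustration of what a colexification count involves, the sketch below tallies, for each pair of concepts, the number of languages in which a single word form expresses both. It is a minimal sketch and not the CLICS pipeline: the rows, language names, and forms are invented placeholders, and the function name `count_colexifications` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy wordlist (language, concept, form); not taken from CLICS or any real dataset.
rows = [
    ("LangA", "HAND", "taka"),
    ("LangA", "ARM", "taka"),
    ("LangA", "TREE", "mupi"),
    ("LangA", "WOOD", "mupi"),
    ("LangB", "HAND", "leno"),
    ("LangB", "ARM", "leno"),
    ("LangB", "TREE", "sori"),
    ("LangB", "WOOD", "kadu"),
]


def count_colexifications(rows):
    """Return {(concept_a, concept_b): number of languages colexifying the pair}."""
    # Step 1: collect the set of concepts expressed by each form within each language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in rows:
        concepts_by_form[(language, form)].add(concept)

    # Step 2: every pair of concepts sharing a form counts once per language.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


counts = count_colexifications(rows)
for pair, n_languages in sorted(counts.items()):
    print(pair, n_languages)
```

Counting languages rather than raw word forms keeps a single language with several synonymous forms from inflating a pair's weight; a real aggregation would additionally need normalized forms and concept identifiers across datasets.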
With increasing amounts of digitally available data from all over the world, manual annotation of cognates in multilingual word lists becomes more and more time-consuming in historical linguistics. Using available software packages to pre-process the data prior to manual analysis can drastically speed up the process of cognate detection. Furthermore, it allows us to get a quick overview of data which has not yet been intensively studied by experts. LingPy is a Python library which provides a large arsenal of routines for sequence comparison in historical linguistics. With LingPy, linguists can not only automatically search for cognates in lexical data, they can also align the automatically identified words and output them in various forms, which aim at facilitating manual inspection. In this tutorial, we will briefly introduce the basic concepts behind the algorithms employed by LingPy, and then illustrate in concrete workflows how automatic sequence comparison can be applied to multilingual word lists. The goal is to provide the readers with all information they need to (a) carry out cognate detection and alignment analyses in LingPy, (b) select the appropriate algorithms for the appropriate task, (c) evaluate how well automatic cognate detection algorithms perform compared to experts, and (d) export their data into various formats useful for additional analyses or data sharing. While basic knowledge of the Python language is useful for all analyses, our tutorial is structured in such a way that scholars with basic knowledge of computing can follow through all steps as well.
Language is one of the most complex of human traits. There are many hypotheses about how it originated, what factors shaped its diversity, and what ongoing processes drive how it changes. We present the Causal Hypotheses in Evolutionary Linguistics Database (CHIELD, https://chield.excd.org/), a tool for expressing, exploring, and evaluating hypotheses. It allows researchers to integrate multiple theories into a coherent narrative, helping to design future research. We present design goals, a formal specification, and an implementation for this database. Source code is freely available for other fields to take advantage of this tool. Some initial results are presented, including identifying conflicts in theories about gossip and ritual, comparing hypotheses relating population size and morphological complexity, and an author relation network.
**To download free**, follow the info at: https://iecor.clld.org
The origins of the Indo-European language family are hotly disputed. Bayesian phylogenetic analyses of core vocabulary have produced conflicting results, with some supporting a farming expansion out of Anatolia ~9000 years before present (yr B.P.), while others support a spread with horse-based pastoralism out of the Pontic-Caspian Steppe ~6000 yr B.P. Here we present an extensive database of Indo-European core vocabulary that eliminates past inconsistencies in cognate coding. Ancestry-enabled phylogenetic analysis of this dataset indicates that few ancient languages are direct ancestors of modern clades and produces a root age of ~8120 yr B.P. for the family. Although this date is not consistent with the Steppe hypothesis, it does not rule out an initial homeland south of the Caucasus, with a subsequent branch northward onto the steppe and then across Europe. We reconcile this hybrid hypothesis with recently published ancient DNA evidence from the steppe and the northern Fertile Crescent.