Simon J. Greenhill

Max Planck Institute for the Science of Human History, Linguistic and Cultural Evolution, Faculty Member

The Australian National University, School of Culture, History and Language, Post-Doc

Papers

Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss

by Simon J. Greenhill, India Pearey, and Daniel Prestipino

Science Advances, 2023

While global patterns of human genetic diversity are increasingly well characterized, the diversity of human languages remains less systematically described. Here, we outline the Grambank database. With over 400,000 data points and 2400 languages, Grambank is the largest comparative grammatical database available. The comprehensiveness of Grambank allows us to quantify the relative effects of genealogical inheritance and geographic proximity on the structural diversity of the world's languages, evaluate constraints on linguistic diversity, and identify the world's most unusual languages. An analysis of the consequences of language loss reveals that the reduction in diversity will be strikingly uneven across the major linguistic regions of the world. Without sustained efforts to document and revitalize endangered languages, our linguistic window into human history, cognition, and culture will be seriously fragmented.

Managing Historical Linguistic Data for Computational Phylogenetics and Computer-Assisted Language Comparison

The Open Handbook of Linguistic Data Management

CLDF dataset derived from Greenhill's "TransNewGuinea.org" from 2015

Cite the source of the dataset as: Greenhill, Simon J. (2015): TransNewGuinea.org: An Online Data...

Lexibank: A publicly available repository of standardized lexical datasets with automatically computed phonological and lexical features for more than 2000 language varieties

Cite the dataset as: List, Johann-Mattis; Forkel, Robert; Greenhill, Simon J.; Rzymski, Christoph; Englisch, Johannes; and Russell D. Gray (2021): Lexibank: A publicly available repository of standardized lexical datasets with automatically computed phonological and lexical features for more than 2000 language varieties [Dataset, Version 0.1]. Geneva: Zenodo. https://github.com/lexibank/lexibank-analysed

Ancestral state reconstruction summaries from Historical, archaeological and linguistic evidence test the phylogenetic inference of Viking-Age plant use

Probabilities of presence/absence of a trait at root nodes.

Cultural trait (plant-use) codes from Historical, archaeological and linguistic evidence test the phylogenetic inference of Viking-Age plant use

Codes used in ESM1-6 to define the diversity of ways in which plants are used and plant parts use...

Archaeobotanical data from Historical, archaeological and linguistic evidence test the phylogenetic inference of Viking-Age plant use

Viking-Age archaeobotanical evidence per plant species, including sites where samples were found,...

Supplementary code tutorial and data for "Managing Historical Linguistic Data for Computational Phylogenetics and Computer-Assisted Language Comparison"

This repository (see Tutorial.md for getting started) contains the code and data for reproducing ...

CLLD Concepticon 2.5.0

List, Johann Mattis & Rzymski, Christoph & Greenhill, Simon & Schweikhard, Nathanael & Pianykh, Kristina & Tjuka, Annika & Hundt, Carolin & Forkel, Robert (eds.) 2021. Concepticon v2.5.0. A Resource for the Linking of Concept Lists. Leipzig: Max Planck Institute for Evolutionary Anthropology. Available online at https://concepticon.clld.org

Treemaker: V1.0

A Python tool for constructing a Newick-formatted tree from a set of classifications.

Cross-Linguistic Transcription Systems

Cite as Johann-Mattis List, Cormac Anderson, Tiago Tresoldi, Christoph Rzymski, Simon Greenhill, ...

Lingpy Tutorial. Version 1.0

This is the first release of our LingPy tutorial, accompanying the paper "Sequence Compariso...

drav_cov_est_ucln_yule_no_burnin.trees.zip from A Bayesian phylogenetic study of the Dravidian language family

The Dravidian language family consists of about 80 varieties (Hammarström H. 2016 Glottolog 2.7) spoken by 220 million people across southern and central India and surrounding countries (Steever SB. 1998 In The Dravidian languages (ed. SB Steever), pp. 1–39: 1). Neither the geographical origin of the Dravidian language homeland nor its exact dispersal through time are known. The history of these languages is crucial for understanding prehistory in Eurasia, because despite their current restricted range, these languages played a significant role in influencing other language groups including Indo-Aryan (Indo-European) and Munda (Austroasiatic) speakers. Here, we report the results of a Bayesian phylogenetic analysis of cognate-coded lexical data, elicited first hand from native speakers, to investigate the subgrouping of the Dravidian language family, and provide dates for the major points of diversification. Our results indicate that the Dravidi...

Clld/Concepticon-Data: Clld Concepticon 1.1

List, Johann-Mattis & Cysouw, Michael & Greenhill, Simon & Forkel, Robert (eds.) 2018. Conceptico...

Coelho et al. DATA-ESM from Drivers of geographical patterns of North American language diversity

The data used to run the analysis

SHH-Dlce/Python-Nexus V2.0.0

python-nexus - Generic nexus (.nex, .trees) reader/writer for python

clics/clics3: CLICS3 pre-release

No description provided.

Pulotu: Database of Austronesian Supernatural Beliefs and Practices

Scholars have debated naturalistic theories of religion for thousands of years, but only recently have scientists begun to test predictions empirically. Existing databases contain few variables on religion, and are subject to Galton’s Problem because they do not sufficiently account for the non-independence of cultures or systematically differentiate the traditional states of cultures from their contemporary states. Here we present Pulotu: the first quantitative cross-cultural database purpose-built to test evolutionary hypotheses of supernatural beliefs and practices. The Pulotu database documents the remarkable diversity of the Austronesian family of cultures, which originated in Taiwan, spread west to Madagascar and east to Easter Island, a region covering over half the world’s longitude. The focus of Austronesian beliefs ranges from localised ancestral spirits to powerful creator gods. A wide range of practices also exists, such as headhunting, elaborate tattooing, and the const...

lexibank/diacl: Diachronic Atlas of Comparative Linguistics

Cite the source dataset as Carling, Gerd (ed.) 2017. Diachronic Atlas of Comparative Linguistics ...

lexibank/tryonsolomon: Tryon and Hackman's Solomon Islands Languages: An internal classification

Cite the source dataset as Tryon, D.T. and Hackman, B.D. 1983. Solomon Islands Languages: An inte...

The Potential of Automatic Word Comparison for Historical Linguistics

by Johann-Mattis List and Simon J. Greenhill

The amount of data from languages spoken all over the world is rapidly increasing. Traditional manual methods in historical linguistics need to face the challenges brought by this influx of data. Automatic approaches to word comparison could provide invaluable help to pre-analyze data which can be later enhanced by experts. In this way, computational approaches can take care of the repetitive and schematic tasks leaving experts to concentrate on answering interesting questions. Here we test the potential of automatic methods to detect etymologically related words (cognates) in cross-linguistic data. Using a newly compiled database of expert cognate judgments across five different language families, we compare how well different automatic approaches distinguish related from unrelated words. Our results show that automatic methods can identify cognates with a very high degree of accuracy, reaching 89% for the best-performing method Infomap. We identify the specific strengths and weaknesses of these different methods and point to major challenges for future approaches. Current automatic approaches for cognate detection, although not perfect, could become an important component of future research in historical linguistics.

LingPy Documentation (Version 2.6.1)

by Johann-Mattis List and Simon J. Greenhill

This is the documentation for LingPy-2.6.1, the most recent release of the LingPy library for qua...

Lexibank, a public repository of standardized wordlists with computed phonological and lexical features

by Johann-Mattis List and Simon J. Greenhill

Scientific Data, 2022

The past decades have seen substantial growth in digital data on the world's languages. At the same time, the demand for cross-linguistic datasets has been increasing, as witnessed by numerous studies devoted to diverse questions on human prehistory, cultural evolution, and human cognition. Unfortunately, the majority of published datasets lack standardization which makes their comparison difficult. Here, we present the first step to increase the comparability of cross-linguistic lexical data. We have designed workflows for the computer-assisted lifting of datasets to Cross-Linguistic Data Formats, a collection of standards that increase the FAIRness of linguistic data. We test the Lexibank workflow on a collection of 100 lexical datasets from which we derive an aggregated database of wordlists in unified phonetic transcriptions covering more than 2000 language varieties. We illustrate the benefits of our approach by showing how phonological and lexical features can be automatically inferred, complementing and expanding existing cross-linguistic datasets.

Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics

by Johann-Mattis List, Simon J. Greenhill, Martin Haspelmath, and Anna Siewierska

Scientific Data, 2018

The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and therefore not amenable for comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts, and dictionaries). The new specification for cross-linguistic data formats comes along with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.

An Improved Database of Cross-Linguistic Colexifications Assembling Lexical Data with Help of Cross-Linguistic Data Formats

by Johann-Mattis List, Simon J. Greenhill, and Cormac Anderson

Linguistic Typology, 2018

The Database of Cross-Linguistic Colexifications (CLICS) has established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. In its current form, it has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change, patterns of conceptualization, and linguistic paleontology. But CLICS has also been criticized for obvious shortcomings, ranging from the underlying dataset, which still contains many errors, up to the limits of cross-linguistic colexification studies in general. Building on recent standardization efforts reflected in the Cross-Linguistic Data Formats initiative (CLDF) and novel approaches for fast, efficient, and reliable data aggregation, we have created a new database for cross-linguistic colexifications, which not only supersedes the original CLICS database in terms of coverage but also offers a much more principled procedure for the creation, curation and aggregation of datasets. The paper presents the new database and discusses its major features.

Sequence Comparison in Computational Historical Linguistics Phonetic Alignments and Cognate Detection with LingPy 2.6

by Johann-Mattis List, Mary Walworth, Simon J. Greenhill, and Tiago Tresoldi

Journal of Language Evolution, 2018

With increasing amounts of digitally available data from all over the world, manual annotation of cognates in multilingual word lists becomes more and more time-consuming in historical linguistics. Using available software packages to pre-process the data prior to manual analysis can drastically speed up the process of cognate detection. Furthermore, it allows us to get a quick overview on data which has not yet been intensively studied by experts. LingPy is a Python library which provides a large arsenal of routines for sequence comparison in historical linguistics. With LingPy, linguists can not only automatically search for cognates in lexical data, they can also align the automatically identified words, and output them in various forms, which aim at facilitating manual inspection. In this tutorial, we will briefly introduce the basic concepts behind the algorithms employed by LingPy, and then illustrate in concrete workflows how automatic sequence comparison can be applied to multilingual word lists. The goal is to provide the readers with all information they need to (a) carry out cognate detection and alignment analyses in LingPy, (b) select the appropriate algorithms for the appropriate task, (c) evaluate how well automatic cognate detection algorithms perform compared to experts, and (d) export their data into various formats useful for additional analyses or data sharing. While basic knowledge of the Python language is useful for all analyses, our tutorial is structured in such a way that scholars with basic knowledge of computing can follow through all steps as well.

Toward a Comprehensive Subgrouping of Vanuatu Languages

by Mary Walworth and Simon J. Greenhill

Vanuatu Languages Conference, 2018

CHIELD: the causal hypotheses in evolutionary linguistics database

by Ruth Singer, Simon J. Greenhill, Guillaume Jacques, Kit (Christopher) Opie, and Anton Killin

Journal of Language Evolution, 2020

Language is one of the most complex of human traits. There are many hypotheses about how it originated, what factors shaped its diversity, and what ongoing processes drive how it changes. We present the Causal Hypotheses in Evolutionary Linguistics Database (CHIELD, https://chield.excd.org/), a tool for expressing, exploring, and evaluating hypotheses. It allows researchers to integrate multiple theories into a coherent narrative, helping to design future research. We present design goals, a formal specification, and an implementation for this database. Source code is freely available for other fields to take advantage of this tool. Some initial results are presented, including identifying conflicts in theories about gossip and ritual, comparing hypotheses relating population size and morphological complexity, and an author relation network.

Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages

by Paul Heggarty, Cormac Anderson, Matthew Scarborough, Benedict King, Remco Bouckaert, Lechosław Jocz, Martin Joachim Kümmel, Erik Anonby, Matthew Boutilier, Cassandra Freiberg, Robert Tegethoff, Kim Schulte, Ganesh Kumar Gupta, Simon J. Greenhill, and Russell Gray

Science, 2023

**To download free**, follow the info at: https://iecor.clld.org

The origins of the Indo-European language family are hotly disputed. Bayesian phylogenetic analyses of core vocabulary have produced conflicting results, with some supporting a farming expansion out of Anatolia ~9000 years before present (yr B.P.), while others support a spread with horse-based pastoralism out of the Pontic-Caspian Steppe ~6000 yr B.P. Here we present an extensive database of Indo-European core vocabulary that eliminates past inconsistencies in cognate coding. Ancestry-enabled phylogenetic analysis of this dataset indicates that few ancient languages are direct ancestors of modern clades and produces a root age of ~8120 yr B.P. for the family. Although this date is not consistent with the Steppe hypothesis, it does not rule out an initial homeland south of the Caucasus, with a subsequent branch northward onto the steppe and then across Europe. We reconcile this hybrid hypothesis with recently published ancient DNA evidence from the steppe and the northern Fertile Crescent.
