This page pertains to UD version 2.

Universal Dependencies

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 600 contributors producing over 200 treebanks in over 150 languages (see the bottom of this page for updated numbers from the latest release). If you are new to UD, you should start by reading the first part of the Short Introduction and then browsing the annotation guidelines.
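
As a rough illustration of how these three annotation layers sit side by side, the sketch below reads one sentence in CoNLL-U, the tab-separated format in which UD treebanks are distributed. The sample sentence and its annotation are hand-written for illustration and are not taken from any treebank; `parse_sentence` and `parse_feats` are hypothetical helper names, not part of any UD tool.

```python
# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hypothetical, hand-written sample (an assumption, not real treebank data).
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\tMood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"
    "\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])


def parse_feats(feats):
    """Turn 'Number=Plur|Person=3' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_sentence(block):
    """Return the syntactic words of one CoNLL-U sentence as dicts."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        tok = dict(zip(FIELDS, cols))
        if not tok["id"].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tok["id"] = int(tok["id"])
        tok["head"] = int(tok["head"])  # 0 marks the root of the tree
        tok["feats"] = parse_feats(tok["feats"])
        tokens.append(tok)
    return tokens


if __name__ == "__main__":
    for tok in parse_sentence(SAMPLE):
        # part of speech, morphological features, dependency relation and head
        print(tok["form"], tok["upos"], tok["feats"], tok["deprel"], tok["head"])
```

Each token line carries all three layers at once: `upos` holds the part of speech, `feats` the morphological features, and `head` plus `deprel` the syntactic dependency.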

Short introduction to UD
UD annotation guidelines
More information on UD:
Query UD treebanks online:
- PML Tree Query maintained by Charles University in Prague
- TEITOK maintained by Charles University in Prague
- Grew-match maintained by Inria in Nancy
- INESS maintained by the University of Bergen
Download UD treebanks

If you want to receive news about Universal Dependencies, you can subscribe to the UD mailing list. If you want to discuss individual annotation questions, use the GitHub issue tracker.

Current UD Languages

Information about language families (and genera for families with multiple branches) is mostly taken from WALS Online (IE = Indo-European).

Abaza 1 <1K Northwest Caucasian

Abaza treebanks

ATB <1K

UD_Abaza-ATB is a treebank based on the [Spoken corpus of Abaza](http://lingconlab.ru/spoken_abaza/).

Contributors: Alexey Koshevoy, Anastasia Panova, Ilya Makarchuk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Abkhaz 1 6K Northwest Caucasian

Abkhaz treebanks

AbNC 6K

UD_Abkhaz-AbNC is a treebank based on texts from the Abkhaz National Corpus, [AbNC](https://clarino.uib.no/abnc).

Contributors: Paul Meurer
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Afrikaans 1 49K IE, Germanic

Afrikaans treebanks

AfriBooms 49K

UD Afrikaans-AfriBooms is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.

Contributors: Peter Dirix, Liesbeth Augustinus, Daniel van Niekerk
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Akkadian 2 25K Afro-Asiatic, Semitic

Akkadian treebanks

RIAO 23K

162 royal inscriptions of four early Neo-Assyrian kings.

Contributors: Mikko Luukko, Aleksi Sahala, Sam Hardwick, Krister Lindén
Repository master dev
README
Treebank hub page
Download

PISANDUB 1K

A small set of sentences from Babylonian royal inscriptions.

Contributors: Kamil Kopacewicz
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Akuntsu 1 1K Tupian, Tupari

Akuntsu treebanks

TuDeT 1K

UD_Akuntsu-TuDeT is a collection of annotated sentences in [Akuntsú](https://glottolog.org/resource/languoid/id/akun1241). The sentences stem from the grammatical description by Aragon (2014) and Aragon's field work. Sentence annotation and documentation by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos.

Contributors: Carolina Aragon, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Albanian 2 4K IE, Albanian

Albanian treebanks

TSA <1K

The UD Albanian Treebank is a small treebank for Standard Albanian, developed within a project framework at Uppsala University. The data was extracted from Wikipedia.

Contributors: Marsida Toska
Repository master dev
README
Treebank hub page
Download

STAF 3K

The UD-Albanian-STAF (Saarbruecken Treebank of Albanian Fiction) is a treebank of the Albanian language, comprising 202 randomly selected sentences from six fictional books published between 1963 and 2004.

Contributors: Luigi Talamo, Edita Luftiu, Nelda Kote, Rozana Rushiti, Anila Cepani
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Albanian treebanks.

Language documentation

See the language documentation page.

Amharic 1 10K Afro-Asiatic, Semitic

Amharic treebanks

ATT 10K

UD_Amharic-ATT is a manually developed treebank for Amharic. Sentences were collected from grammar books, fiction, biographies, religious texts and news.

Contributors: Binyam Ephrem, Gashaw Arutie, Tsegay Woldemariam, Juan Ignacio Navarro Horñiacek
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Ancient Greek 3 456K IE, Greek

Ancient Greek treebanks

PTNK 39K

UD Ancient Greek PTNK contains portions of the Septuagint according to the Codex Alexandrinus.

Contributors: Daniel Swanson
Repository master dev
README
Treebank hub page
Download

PROIEL 214K

UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Perseus 202K

This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

Contributors: Giuseppe G. A. Celano, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ancient Greek treebanks.

Language documentation

See the language documentation page.

Ancient Hebrew 1 39K Afro-Asiatic, Semitic

Ancient Hebrew treebanks

PTNK 39K

UD Ancient Hebrew PTNK contains portions of the Biblia Hebraica Stuttgartensia with morphological annotations from [ETCBC](https://github.com/etcbc/bhsa).

Contributors: Daniel Swanson
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Apurina 1 <1K Arawakan

Apurina treebanks

UFPA <1K

This is an Apurinã treebank consisting of sentences from a grammatical description of the language by Marília Fernanda.

Contributors: Marília Fernanda, Sidney Facundes, Bruna Lima Padovani, Jack Rueter, Niko Partanen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Arabic 3 1,042K Afro-Asiatic, Semitic

Arabic treebanks

PUD 20K

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Luma Ateyah, Martin Popel, Daniel Zeman, Nizar Habash, Dima Taji
Repository master dev
README
Treebank hub page
Download

NYUAD 738K

The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.

Contributors: Nizar Habash, Dima Taji
Repository master dev
README
Treebank hub page
Download

PADT 282K

The Arabic-PADT UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at Charles University in Prague.

Contributors: Daniel Zeman, Zdeněk Žabokrtský, Shadi Saleh
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Arabic treebanks.

Language documentation

See the language documentation page.

Armenian 2 94K IE, Armenian

Armenian treebanks

BSUT 41K

A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the V. Brusov State University in Yerevan.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

ArmTDP 52K

A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Armenian treebanks.

Language documentation

See the language documentation page.

Assyrian 1 <1K Afro-Asiatic, Semitic

Assyrian treebanks

AS <1K

The Uppsala Assyrian Treebank is a small treebank for Modern Standard Assyrian. The corpus is collected and annotated manually. The data was randomly collected from different textbooks and a short translation of The Merchant of Venice.

Contributors: Mary Yako
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Azerbaijani 1 <1K Turkic, Southwestern

Azerbaijani treebanks

TueCL <1K

This is a small treebank of grammatical examples for Azerbaijani. The treebank tries to be neutral about the particular variety (North or South Azerbaijani) and hence uses the ISO code for the macrolanguage (`az`).

Contributors: Soudabeh Eslami, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bambara 1 13K Mande

Bambara treebanks

CRB 13K

The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.

Contributors: Katya Aplonova, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Basque 1 121K Basque

Basque treebanks

BDT 121K

The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.

Contributors: Maria Jesus Aranzabe, Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Iakes Goenaga, Koldo Gojenola, Larraitz Uria
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bavarian 1 15K IE, Germanic

Bavarian treebanks

MaiBaam 15K

MaiBaam is manually annotated with part-of-speech tags, syntactic dependencies, and German lemmas. The treebank encompasses diverse text genres (wiki articles and discussions, grammar examples, fiction, and commands for virtual assistants) and dialects from the North, Central and South Bavarian areas as well as the dialectal transition areas in between.

Contributors: Verena Blaschke, Barbara Kovačić, Siyao Peng, Miriam Winkler, Barbara Plank
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Beja 1 11K Afro-Asiatic, Cushitic

Beja treebanks

Autogramm 11K

A Universal Dependencies corpus for Beja, a language of the North-Cushitic branch of the Afro-Asiatic phylum, mainly spoken in Sudan, Egypt and Eritrea.

Contributors: Martine Vanhove, Rayan Ziane, Sylvain Kahane, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Belarusian 1 305K IE, Slavic

Belarusian treebanks

HSE 305K

The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus, online search available at: http://ruscorpora.ru/search-para-be.html.

Contributors: Olga Lyashevskaya, Angelika Peljak-Łapińska, Daria Petrova, Yana Shishkina
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bengali 1 <1K IE, Indic

Bengali treebanks

BRU <1K

The BRU Bengali treebank has been created at Begum Rokeya University, Rangpur, by the members of Semantics Lab.

Contributors: Siratun Jannat, Mizanur Rahoman, Shafi Sourov, Jannatul Ferdaousi, Syeda Shahzadi, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bhojpuri 1 6K IE, Indic

Bhojpuri treebanks

BHTB 6K

The [Bhojpuri](https://en.wikipedia.org/wiki/Bhojpuri_language) UD Treebank (BHTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

Contributors: Atul Kr. Ojha, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bororo 1 6K Bororoan

Bororo treebanks

BDT 6K

UD_Bororo-BDT is a compilation of annotated sentences in [Bororo](https://glottolog.org/resource/languoid/id/boro1282). The corpus encompasses sentences derived from diverse sources: grammar examples, mythological narratives, fieldwork material, and other sources. Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Breton 1 10K IE, Celtic

Breton treebanks

KEB 10K

UD Breton-KEB is a treebank of Breton that has been manually annotated according to the Universal Dependencies guidelines. The tokenisation guidelines and morphological annotation come from a finite-state morphological analyser of Breton released as part of the [Apertium project](http://www.apertium.org).

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bulgarian 1 156K IE, Slavic

Bulgarian treebanks

BTB 156K

UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences). All the texts were processed automatically at the tokenization, morphological and chunk levels. Then, the full syntactic analysis was performed manually by trained annotators.

Contributors: Kiril Simov, Petya Osenova, Martin Popel
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Buryat 1 10K Mongolic

Buryat treebanks

BDT 10K

The UD Buryat treebank was annotated manually natively in UD and contains grammar book sentences, along with news and some fiction.

Contributors: Elena Badmaeva, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cantonese 1 13K Sino-Tibetan, Chinese

Cantonese treebanks

HK 13K

A Cantonese treebank (in Traditional Chinese characters) of film subtitles and of legislative proceedings of Hong Kong, parallel with the Chinese-HK treebank.

Contributors: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cappadocian 2 4K IE, Greek

Cappadocian treebanks

TueCL 4K

This is a treebank of Pharasiot, a critically endangered Greek dialect originally spoken near Cappadocia. The source material is fairy tales collected during field study.

Contributors: Eleni Vligouridou, Inessa Iliadou, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

AMGiC <1K

The "Asia Minor Greek in Contact" treebank (AMGiC, UD_AMGiC) is compiled from sentences entailing contact-induced morphosyntactic phenomena (CIMSP) that are a result of the contact between Greek and Turkish varieties in Anatolia and in adjacent regions. The sentences are traced in Asia Minor Greek (AMG) dialectal sources. In addition to the UD analysis, the AMGiC treebank provides information concerning the sociolinguistic context within which CIMSP arise.

Contributors: Konstantinos Sampanis, Prokopis Prokopidis, Furkan Akkurt
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Cappadocian treebanks.

Language documentation

See the language documentation page.

Catalan 1 553K IE, Romance

Catalan treebanks

AnCora 553K

Catalan data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

Contributors: Héctor Martínez Alonso, Elena Pascual, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cebuano 1 1K Austronesian, Central Philippine

Cebuano treebanks

GJA 1K

UD_Cebuano_GJA is a collection of annotated Cebuano sample sentences randomly taken from three different sources: community-contributed samples from the website Tatoeba, a Cebuano grammar book by Bunye & Yap (1971) and Tanangkinsing's reference grammar on Cebuano (2011). This project is currently a work in progress.

Contributors: Glyd Aranes
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Chinese 7 309K Sino-Tibetan, Chinese

Chinese treebanks

Beginner 19K

A treebank of Chinese sentences adapted for learners at levels A1 to C1 (HSK1 to 5), collected on the [Chinese Grammar Wiki](https://resources.allsetlearning.com/chinese/grammar/) (CC BY-NC-SA 3.0 License) website. The treebank was manually annotated by researchers of Paris Nanterre University (Modyco) in the mSUD annotation schema (morpheme level Surface Universal Dependencies).

Contributors: Kirian Guiller, Yidi Huang, Yixuan Li, Qishen Wu, Bruno Guillaume, Sylvain Kahane, Kim Gerdes
Repository master dev
README
Treebank hub page
Download

PUD 21K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Josie Li, Cheuk Ying Li, Martin Popel, Daniel Zeman, Herman Leung
Repository master dev
README
Treebank hub page
Download

HK 9K

A Traditional Chinese treebank of film subtitles and of legislative proceedings of Hong Kong, parallel with the Cantonese-HK treebank.

Contributors: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong
Repository master dev
README
Treebank hub page
Download

CFL 7K

The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

Contributors: John Lee, Herman Leung, Keying Li
Repository master dev
README
Treebank hub page
Download

PatentChar 4K

A treebank of Chinese patent application texts collected from the Chinese patent office's website CNIPA. The sentences are randomly selected from the patent claims of the IPC section "G" from November 2017 to September 2018.

Contributors: Yixuan Li, Kim Gerdes, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

GSDSimp 123K

Simplified Chinese Universal Dependencies dataset converted from the GSD (traditional) dataset with manual corrections.

Contributors: Peng Qi, Koichi Yasuoka
Repository master dev
README
Treebank hub page
Download

GSD 123K

Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.

Contributors: Mo Shen, Ryan McDonald, Daniel Zeman, Peng Qi
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Chinese treebanks.

Language documentation

See the language documentation page.

Chukchi 1 6K Chukotko-Kamchatkan

Chukchi treebanks

HSE 6K

This data is a manual annotation of the multimedia annotated corpus of the [Chuklang](http://chuklang.ru/) project, a dialectal corpus of the Amguema variant of Chukchi.

Contributors: Francis Tyers, Karina Mischenkova
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Classical Armenian 1 88K IE, Armenian

Classical Armenian treebanks

CAVaL 88K

The present release includes the Classical Armenian translation of the Gospels and the first ten chapters of the "History of the Armenians" by Movses Khorenatsi. The annotation of the Gospels results from a rule-based conversion from the PROIEL annotation, manually corrected and extended with additional information. The annotation of the "History of the Armenians" has been performed by a UDPipe2 annotator and manually corrected.

Contributors: Petr Kocharov, Lilit Kharatyan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Classical Chinese 2 433K Sino-Tibetan, Chinese

Classical Chinese treebanks

Kyoto 433K

Classical Chinese Universal Dependencies Treebank annotated and converted by the Institute for Research in Humanities, Kyoto University.

Contributors: Koichi Yasuoka, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, Yuan Li, Hiroyuki Shirasu, Kazunori Fujita
Repository master dev
README
Treebank hub page
Download

TueCL <1K

A dependency treebank of "逍遥游 (Enjoyment in Untroubled Ease)" written by Zhuangzi.

Contributors: Yifei Chen, John Wang, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Classical Chinese treebanks.

Language documentation

See the language documentation page.

Coptic 1 57K Afro-Asiatic, Egyptian

Coptic treebanks

Scriptorium 57K

UD Coptic contains manually annotated Sahidic Coptic texts, including Biblical texts, sermons, letters, and hagiography.

Contributors: Mitchell Abrams, Elizabeth Davidson, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Croatian 1 199K IE, Slavic

Croatian treebanks

SET 199K

The Croatian UD treebank is based on the extension of the SETimes-HR corpus, the [hr500k](http://hdl.handle.net/11356/1183) corpus.

Contributors: Željko Agić, Nikola Ljubešić, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Czech 6 2,252K IE, Slavic

Czech treebanks

CAC 495K

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

Contributors: Barbora Hladká, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

FicTree 167K

FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.

Contributors: Tomáš Jelínek, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

CLTT 36K

The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 2.0, created at Charles University in Prague.

Contributors: Barbora Hladká, Daniel Zeman, Martin Popel
Repository master dev
README
Treebank hub page
Download

PUD 18K

Contributors: Václava Kettnerová, Jan Hajič jr., Silvie Cinková, Zdeňka Urešová, Milan Straka, Jan Hajič, Jaroslava Hlaváčová, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Poetry 6K

UD_Czech-Poetry contains random samples of Czech 19th-century poetry from the Corpus of Czech Verse parsed with UDPipe2 (trained on UD Czech-PDT 2.11) and manually corrected.

Contributors: Silvie Cinková
Repository master dev
README
Treebank hub page
Download

PDT 1,529K

The Czech-PDT UD treebank is based on the Prague Dependency Treebank – Consolidated 1.0 (PDT-C), created at Charles University in Prague.

Contributors: Daniel Zeman, Jan Hajič
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Danish 1 100K IE, Germanic

Danish treebanks

DDT 100K

The Danish UD treebank is a conversion of the Danish Dependency Treebank.

Contributors: Anders Johannsen, Héctor Martínez Alonso, Barbara Plank
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Dutch 2 506K IE, Germanic

Dutch treebanks

LassySmall 297K

This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.

Contributors: Gosse Bouma, Gertjan van Noord
Repository master dev
README
Treebank hub page
Download

Alpino 208K

This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

Contributors: Daniel Zeman, Zdeněk Žabokrtský, Gosse Bouma, Gertjan van Noord
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Dutch treebanks.

Language documentation

See the language documentation page.

Egyptian 1 14K Afro-Asiatic, Egyptian

Egyptian treebanks

UJaen 14K

Egyptian-UJaen is the first dependency treebank created for the morphosyntactic annotation of pre-Coptic Egyptian. Its current state (UD v2.15) consists of 1,573 sentences and 14,650 words manually annotated from texts written in Old Egyptian, mainly from the Pyramid Texts.

Contributors: Roberto Antonio Díaz Hernández
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

English 11 760K IE, Germanic

English treebanks

GUM 212K

Universal Dependencies syntax annotations from the GUM corpus (https://gucorpling.org/gum/)

Contributors: Siyao Peng, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

EWT 254K

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).

Contributors: Natalia Silveira, Timothy Dozat, Christopher Manning, Sebastian Schuster, Ethan Chi, John Bauer, Miriam Connor, Marie-Catherine de Marneffe, Nathan Schneider, Sam Bowman, Hanzhi Zhu, Daniel Galbraith
Repository master dev
README
Treebank hub page
Download

Atis 61K

UD Atis Treebank is a manually annotated treebank consisting of the sentences in the Atis (Airline Travel Information System) dataset, which contains transcriptions of people asking automated inquiry systems for flight information.

Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız
Repository master dev
README
Treebank hub page
Download

ParTUT 49K

UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

GENTLE 17K

Repository for the Genre Tests for Linguistic Evaluation (GENTLE) Corpus

Contributors: Tatsuya Aoyama, Shabnam Behzad, Luke Gessler, Lauren Levine, Yi-Ju Jessica Lin, Yang Janet Liu, Siyao Logan Peng, Yilun Zhu, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

PUD 21K

This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jesse Kirchner, Lorenzo Lambertino, Martin Popel, Daniel Zeman, Christopher Manning, Sebastian Schuster, Siva Reddy
Repository master dev
README
Treebank hub page
Download

LinES 94K

UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.

Contributors: Lars Ahrenberg
Repository master dev
README
Treebank hub page
Download

Pronouns 1K

UD English-Pronouns is a dataset created to make pronoun identification more accurate and with a more balanced distribution across genders. The dataset initially targets the Independent Genitive pronouns, "hers", (independent) "his", (singular) "theirs", "mine", and (singular) "yours".

Contributors: Robert Munro
Repository master dev
README
Treebank hub page
Download

ESLSpok 21K

This repository includes the Dependency Treebank of Spoken L2 English (SL2E), which consists of Universal Dependency annotations for a random sample of sentences from the [NICT JLE](https://alaginrc.nict.go.jp/nict_jle/index_E.html), a corpus of spoken second language English. [The homepage of the project is here.](https://github.com/LCR-ADS-Lab/SL2E-Dependency-Treebank)

Contributors: Kris Kyle, Masaki Eguchi, Aaron Miller, Ted Sither
Repository master dev
README
Treebank hub page
Download

CTeTex 9K

UD_English-CTeTex is a technical text corpus annotated in Universal Dependency syntax containing 196 software requirements.

Contributors: Naïma Hassert, Pierre André Ménard, Edith Galy
Repository master dev
README
Treebank hub page
Download

GUMReddit 16K

Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/)

Contributors: Siyao Peng, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

Erzya 1 20K Uralic, Mordvin

Erzya treebanks

JR 20K

UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language; it originally consists of a sample from a number of fiction authors writing originals in Erzya.

Contributors: Jack Rueter, Francis Tyers, Elena Klementieva, Olga Erina, Ivan Riabov
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Estonian 2 529K Uralic, Finnic

Estonian treebanks

EDT 438K

UD Estonian is a converted version of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts. The treebank contains 30,972 trees, 437,769 tokens.

Contributors: Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Andriela Rääbis, Liisi Torga
Repository master dev
README
Treebank hub page
Download

EWT 90K

UD EWT treebank consists of different genres of new media. The treebank contains 7,190 trees, 90,585 tokens.

Contributors: Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Dage Särg, Sandra Eiche, Andriela Rääbis
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Estonian treebanks.

Language documentation

See the language documentation page.

Faroese 2 50K IE, Germanic

Faroese treebanks

OFT 10K

This is a treebank of Faroese based on the Faroese Wikipedia.

Contributors: Daniel Zeman, Bjartur Mortensen, Francis Tyers
Repository master dev
README
Treebank hub page
Download

FarPaHC 40K

UD_Faroese-FarPaHC is a conversion of the [Faroese Parsed Historical Corpus (FarPaHC)](https://github.com/einarfs/farpahc) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).

Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Anton Karl Ingason, Eiríkur Rögnvaldsson, Joel C. Wallenberg
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Faroese treebanks.

Language documentation

See the language documentation page.

Finnish 4 397K Uralic, Finnic

Finnish treebanks

TDT 202K

UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.

Contributors: Filip Ginter, Jenna Kanerva, Veronika Laippala, Niko Miekka, Anna Missilä, Stina Ojala, Sampo Pyysalo
Repository master dev
README
Treebank hub page
Download

FTB 159K

FinnTreeBank 1 consists of manually annotated grammatical examples from VISK. The UD version of FinnTreeBank 1 was converted from a native annotation model with a script and later manually revised.

Contributors: Jussi Piitulainen, Hanna Nurmi, Jack Rueter
Repository master dev
README
Treebank hub page
Download

PUD 15K

Contributors: Jenna Kanerva, Filip Ginter, Stina Ojala, Anna Missilä
Repository master dev
README
Treebank hub page
Download

OOD 19K

Finnish-OOD is an external out-of-domain test set for Finnish-TDT, annotated natively in the UD scheme.

Contributors: Jenna Kanerva
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Finnish treebanks.

Language documentation

See the language documentation page.

French 7 635K IE, Romance

French treebanks

GSD 400K

**UD_French-GSD** was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). Since 2015 it has been updated independently of that original source.

Contributors: Marie-Catherine de Marneffe, Bruno Guillaume, Ryan McDonald, Alane Suhr, Joakim Nivre, Matias Grioni, Carly Dickerson, Guy Perrier
Repository master dev
README
Treebank hub page
Download

Sequoia 70K

**UD_French-Sequoia** is an automatic conversion of the [SUD_French-Sequoia](https://github.com/surfacesyntacticud/SUD_French-Sequoia) treebank, which is derived from the earlier [French Sequoia corpus](http://deep-sequoia.inria.fr).

Contributors: Marie Candito, Djamé Seddah, Guy Perrier, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

ParTUT 28K

UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

ParisStories 42K

Paris Stories is a corpus of oral French collected and transcribed by Linguistics students from Sorbonne Nouvelle and corrected by students from the Plurital Master's Degree of Computational Linguistics (Inalco, Paris Nanterre, Sorbonne Nouvelle) between 2017 and 2021. It contains monologues and dialogues from speakers living in the Parisian region.

Contributors: Kim Gerdes, Sylvain Kahane, Menel Mahamdi
Repository master dev
README
Treebank hub page
Download

Rhapsodie 44K

A Universal Dependencies corpus for spoken French.

Contributors: Kim Gerdes, Sylvain Kahane, Mariam Nakhlé, Chunxiao Yan, Aline Etienne, Marine Courtin
Repository master dev
README
Treebank hub page
Download

PUD 24K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jana Strnadová, Gauthier Caron, Martin Popel, Daniel Zeman, Marie-Catherine de Marneffe, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

FQB 23K

The corpus **UD_French-FQB** is an automatic conversion of the [French QuestionBank v1](http://alpage.inria.fr/Treebanks/FQB/), a corpus entirely made of questions.

Contributors: Djamé Seddah, Marie Candito, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Frisian Dutch 1 3K Code switching

Frisian Dutch treebanks

Fame 3K

UD_Frisian_Dutch-Fame is a selection of 400 sentences from the FAME! speech corpus by Yilmaz et al. (2016a, 2016b). The treebank is manually annotated using the UD scheme.

Contributors: Anouck Braggaar, Rob van der Goot
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Galician 3 187K IE, Romance

Galician treebanks

TreeGal 25K

The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña) and at CiTIUS (Universidade de Santiago de Compostela).

Contributors: Marcos Garcia, Xulia Sánchez-Rodríguez
Repository master dev
README
Treebank hub page
Download

PUD 23K

The Galician PUD is a treebank for Galician developed at CiTIUS (Universidade de Santiago de Compostela).

Contributors: Albina Sarymsakova, Xulia Sánchez-Rodríguez, Marcos Garcia
Repository master dev
README
Treebank hub page
Download

CTG 138K

The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus (http://sli.uvigo.gal/CTG) created at the University of Vigo by the TALG NLP research group.

Contributors: Xavier Gómez Guinovart
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Galician treebanks.

Language documentation

See the language documentation page.

Georgian 1 60K Kartvelian

Georgian treebanks

GLC 60K

The Georgian UD Treebank (UD_Georgian-GLC) is the first syntactically annotated corpus of Georgian, based on a collection of annotated sentences selected from the Georgian Language Corpus (GLC) available at http://corpora.iliauni.edu.ge/ and sentences selected from Wiki in accordance with the 132 scientific fields.

Contributors: Irina Lobzhanidze
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

German 4 3,810K IE, Germanic

German treebanks

HDT 3,455K

UD German-HDT is a conversion of the Hamburg Dependency Treebank, created at the University of Hamburg through manual annotation in conjunction with a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.

Contributors: Emanuel Borges Völker, Felix Hennig, Arne Köhn, Maximilan Wendt, Verena Blaschke, Nina Böbel, Leonie Weissweiler
Repository master dev
README
Treebank hub page
Download

GSD 292K

The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Slav Petrov, Wolfgang Seeker, Ryan McDonald, Joakim Nivre, Daniel Zeman, Adriane Boyd
Repository master dev
README
Treebank hub page
Download

PUD 21K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Georg Rehm, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Sebastian Bank, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

LIT 40K

This treebank aims to gather texts from German literary history. Currently, it hosts Fragments of early Romanticism, i.e. aphorism-like texts mainly dealing with philosophical issues concerning art, beauty and related topics.

Contributors: Alessio Salomoni
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of German treebanks.

Language documentation

See the language documentation page.

Gheg 1 15K IE, Albanian

Gheg treebanks

GPS 15K

UD Gheg Pear Stories (GPS) contains renarrations of Wallace Chafe's Pear Stories video (pearstories.org) by heritage speakers of Gheg Albanian living in Switzerland and speakers from Prishtina.

Contributors: Christian Ebert, Artan Islamaj, Adrian Kuqi, Barbara Sonnenhauser, Paul Widmer, Magdalena Plamada
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gothic 1 55K IE, Germanic

Gothic treebanks

PROIEL 55K

The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Greek 2 88K IE, Greek

Greek treebanks

GUD 25K

GUD is a resource for EL manually annotated for morphology and syntax. It is an ongoing project led by Stella Markantonatou and Vivian Stamou (hereinafter: the GUD team), both researchers at the [Institute for Language and Speech Processing](http://www.ilsp.gr/) (ILSP/Athena Research Centre).

Contributors: Stella Markantonatou, Vivian Stamou, Socrates Vak
Repository master dev
README
Treebank hub page
Download

GDT 63K

The Greek UD treebank (UD_Greek-GDT) is derived from the Greek Dependency Treebank (http://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

Contributors: Prokopis Prokopidis
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Greek treebanks.

Language documentation

See the language documentation page.

Guajajara 1 9K Tupian, Maweti-Guarani

Guajajara treebanks

TuDeT 9K

UD_Guajajara-TuDeT is a collection of annotated sentences in [Guajajara](https://glottolog.org/resource/languoid/id/guaj1255). Sentences stem from multiple sources such as descriptions of the language, short stories, dictionaries and translations from the New Testament. Sentence annotation and documentation by Lorena Martín Rodríguez and Fabrício Ferraz Gerardi.

Contributors: Lorena Martín Rodríguez, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Guarani 1 <1K Tupian, Maweti-Guarani

Guarani treebanks

OldTuDeT <1K

UD_Guarani-OldTuDeT is a collection of annotated texts in [Old Guaraní](https://glottolog.org/resource/languoid/id/oldp1258). All known sources in this language are being annotated: catechisms, grammars (seventeenth and eighteenth century), sentences from dictionaries, and other texts. Sentence annotation and documentation by Fabrício Ferraz Gerardi and Lorena Martín Rodríguez.

Contributors: Fabrício Ferraz Gerardi, Lorena Martín Rodríguez
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gujarati 1 1K IE, Indic

Gujarati treebanks

GujTB 1K

GujTB is an in-progress treebank of Gujarati (an Indo-Aryan language) in Gujarati script.

Contributors: Maitrey Mehta, Mayank Jobanputra
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Gwichin 1 1K Na-Dene

Gwichin treebanks

TueCL 1K

UD_Gwichin-TueCL is a small treebank of Alaskan Gwich'in, an endangered Athabascan language, based on material located in the Alaska Native Language Archive.

Contributors: Matthew Andrews, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Haitian Creole 1 3K Creole

Haitian Creole treebanks

Autogramm 3K

This is a treebank of Haitian Creole. It contains 144 sentences selected from 3 major genres: the Bible, literary texts, and newspapers. Kreyòl (Kreyòl Ayisyen, Haitian Creole, ISO 639-1: ht) is the main language of Haïti. The dialect described here is the Cap Haïtien dialect, which differs slightly in its lexicon from the Central and Southern varieties.

Contributors: Claudel Pierre-Louis, Sandra Jagodzińska, Sylvain Kahane, Agata Savary, Emmanuel Schang
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Hausa 2 18K Afro-Asiatic, West Chadic

Hausa treebanks

SouthernAutogramm 14K

This treebank contains data of Southern Autogramm, for the Zaria dialect of Nigeria (Southern Hausa).

Contributors: Bernard Caron
Repository master dev
README
Treebank hub page
Download

NorthernAutogramm 4K

This treebank contains data of Northern Autogramm, for the Ader dialect of Niger Republic (Northern Hausa).

Contributors: Bernard Caron
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hausa treebanks.

Language documentation

See the language documentation page.

Hebrew 3 368K Afro-Asiatic, Semitic

Hebrew treebanks

IAHLTwiki 140K

Publicly available subset of the IAHLT UD Hebrew Treebank's Wikipedia section (https://www.iahlt.org/)

Contributors: Amir Zeldes, Avner Algom, Noam Ordan, Yifat Ben Moshe, Shira Wigderson
Repository master dev
README
Treebank hub page
Download

IAHLTknesset 67K

Publicly available IAHLT UD Hebrew Treebank's Knesset section (https://www.iahlt.org/)

Contributors: Amir Zeldes, Avner Algom, Noam Ordan, Yifat Ben Moshe, Nick Howell, Shira Wigderson, Omer Strass, Israel Landau, Netanel Dahan, Yael Minerbi, Hilla Merhav, Emmanuelle Kowner, Shuli Wintner, Gili Goldin, Ella Rabinovhich, Vladimir Gurevich
Repository master dev
README
Treebank hub page
Download

HTB 160K

A Universal Dependencies Corpus for Hebrew.

Contributors: Yoav Goldberg, Reut Tsarfaty, Amir More, Shoval Sadde, Victoria Basmov, Yuval Pinter
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hebrew treebanks.

Language documentation

See the language documentation page.

Highland P. Nahuatl 1 10K Uto-Aztecan

Highland Puebla Nahuatl treebanks

ITML 10K

UD_Highland_Puebla_Nahuatl-ITML is a collection of texts in the Highland Puebla variety of Nahuatl (ISO-639: `azz`) spoken in 24 municipalities in the state of Puebla in Mexico. The treebank contains spoken monologue and dialogue, scientific texts translated from Spanish and some miscellaneous grammatical examples from a language course.

Contributors: Robert Pugh, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Hindi 2 375K IE, Indic

Hindi treebanks

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Esha Banerjee, Pinkey Nainwani, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

HDTB 351K

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.

Contributors: Riyaz Ahmad Bhat, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hindi treebanks.

Language documentation

See the language documentation page.

Hittite 1 1K IE, Anatolian

Hittite treebanks

HitTB 1K

UD_Hittite-HitTB is a small Universal Dependencies treebank for Hittite, containing original sentences from Hoffner and Melchert's tutorial to A Grammar of the Hittite Language.

Contributors: Erik Andersen, Ben Rozonoyer
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Hungarian 1 42K Uralic, Ugric

Hungarian treebanks

Szeged 42K

The Hungarian UD treebank is derived from the Szeged Dependency Treebank (Vincze et al. 2010).

Contributors: Richárd Farkas, Katalin Simkó, Zsolt Szántó, Viktor Varga, Veronika Vincze
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Icelandic 4 1,183K IE, Germanic

Icelandic treebanks

IcePaHC 985K

UD_Icelandic-IcePaHC is a conversion of the [Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).

Contributors: Þórunn Arnardóttir, Hinrik Hafsteinsson, Einar Freyr Sigurðsson, Hildur Jónsdóttir, Kristín Bjarnadóttir, Anton Karl Ingason, Kristján Rúnarsson, Steinþór Steingrímsson, Joel C. Wallenberg, Eiríkur Rögnvaldsson
Repository master dev
README
Treebank hub page
Download

Modern 80K

UD_Icelandic-Modern is a conversion of the [modern additions](https://github.com/antonkarl/icecorpus/tree/master/additions2019) to the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme.

Contributors: Kristján Rúnarsson, Þórunn Arnardóttir, Hinrik Hafsteinsson, Starkaður Barkarson, Hildur Jónsdóttir, Steinþór Steingrímsson, Einar Freyr Sigurðsson
Repository master dev
README
Treebank hub page
Download

PUD 18K

Icelandic-PUD is the Icelandic part of the Parallel Universal Dependencies (PUD) treebanks.

Contributors: Hildur Jónsdóttir
Repository master dev
README
Treebank hub page
Download

GC 99K

UD_Icelandic-GC is a conversion of the gold part of [GreynirCorpus](https://github.com/mideind/GreynirCorpus), which has been manually corrected and verified. The corpus is parsed into full constituency trees, and converted using [UDConverter-GreynirCorpus](https://github.com/thorunna/UDConverter-GreynirCorpus).

Contributors: Vilhjálmur Þorsteinsson, Hulda Óladóttir, Þórunn Arnardóttir, Sveinbjörn Þórðarson, Haukur Barri Símonarson, Katla Ásgeirsdóttir
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Icelandic treebanks.

Language documentation

See the language documentation page.

Indonesian 3 169K Austronesian, Malayo-Sumbawan

Indonesian treebanks

PUD 19K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Ruli Manurung, Muh Shohibussirri, Martin Popel, Daniel Zeman, Ika Alfina, Arawinda Dinakaramani, Muhammad Yudistira Hanifmuti, Jessica Naraiswari Arwidarasti, Yogi Lesmana Sulestio
Repository master dev
README
Treebank hub page
Download

GSD 122K

The Indonesian-GSD treebank was originally converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb) in 2015. In order to comply with the latest Indonesian annotation guidelines, the treebank has undergone a major revision between UD releases v2.8 and v2.9 (2021).

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Septina Dian Larasati, Ika Alfina
Repository master dev
README
Treebank hub page
Download

CSUI 28K

UD Indonesian-CSUI is a conversion from an Indonesian constituency treebank in the Penn Treebank format named [**Kethu**](https://github.com/ialfina/kethu) that was also a conversion from a constituency treebank built by [**Dinakaramani et al. (2015)**](https://github.com/famrashel/idn-treebank). We named this treebank **Indonesian-CSUI**, since all the three versions of the treebanks were built at Faculty of Computer Science, Universitas Indonesia.

Contributors: Ika Alfina, Jessica Naraiswari Arwidarasti, Muhammad Yudistira Hanifmuti, Arawinda Dinakaramani, Ruli Manurung, Fam Rashel, Andry Luthfi
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Indonesian treebanks.

Language documentation

See the language documentation page.

Irish 3 168K IE, Celtic

Irish treebanks

IDT 115K

A Universal Dependencies 4910-sentence treebank for modern Irish.

Contributors: Teresa Lynn, Jennifer Foster, Sarah McGuinness, Abigail Walsh, Jason Phelan, Kevin Scannell
Repository master dev
README
Treebank hub page
Download

TwittIrish 47K

A Universal Dependencies treebank of 2596 tweets in modern Irish.

Contributors: Lauren Cassidy, Teresa Lynn, Jennifer Foster, Sarah McGuinness
Repository master dev
README
Treebank hub page
Download

Cadhan 4K

This is the Cadhan Aonair UD treebank, consisting of 150 sentences randomly sampled from six pre-standard Irish texts. It was subsequently augmented with a late Early Modern Irish syllabic poem representing 43 sentences, described in a [separate section below](#bardic-segment).

Contributors: Kevin Scannell, Theodorus Fransen
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Irish treebanks.

Language documentation

See the language documentation page.

Italian 10 1,001K IE, Romance

Italian treebanks

ISDT 298K

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

Contributors: Cristina Bosco, Alessandro Lenci, Simonetta Montemagni, Maria Simi
Repository master dev
README
Treebank hub page
Download

VIT 280K

The UD_Italian-VIT corpus was obtained by conversion from VIT (Venice Italian Treebank), developed at the Laboratory of Computational Linguistics of the Università Ca' Foscari in Venice (Delmonte et al. 2007; Delmonte 2009; http://rondelmo.it/resource/VIT/Browser-VIT/index.htm).

Contributors: Fabio Tamburini, Maria Simi, Cristina Bosco
Repository master dev
README
Treebank hub page
Download

Old 122K

Italian-Old is a treebank containing **Dante Alighieri's Comedy**, based on the 1994 Petrocchi edition and taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. The syntactic annotation was done from scratch, following the UD annotation scheme. It is a treebank of Old Italian, specifically Florentine. The Comedy was composed between approximately 1306 and 1321.

Contributors: Claudia Corbetta, Marco Passarotti, Flavio Massimiliano Cecchini, Giovanni Moretti
Repository master dev
README
Treebank hub page
Download

ParTUT 55K

UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

ParlaMint 20K

ParlaMint-It is a collection of transcriptions of parliamentary sessions of the Italian Senate annotated in Universal Dependencies. The corpus is part of a larger multilingual collection of parliamentary transcripts built during the ParlaMint project (https://www.clarin.eu/parlamint).

Contributors: Chiara Alzetta, Marta Sartor, Simonetta Montemagni, Giulia Venturi
Repository master dev
README
Treebank hub page
Download

TWITTIRO 29K

TWITTIRÒ-UD is a collection of ironic Italian tweets annotated in Universal Dependencies. The treebank can be exploited for the training of NLP systems to enhance their performance on social media texts, and in particular, for irony detection purposes.

Contributors: Alessandra T. Cignarella, Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

Valico 6K

Manually corrected treebank of Learner Italian drawn from the Valico corpus, together with the corresponding corrected sentences.

Contributors: Elisa Di Nuovo, Manuela Sanguinetti, Cristina Bosco, Alessandro Mazzei
Repository master dev
README
Treebank hub page
Download

PoSTWITA 124K

PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

MarkIT 40K

The MarkIT resource contains around 800 sentences extracted from students' essays and manually annotated with syntactic dependencies. The treebank covers seven types of marked constructions, plus some ambiguous sentences whose syntax can be wrongly classified as marked.

Contributors: Teresa Paccosi, Alessio Palmero Aprosio, Sara Tonelli
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Antonio Stella, Davide Rovati, Martin Popel, Daniel Zeman, Maria Simi, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Italian treebanks.

Language documentation

See the language documentation page.

Japanese 6 2,645K Japanese

Japanese treebanks

GSD 193K

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

GSDLUW 150K

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Ryan McDonald, Joakim Nivre, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 28K

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUDLUW 22K

Contributors: Mai Omura, Yusuke Miyao, Hiroshi Kanayama, Hiroshi Matsuda, Aya Wakasa, Kayo Yamashita, Masayuki Asahara, Takaaki Tanaka, Yugo Murawaki, Yuji Matsumoto, Kaoru Ito, Taishi Chika, Shinsuke Mori, Sumire Uematsu, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

BCCWJ 1,253K

Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
Repository master dev
README
Treebank hub page
Download

BCCWJLUW 995K

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ). UD-Japanese-BCCWJLUW is an alternative word-segmentation version of UD-Japanese-BCCWJ: it uses the **Long Unit Word (LUW)** as the syntactic word in the UD sense.

Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Javanese 1 14K Austronesian, Javanese

Javanese treebanks

CSUI 14K

UD Javanese-CSUI is a dependency treebank in Javanese, a regional language in Indonesia with more than 68 million users. It was developed by Alfina et al. from the Faculty of Computer Science, Universitas Indonesia. The newest version has 1000 sentences and 14K words with manual annotation.

Contributors: Ika Alfina, Arlisa Yuliawati, Dipta Tanaya, Arawinda Dinakaramani, Daniel Zeman, Putri Rizqiyah, Sri Hartati Wijono
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kaapor 1 <1K Tupian, Maweti-Guarani

Kaapor treebanks

TuDeT <1K

**UD_Kaapor-TuDeT** is a collection of annotated sentences in [Ka'apor](https://glottolog.org/resource/languoid/id/urub1250). The project is a work in progress and the treebank is being updated on a regular basis.

Contributors: Fabrício Ferraz Gerardi, Carolina Aragon, Gustavo Godoy
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kangri 1 2K IE, Indic

Kangri treebanks

KDTB 2K

The Kangri UD Treebank (KDTB) is a part of the Universal Dependency treebank project.

Contributors: Shweta Chauhan, Shefali Saxena, Apoorva Jha, Philemon Daniel
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Karelian 1 3K Uralic, Finnic

Karelian treebanks

KKPP 3K

UD Karelian-KKPP is a new, manually annotated corpus of Karelian in the Universal Dependencies annotation scheme. The data is collected from the [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts, along with some stories and educational texts.

Contributors: Tommi A Pirinen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Karo 1 2K Tupian, Ramarama

Karo treebanks

TuDeT 2K

UD_Karo-TuDeT is a collection of annotated sentences in [Karo](https://glottolog.org/resource/languoid/id/karo1306). The sentences stem from the only grammatical description of the language (Gabas, 1999) and from the sentences in the dictionary by the same author (Gabas, 2007). Sentence annotation and documentation by Fabrício Ferraz Gerardi.

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kazakh 1 10K Turkic, Northwestern

Kazakh treebanks

KTB 10K

The UD Kazakh treebank is a combination of text from various sources including Wikipedia, some folk tales, sentences from the UDHR, news and phrasebook sentences. Sentence IDs include partial document identifiers.

Contributors: Aibek Makazhanov, Jonathan North Washington, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Khunsari 1 <1K IE, Iranian

Khunsari treebanks

AHA <1K

The AHA Khunsari Treebank is a small treebank for contemporary Khunsari. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Khunsari speakers.

Contributors: AmirHossein Mojiri Foroushani, Hamid Aghaei, Amir Ahmadi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kiche 1 10K Mayan

Kiche treebanks

IU 10K

UD Kʼicheʼ-IU is a treebank consisting of sentences from a variety of text domains but principally dictionary example sentences and linguistic examples.

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Komi Permyak 1 1K Uralic, Permic

Komi Permyak treebanks

UH 1K

This is a Komi-Permyak literary language treebank consisting of original and translated texts.

Contributors: Larisa Ponomareva, Niko Partanen, Jack Rueter, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Komi Zyrian 2 10K Uralic, Permic

Komi Zyrian treebanks

Lattice 8K

UD Komi-Zyrian Lattice is a treebank of written standard Komi-Zyrian.

Contributors: Niko Partanen, KyungTae Lim, Thierry Poibeau, Jack Rueter
Repository master dev
README
Treebank hub page
Download

IKDP 2K

This treebank consists of dialectal transcriptions of spoken Komi-Zyrian. The current texts are short recorded segments from different areas where the Iźva dialect of the Komi language is spoken.

Contributors: Niko Partanen, Rogier Blokland, Michael Rießler, Jack Rueter
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Komi Zyrian treebanks.

Language documentation

See the language documentation page.

Korean 4 513K Korean

Korean treebanks

KSL 66K

UD_Korean-KSL is a dependency treebank of L2 Korean, featuring morpheme and Universal Dependency manual annotations for six hundred randomly sampled texts from the Kyung Hee Korean Learner Corpus (which is no longer available).

Contributors: Hakyung Sung, Gyu-Ho Shin
Repository master dev
README
Treebank hub page
Download

Kaist 350K

The KAIST Korean Universal Dependency Treebank is generated by Chun et al., 2018 from the constituency trees in the [KAIST Tree-Tagging Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus4).

Contributors: Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

GSD 80K

The Google Korean Universal Dependency Treebank is first converted from the [Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb), and then enhanced by Chun et al., 2018.

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

PUD 16K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Sookyoung Kwak, Yongseok Cho, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Korean treebanks.

Language documentation

See the language documentation page.

Kurmanji 1 10K IE, Iranian

Kurmanji treebanks

MG 10K

The UD Kurmanji corpus is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.

Contributors: Memduh Gökırmak, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Kyrgyz 2 24K Turkic, Northwestern

Kyrgyz treebanks

KTMU 23K

UD_Kyrgyz-KTMU is a dependency treebank of the Kyrgyz language. Sentences were selected partly from Kyrgyz story books and novels, partly from Kyrgyz news websites.

Contributors: İbrahim Benli
Repository master dev
README
Treebank hub page
Download

TueCL 1K

This is a small treebank of grammatical examples for Kyrgyz.

Contributors: Bermet Chontaeva, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Kyrgyz treebanks.

Language documentation

See the language documentation page.

Latgalian 1 <1K IE, Baltic

Latgalian treebanks

Cairo <1K

UD_Latgalian-Cairo is an example treebank that provides a minimal dataset for Latgalian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at the Institute of Mathematics and Computer Science, University of Latvia.

Contributors: Lauma Pretkalniņa, Gunta Nešpore-Bērzkalne
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Latin 6 1,002K IE, Italic

Latin treebanks

ITTB 450K

Latin data from the _Index Thomisticus_ Treebank. Data are taken from the _Index Thomisticus_ corpus by Roberto Busa SJ, which contains the complete work by Thomas Aquinas (1225–1274; Medieval Latin) and by 61 other authors related to Thomas.

Contributors: Marco Passarotti, Marinella Testori, Daniel Zeman, Berta González Saavedra, Flavio Massimiliano Cecchini
Repository master dev
README
Treebank hub page
Download

LLCT 242K

This Universal Dependencies version of the **LLCT** (Late Latin Charter Treebank) consists of an automated conversion of the **LLCT2** treebank from the Latin Dependency Treebank (LDT) format into the Universal Dependencies standard.

Contributors: Timo Korkiakangas, Flavio Massimiliano Cecchini, Marco Passarotti
Repository master dev
README
Treebank hub page
Download

UDante 55K

The **UDante** treebank is based on the Latin texts of Dante Alighieri, taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. It is a treebank of Latin language, more precisely of **literary Medieval Latin** (XIVth century).

Contributors: Flavio Massimiliano Cecchini, Giovanni Moretti, Marco Passarotti, Rachele Sprugnoli, Daniela Corbetta, Federica Favero, Federica Gamba, Martina de Laurentiis, Giulia Pedonese, Andrea Peverelli, Elena Vagnoni, Mirko Tavoni
Repository master dev
README
Treebank hub page
Download

Perseus 29K

This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

Contributors: Giuseppe G. A. Celano, Daniel Zeman, Federica Gamba
Repository master dev
README
Treebank hub page
Download

CIRCSE 18K

UD_Latin-CIRCSE is a repository of treebanks featuring Latin texts natively annotated at the CIRCSE Research Centre in Milan (https://centridiricerca.unicatt.it/circse/en.html) following the Universal Dependencies (UD) (https://universaldependencies.org) annotation scheme. The repository includes prose and poetry texts from different periods.

Contributors: Federica Iurescia, Federica Gamba, Flavio Massimiliano Cecchini, Francesco Mambrini, Giovanni Moretti, Marco Passarotti, Paolo Ruffolo
Repository master dev
README
Treebank hub page
Download

PROIEL 205K

The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War, Cicero's Letters to Atticus, Palladius' Opus Agriculturae and the first book of Cicero's De officiis.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Latin treebanks.

Language documentation

See the language documentation page.

Latvian 2 328K IE, Baltic

Latvian treebanks

LVTB 328K

Latvian UD Treebank is based on Latvian Treebank ([LVTB](http://sintakse.korpuss.lv)), being created at University of Latvia, Institute of Mathematics and Computer Science, [Artificial Intelligence Laboratory](http://ailab.lv).

Contributors: Lauma Pretkalniņa, Laura Rituma, Gunta Nešpore-Bērzkalne, Baiba Saulīte, Artūrs Znotiņš, Normunds Grūzītis
Repository master dev
README
Treebank hub page
Download

Cairo <1K

This is an example treebank made to illustrate UD annotation choices made for Latvian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at the Institute of Mathematics and Computer Science, University of Latvia.

Contributors: Lauma Pretkalniņa, Laura Rituma, Baiba Saulīte, Gunta Nešpore-Bērzkalne
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Latvian treebanks.

Language documentation

See the language documentation page.

Ligurian 1 6K IE, Romance

Ligurian treebanks

GLT 6K

The Genoese Ligurian Treebank is a small, manually annotated collection of contemporary Ligurian prose. The focus of the treebank is written Genoese, the koiné variety of Ligurian which is associated with today's literary, journalistic and academic ligurophone sphere.

Contributors: Stefano Lusito, Jean Maillard
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Lithuanian 2 75K IE, Baltic

Lithuanian treebanks

HSE 5K

Lithuanian treebank annotated manually (dependencies) using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/) and manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.

Contributors: Olga Lyashevskaya, Dmitri Sitchinava
Repository master dev
README
Treebank hub page
Download

ALKSNIS 70K

The Lithuanian dependency treebank ALKSNIS v3.0 (Vytautas Magnus University).

Contributors: Andrius Utka, Erika Rimkutė, Agnė Bielinskienė, Jolanta Kovalevskaitė, Loïc Boizou, Gabrielė Aleksandravičiūtė, Kristina Brokaitė, Daniel Zeman, Natalia Perkova, Bernadeta Griciūtė
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Lithuanian treebanks.

Language documentation

See the language documentation page.

Livvi 1 1K Uralic, Finnic

Livvi treebanks

KKPP 1K

UD Livvi-KKPP is a new, manually annotated corpus of Livvi-Karelian made directly in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts, along with some stories and educational texts.

Contributors: Tommi A Pirinen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Low Saxon 1 22K IE, Germanic

Low Saxon treebanks

LSDC 22K

The UD Low Saxon LSDC dataset consists of sentences in 8 major Low Saxon dialect groups from both Germany and the Netherlands. These sentences are (or are to become) part of the LSDC dataset and represent the language from mostly the 19th and early 20th century in genres such as short stories, novels, speeches, letters and fairytales.

Contributors: Janine Siewert
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Luxembourgish 1 <1K IE, Germanic

Luxembourgish treebanks

LuxBank <1K

The LuxBank corpus currently consists of the translated Cairo CICLing examples, and will be extended to include examples from a national dataset. It is the first comprehensive treebank dataset for Luxembourgish.

Contributors: Alistair Plum, Christoph Purschke, Caroline Döhmer, Anne-Marie Lutgen, Emilia Milano
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Macedonian 1 1K IE, Slavic

Macedonian treebanks

MTB 1K

The Macedonian-MTB treebank is a collection of annotated sentences taken from the Macedonian version of the Cairo CICLing Corpus and from the university textbook in syntax "Contemporary Macedonian Language 4" by Simov Sazdov.

Contributors: Vladimir Cvetkoski
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Madi 1 <1K Arawan

Madi treebanks

Jarawara <1K

UD_Madi-Jarawara is a collection of annotated sentences in Madí (Jarawara dialect) from a variety of sources, including grammar examples, oral stories, didactic material, and dictionary examples.

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Maghrebi Arabic French 1 19K Code switching

Maghrebi Arabic French treebanks

Arabizi 19K

A Universal Dependencies corpus for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. In addition to the UD annotations, we added NER annotations extending the French Treebank NER scheme (Sagot et al., 2012) and offensive language classification, and corrected many of the translations (still ongoing).

Contributors: Arij Riabi, Farah Essaidi, Amal Fethi, Menel Mahamdi, Djamé Seddah
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Makurap 1 <1K Tupian, Tupari

Makurap treebanks

TuDeT <1K

UD_Makuráp-TuDeT is a collection of annotated texts in Makuráp. The project is a work in progress and the treebank is being updated on a regular basis. The sentences are being annotated by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos, and Luan Cabral.

Contributors: Carolina Aragon, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Malayalam 1 2K Dravidian

Malayalam treebanks

UFAL 2K

Currently just a small sample of Malayalam grammatical examples.

Contributors: Abishek Stephen, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Maltese 1 44K Afro-Asiatic, Semitic

Maltese treebanks

MUDT 44K

MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.

Contributors: Slavomír Čéplö, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Manx 1 20K IE, Celtic

Manx treebanks

Cadhan 20K

This is the Cadhan Aonair UD treebank for Manx Gaelic, created by Kevin Scannell.

Contributors: Kevin Scannell
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Marathi 1 3K IE, Indic

Marathi treebanks

UFAL 3K

UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.

Contributors: Vinit Ravishankar
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Mbya Guarani 2 13K Tupian, Maweti-Guarani

Mbya Guarani treebanks

Thomas 1K

UD Mbya_Guarani-Thomas is a corpus of Mbyá Guaraní (Tupian) texts collected by Guillaume Thomas. The current version of the corpus consists of three speeches by Paulina Kerechu Núñez Romero, a Mbyá Guaraní speaker from Ytu, Caazapá Department, Paraguay.

Contributors: Guillaume Thomas
Repository master dev
README
Treebank hub page
Download

Dooley 11K

UD Mbya_Guarani-Dooley is a corpus of narratives written in Mbyá Guaraní (Tupian) in Brazil, and collected by Robert Dooley. Due to copyright restrictions, the corpus that is distributed as part of UD only contains the annotation (tags, features, relations) while the FORM and LEMMA columns are empty.

Contributors: Guillaume Thomas
Repository master dev
README
Treebank hub page
Download
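
The Dooley description above notes that the distributed files keep the annotation but leave the FORM and LEMMA columns empty. As a minimal sketch, assuming the empty columns are written with the CoNLL-U underscore placeholder (`_`), such a file could be recognized as follows. The function names and the sample rows are hypothetical and made up for illustration; they are not taken from the treebank.

```python
# Minimal sketch: detect CoNLL-U token lines whose FORM and LEMMA are blank ("_").
# Assumption: empty columns use the "_" placeholder. SAMPLE is invented data.

CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


def has_stripped_text(conllu_text):
    """Return True if every token line has '_' in both FORM and LEMMA."""
    tokens = [
        parse_token_line(line)
        for line in conllu_text.splitlines()
        if line and not line.startswith("#")  # skip blank and comment lines
    ]
    return bool(tokens) and all(
        t["FORM"] == "_" and t["LEMMA"] == "_" for t in tokens
    )


# Hypothetical two-token sentence with annotation but no text.
SAMPLE = "\n".join([
    "# sent_id = hypothetical-1",
    "1\t_\t_\tNOUN\t_\tNumber=Sing\t2\tnsubj\t_\t_",
    "2\t_\t_\tVERB\t_\t_\t0\troot\t_\t_",
])

print(has_stripped_text(SAMPLE))
```

A check like this can be useful before training a tagger or parser, since a treebank without word forms supports only experiments on the annotation layers.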

See here for comparative statistics of Mbya Guarani treebanks.

Language documentation

See the language documentation page.

Middle French 1 12K IE, Romance

Middle French treebanks

PROFITEROLE 12K

UD_Middle_French-PROFITEROLE is the Middle French section of the PROFITEROLE corpus; the Old French section is UD_OLD_FRENCH-PROFITEROLE.

Contributors: Sophie Prévost, Eric Villemonte de la Clergerie, Mathilde Regnault, Loïc Grobol, Benoît Crabbé, Mathieu Dehouck, Alexei Lavrentiev
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Moksha 1 4K Uralic, Mordvin

Moksha treebanks

JR 4K

UD_Moksha-JR originates from the Erme Universal Dependencies annotated texts for Moksha, which provide CoNLL-U annotation for texts in the Moksha language. It originally consisted of a sample from a number of fiction authors writing originals in Moksha.

Contributors: Jack Rueter, Maria Levina, Nadezhda Kabaeva, Judit Molnár, Khalid Alnajjar
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Munduruku 1 1K Tupian, Munduruku

Munduruku treebanks

TuDeT 1K

UD_Munduruku-TuDeT is a collection of annotated sentences in [Mundurukú](http://www.endangeredlanguages.com/lang/2981). The project is a work in progress and the treebank is being updated on a regular basis.

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Naija 1 140K Creole

Naija treebanks

NSC 140K

A Universal Dependencies corpus for spoken Naija (Nigerian Pidgin).

Contributors: Bernard Caron, Emmett Strickland, Marine Courtin, Kim Gerdes, Bruno Guillaume, Sylvain Kahane, Chika Kennedy Ajede, Emeka Onwuegbuzia, Samson Tella
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Nayini 1 <1K IE, Iranian

Nayini treebanks

AHA <1K

The AHA Nayini Treebank is a small treebank for contemporary Nayini. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Nayini speakers.

Contributors: AmirHossein Mojiri Foroushani, Hamid Aghaei, Amir Ahmadi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Neapolitan 1 <1K IE, Romance

Neapolitan treebanks

RB <1K

This treebank contains example sentences in Neapolitan, translated by a native speaker.

Contributors: Rodolfo Basile
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Nheengatu 1 19K Tupian, Maweti-Guarani

Nheengatu treebanks

CompLin 19K

The [UD_Nheengatu-CompLin](https://aclanthology.org/2024.propor-2.8) is a treebank of [Nheengatu](https://glottolog.org/resource/languoid/id/nhen1239) or Nhengatu (ISO-639: `yrl`), also known, inter alia, as Modern Tupi and *Língua Geral Amazônica*. It comprises sentences from diverse published sources, e.g., spontaneous speech, grammatical descriptions, fables, myths, coursebooks, and dictionaries.

Contributors: Leonel Figueiredo de Alencar
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

North Sami 1 26K Uralic, Sami

North Sami treebanks

Giella 26K

This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.

Contributors: Trond Trosterud, Lene Antonsen, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Northwest Gbaya 1 2K Niger-Congo, Gbaya-Manza-Ngbaka

Northwest Gbaya treebanks

Autogramm 2K

A Universal Dependencies corpus for Northwest Gbaya, a member of the Gbaya branch of the Atlantic-Congo phylum. The language is mainly spoken by about 250,000 speakers in the Central African Republic.

Contributors: Paulette Roulon
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Norwegian 2 611K IE, Germanic

Norwegian treebanks

Bokmaal 310K

The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. The current version of NDT has been automatically converted to the UD scheme by Ingerid Løyning Dale, Per Erik Solberg and Andre Kåsen at the Norwegian Language Bank at the National Library of Norway. This conversion builds to a large extent on previous conversions by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle, Ingerid Løyning Dale, Per Erik Solberg, Andre Kåsen
Repository master dev
README
Treebank hub page
Download

Nynorsk 301K

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle, Ingerid Løyning Dale, Per Erik Solberg, Andre Kåsen
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Norwegian treebanks.

Language documentation

See the language documentation page.

Old Church Slavonic 1 198K IE, Slavic

Old Church Slavonic treebanks

PROIEL 198K

The Old Church Slavonic (OCS) UD treebank is based on canonical Old Church Slavonic data from the PROIEL and TOROT treebanks.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old East Slavic 4 553K IE, Slavic

Old East Slavic treebanks

RNC 168K

`UD_Old_East_Slavic-RNC` is a sample of the Middle Russian corpus (1300-1700), a part of the Russian National Corpus. The data were originally annotated according to the RNC and extended UD-Russian morphological schemas and UD 2.4 dependency schema.

Contributors: Olga Lyashevskaya, Dmitri Sitchinava
Repository master dev
README
Treebank hub page
Download

Ruthenian 111K

The Ruthenian UD treebank includes texts written in the territories of modern Belarus, Lithuania, Ukraine, and Poland in ca. 1300-1700. A sample of legal and nonfiction texts is drawn from the Ruthenian Corpus.

Contributors: Olga Lyashevskaya, Dmitri Sitchinava, Maria Shvedova
Repository master dev
README
Treebank hub page
Download

TOROT 246K

UD_Old_East_Slavic-TOROT is a conversion of a selection of Old East Slavonic and Middle Russian data from the Tromsø Old Russian and OCS Treebank (TOROT), which was originally annotated in PROIEL dependency format.

Contributors: Hanne Eckhoff
Repository master dev
README
Treebank hub page
Download

Birchbark 27K

UD Old_East_Slavic-Birchbark is based on the RNC Corpus of Birchbark Letters and includes documents written in 1025-1500 in an East Slavic vernacular (letters, household and business records, records for church services, spells against diseases, and other short inscriptions). The treebank is manually syntactically annotated in the UD 2.0 scheme; the morphological and lexical annotation is a conversion of the original RNC annotation.

Contributors: Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Old East Slavic treebanks.

Language documentation

See the language documentation page.

Old French 1 237K IE, Romance

Old French treebanks

PROFITEROLE 237K

UD_Old_French-PROFITEROLE is an expansion of the previous UD_Old_French-SRCMF, which was a conversion of (part of) the SRCMF corpus (Syntactic Reference Corpus of Medieval French, [srcmf.org](http://srcmf.org/)).

Contributors: Sophie Prévost, Aurélie Collomb, Kim Gerdes, Isabelle Tellier, Marine Courtin, Alexei Lavrentiev, Céline Guillot-Barbance, Loïc Grobol, Mathilde Regnault
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old Irish 2 <1K IE, Celtic

Old Irish treebanks

DipSGG <1K

A Universal Dependencies treebank for the Old Irish glosses of St. Gall.

Contributors: Adrian Doyle
Repository master dev
README
Treebank hub page
Download

DipWBG <1K

A Universal Dependencies treebank for the Old Irish Würzburg glosses.

Contributors: Adrian Doyle
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Old Irish treebanks.

Language documentation

See the language documentation page.

Old Turkish 1 <1K Turkic, Northeastern

Old Turkish treebanks

Clausal <1K

This repository contains an [Old Turkish](https://iso639-3.sil.org/code/otk) treebank built upon Old Turkic script texts.

Contributors: Mehmet Oguz Derin, Takahiro Harada
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Ottoman Turkish 2 9K Turkic, Southwestern

Ottoman Turkish treebanks

DUDU <1K

An Ottoman Turkish dependency treebank annotated in UD style. Created by Enes Yılandiloğlu.

Contributors: Enes Yılandiloğlu
Repository master dev
README
Treebank hub page
Download

BOUN 8K

An Ottoman Turkish dependency treebank annotated in UD style. Created by [Şaziye Betül Özateş](https://sb-b.github.io/), Tarık Emre Tıraş, Efe Eren Genç from Boğaziçi University, and Esma Fatıma Bilgin Taşdemir from Medeniyet University.

Contributors: Şaziye Betül Özateş, Tarık Emre Tıraş, Efe Eren Genç, Esma Fatıma Bilgin Taşdemir
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ottoman Turkish treebanks.

Language documentation

See the language documentation page.

Pashto 1 <1K IE, Iranian

Pashto treebanks

Sikaram <1K

The Pashto-Sikaram treebank is a native UD treebank with manually annotated texts from various sources.

Contributors: Ján Faryad, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Paumari 1 <1K Arawan

Paumari treebanks

TueCL <1K

This is a small treebank of Paumari, a low-resource Amazonian language.

Contributors: Annika Ott, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Persian 2 654K IE, Iranian

Persian treebanks

PerDT 501K

The Persian Universal Dependency Treebank (PerUDT) is the result of automatic conversion of the Persian Dependency Treebank (PerDT) with extensive manual corrections. Please refer to the following work if you use this data: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, and Alireza Nourian. "The Persian Dependency Treebank Made Universal". 2020 (to appear).

Contributors: Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, Alireza Nourian
Repository master dev
README
Treebank hub page
Download

Seraji 152K

The Persian Universal Dependency Treebank (Seraji) is based on Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to the Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

Contributors: Mojgan Seraji, Filip Ginter, Joakim Nivre, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Persian treebanks.

Language documentation

See the language documentation page.

Pesh 1 2K Chibchan, Pesh

Pesh treebanks

ChibErgIS 2K

A Universal Dependencies corpus for Pesh (aka Paya), a member of the Chibchan language family. The language is spoken by about 500 speakers in Honduras.

Contributors: Natalia Cáceres Arandia, Claudine Chamoreau, Sylvain Kahane, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Phrygian 1 1K IE, Greek

Phrygian treebanks

KUL 1K

UD Phrygian-KUL annotates the New Phrygian subcorpus of the ancient Phrygian language as part of a Master's thesis in linguistics at KU Leuven.

Contributors: Oggi Peeters
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Polish 3 499K IE, Slavic

Polish treebanks

PDB 349K

The Polish PDB-UD treebank is automatically converted from the Polish Dependency Bank 2.0 (PDB 2.0). Both treebanks were created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).

Contributors: Alina Wróblewska, Daniel Zeman, Jan Mašek, Rudolf Rosa
Repository master dev
README
Treebank hub page
Download

LFG 130K

The LFG Enhanced UD treebank of Polish is based on a corpus of LFG (Lexical Functional Grammar) syntactic structures generated by an LFG grammar of Polish, POLFIE, and manually disambiguated by human annotators.

Contributors: Agnieszka Patejuk, Adam Przepiórkowski
Repository master dev
README
Treebank hub page
Download

PUD 18K

This is the Polish portion of the Parallel Universal Dependencies (PUD) treebanks, created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).

Contributors: Alina Wróblewska
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Polish treebanks.

Language documentation

See the language documentation page.

Pomak 1 34K IE, Slavic

Pomak treebanks

Philotis 34K

The Pomak UD treebank is derived from the Pomak Dependency Treebank, a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

Contributors: Ritván Karahóǧa, Vivian Stamou, Stella Markantonatou
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Portuguese 7 1,545K IE, Romance

Portuguese treebanks

PetroGold 250K

UD_Portuguese-PetroGold is a fully revised treebank which consists of academic texts from the oil & gas domain in Brazilian Portuguese.

Contributors: Elvis de Souza, Cláudia Freitas, Aline Silveira, Tatiana Cavalcanti, Maria Clara Castro, Wograine Evelyn
Repository master dev
README
Treebank hub page
Download

Porttinari 168K

Porttinari-base [(Duran et al., 2023)](https://sol.sbc.org.br/index.php/stil/article/view/25443/25264) is the journalistic portion of Porttinari (which stands for “PORTuguese Treebank”), which shall be a large multigenre treebank for Portuguese [(Pardo et al., 2021)](https://sol.sbc.org.br/index.php/stil/article/view/17778/17612), following the "Universal Dependencies" international grammar framework [(de Marneffe et al., 2021)](https://aclanthology.org/2021.cl-2.11/).

Contributors: Magali Sanches Duran, Lucelene Lopes, Maria das Graças Volpe Nunes, Thiago Alexandre Salgueiro Pardo
Repository master dev
README
Treebank hub page
Download

Bosque 227K

This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank. It contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.

Contributors: Alexandre Rademaker, Cláudia Freitas, Elvis de Souza, Aline Silveira, Tatiana Cavalcanti, Wograine Evelyn, Luisa Rocha, Isabela Soares-Bastos, Eckhard Bick, Fabricio Chalub, Guilherme Paulino-Passos, Livy Real, Valeria de Paiva, Daniel Zeman, Martin Popel, David Mareček, Natalia Silveira, André Martins
Repository master dev
README
Treebank hub page
Download

CINTIL 475K

CINTIL-UDep is a dependency bank of Portuguese that is treebanked with Universal Dependencies. It contains over 38K annotated sentences (and 476K tokens), of mostly newspaper text.

Contributors: Mariana Avelãs, António Branco, Marisa Campos, Catarina Carvalheiro, Rita Carvalho, Sérgio Castro, Francisco Costa, Cláudia Martins, Rita Pereira, Sílvia Pereira, Clara Pinto, Andreia Querido, Joana Ramos, João Silva, Sara Silveira
Repository master dev
README
Treebank hub page
Download

DANTEStocks 80K

DANTEStocks (Di Felippo et al., 2024) is a collection of Brazilian Portuguese tweets on the stock market domain that is part of Porttinari (“PORTuguese Treebank”), which shall be a large multigenre treebank for Portuguese (Pardo et al., 2021), following the "Universal Dependencies" framework (de Marneffe et al., 2021).

Contributors: Ariani Di Felippo, Norton Trevisan Roman, Thiago Alexandre Salgueiro Pardo, Bryan Khelven da Silva Barbosa, Maria das Graças Volpe Nunes
Repository master dev
README
Treebank hub page
Download

GSD 318K

The Brazilian Portuguese UD is converted from the [Google Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Alexandre Rademaker, Ryan McDonald, Joakim Nivre, Daniel Zeman, Fabricio Chalub, Carlos Ramisch, Juan Belieni, Vanessa Berwanger Wille, Rodrigo Pintucci
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Gustavo Mendonça, Larissa Rinaldi, Martin Popel, Daniel Zeman, Valeria de Paiva, Alexandre Rademaker, Elvis de Souza
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Portuguese treebanks.

Language documentation

See the language documentation page.

Romanian 5 941K IE, Romance

Romanian treebanks

RRT 218K

The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.

Contributors: Verginica Barbu Mititelu, Elena Irimia, Cenel-Augusto Perez, Radu Ion, Radu Simionescu, Martin Popel
Repository master dev
README
Treebank hub page
Download

SiMoNERo 146K

SiMoNERo is a medical corpus of contemporary Romanian.

Contributors: Maria Mitrofan, Verginica Barbu Mititelu
Repository master dev
README
Treebank hub page
Download

TueCL 4K

The Romanian Social Media Sexist Language UD Treebank is a reference treebank in Universal Dependencies (UD) format for Romanian sexist language. Currently small, it comprises a subset of tweets sourced from [CoRoSeOf](https://github.com/DianaHoefels/CoRoSeOf).

Contributors: Diana Hoefels, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

ArT <1K

The UD treebank ArT is a treebank of the Aromanian dialect of the Romanian language in UD format.

Contributors: Verginica Barbu Mititelu, Mihaela Cristescu, Manuela Nevaci
Repository master dev
README
Treebank hub page
Download

Nonstandard 572K

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (ISLRN 156-635-615-024-0).

Contributors: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung, Valentin Roșca, Roman Untilov, Petru Rebeja
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Romanian treebanks.

Language documentation

See the language documentation page.

Russian 5 1,896K IE, Slavic

Russian treebanks

Taiga 197K

This Universal Dependencies treebank is based on data samples extracted from the Taiga Corpus and from the MorphoRuEval-2017 and GramEval-2020 shared task collections.

Contributors: Olga Lyashevskaya, Olga Rudina, Natalia Vlasova, Anna Zhuravleva
Repository master dev
README
Treebank hub page
Download

Poetry 64K

UD_Russian-Poetry contains samples of Russian poetry written from the 19th to the early 21st century. The treebank is based on the Poetry Corpus of the Russian National Corpus.

Contributors: Olga Lyashevskaya, Natalia Vlasova, Dmitri Sitchinava
Repository master dev
README
Treebank hub page
Download

SynTagRus 1,517K

Russian data from the SynTagRus corpus.

Contributors: Kira Droganova, Olga Lyashevskaya, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

GSD 97K

Russian Universal Dependencies Treebank annotated and converted by Google.

Contributors: Ryan McDonald, Vitaly Nikolaev, Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

PUD 19K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Tatiana Lando, Olga Loginova, Martin Popel, Daniel Zeman, Kira Droganova
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Russian treebanks.

Language documentation

See the language documentation page.

Sanskrit 2 208K IE, Indic

Sanskrit treebanks

UFAL 1K

A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.

Contributors: Puneet Dwivedi, Daniel Zeman, Erica Biagetti
Repository master dev
README
Treebank hub page
Download

Vedic 206K

The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB). Lexical and morpho-syntactic information has been generated using tagging software and manually validated. POS tags have been induced automatically from the morpho-syntactic information of each word.

Contributors: Salvatore Scarlata, Elia Ackermann, Oliver Hellwig, Erica Biagetti, Paul Widmer, Sven Sellmer
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Sanskrit treebanks.

Language documentation

See the language documentation page.

Scottish Gaelic 1 89K IE, Celtic

Scottish Gaelic treebanks

ARCOSG 89K

A treebank of Scottish Gaelic based on the [Annotated Reference Corpus Of Scottish Gaelic (ARCOSG)](https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG).

Contributors: Colin Batchelor
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Serbian 1 97K IE, Slavic

Serbian treebanks

SET 97K

The Serbian UD treebank is based on the [SETimes-SR](http://hdl.handle.net/11356/1200) corpus and additional news documents from the Serbian web.

Contributors: Tanja Samardžić, Nikola Ljubešić
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Sinhala 1 <1K IE, Indic

Sinhala treebanks

STB <1K

This treebank consists of contemporary written Sinhala text taken from a 10M corpus maintained by UCSC, Sri Lanka. The corpus contains novels, short stories, Sinhala translations, critiques and Sinhala newspapers.

Contributors: Liyanage Chamila, Sarveswaran Kengatharaiyer
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Skolt Sami 1 2K Uralic, Sami

Skolt Sami treebanks

Giellagas 2K

The UD Skolt Sami Giellagas treebank is based almost entirely on spoken Skolt Sami corpora.

Contributors: Jack Rueter, Markus Juutinen, Francis Tyers, Tommi A Pirinen, Mika Hämäläinen
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Slovak 1 106K IE, Slavic

Slovak treebanks

SNK 106K

The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.

Contributors: Katarína Gajdošová, Mária Šimková, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Slovenian 2 365K IE, Slavic

Slovenian treebanks

SSJ 267K

The SSJ treebank is the reference UD treebank for Slovenian, consisting of approximately 13,000 sentences and 267,097 tokens from fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. As of UD release 2.10 in May 2022, the original version of the SSJ UD treebank has been partially manually revised and extended with new manually annotated data.

Contributors: Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek
Repository master dev
README
Treebank hub page
Download

SST 98K

The Spoken Slovenian Treebank (SST) is a manually annotated collection of transcribed audio recordings featuring spontaneous speech in various everyday situations. It includes 344 unique speech events (documents) amounting to approximately 10 hours of speech, encompassing a total of 6,108 utterances and 98,396 tokens.

Contributors: Kaja Dobrovoljc, Joakim Nivre
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Slovenian treebanks.

Language documentation

See the language documentation page.

Soi 1 <1K IE, Iranian

Soi treebanks

AHA <1K

The AHA Soi Treebank is a small treebank for contemporary Soi. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Soi speakers.

Contributors: AmirHossein Mojiri Foroushani, Hamid Aghaei, Amir Ahmadi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

South Levantine Arabic 1 <1K Afro-Asiatic, Semitic

South Levantine Arabic treebanks

MADAR <1K

The South_Levantine_Arabic-MADAR treebank consists of 100 manually-annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project. TO-DO: Add 20 annotated sentences from CCC as a train set.

Contributors: Shorouq Zahra
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Spanish 4 1,031K IE, Romance

Spanish treebanks

AnCora 568K

Spanish data from the [AnCora](http://clic.ub.edu/corpus/) corpus.

Contributors: Héctor Martínez Alonso, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

GSD 431K

The Spanish UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Miguel Ballesteros, Héctor Martínez Alonso, Ryan McDonald, Elena Pascual, Natalia Silveira, Daniel Zeman, Joakim Nivre
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Hector Fernandez Alcalde, Laura Moreno Romero, Martin Popel, Daniel Zeman, Héctor Martínez Alonso
Repository master dev
README
Treebank hub page
Download

COSER 8K

The COSER UD Treebank (COSER-UD) is the first syntactically annotated corpus of spoken Spanish, based on a sample of the "Corpus Oral y Sonoro del Español Rural" (COSER; Fernández-Ordóñez 2005-present), meaning the "Audible Corpus of Spoken Rural Spanish".

Contributors: Johnatan Bonilla
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Spanish Sign Language 1 1K Sign Language

Spanish Sign Language treebanks

LSE 1K

The Universal Dependency treebank for Spanish Sign Language (Lengua de Signos Española [LSE], ISO 639-3: ssp) was developed by the GRADES group at the University of Vigo.

Contributors: José María García-Miguel, Carmen Cabeza
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Swedish 3 206K IE, Germanic

Swedish treebanks

Talbanken 96K

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

Contributors: Joakim Nivre, Aaron Smith, Victor Norrman
Repository master dev
README
Treebank hub page
Download

LinES 90K

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

Contributors: Lars Ahrenberg
Repository master dev
README
Treebank hub page
Download

PUD 19K

Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.

Contributors: Joakim Nivre, Bernadeta Griciūtė, Victor Norrman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.

Swedish Sign Language 1 1K Sign Language

Swedish Sign Language treebanks

SSLC 1K

The Universal Dependencies treebank for Swedish Sign Language (ISO 639-3: swl) is derived from the Swedish Sign Language Corpus (SSLC) from the department of linguistics, Stockholm University.

Contributors: Moa Gärdenfors, Carl Börstell, Robert Östling, Lars Wallin, Mats Wirén
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Swiss German 1 1K IE, Germanic

Swiss German treebanks

UZH 1K

UD_Swiss_German-UZH is a tiny manually annotated treebank of 100 sentences in different Swiss German dialects and a variety of text genres.

Contributors: Noëmi Aepli
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Tagalog 2 1K Austronesian, Central Philippine

Tagalog treebanks

TRG <1K

UD_Tagalog-TRG is a UD treebank manually annotated using sentences from a grammar book.

Contributors: Stephanie Samson, Daniel Zeman, Mary Ann C. Tan
Repository master dev
README
Treebank hub page
Download

Ugnayan 1K

Ugnayan is a manually annotated Tagalog treebank currently composed of educational fiction and nonfiction text. The treebank is under development at the University of the Philippines.

Contributors: Angelina Aquino
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Tagalog treebanks.

Language documentation

See the language documentation page.

Tamil 2 12K Dravidian

Tamil treebanks

TTB 9K

The UD Tamil treebank is based on the Tamil Dependency Treebank created at the Charles University in Prague by Loganathan Ramasamy.

Contributors: Loganathan Ramasamy, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

MWTT 2K

MWTT (Modern Written Tamil Treebank) contains sentences taken primarily from "A Grammar of Modern Tamil" by Thomas Lehmann (1993). This initial release has 536 sentences of various lengths, all of which are included in the test set.

Contributors: Sarveswaran Kengatharaiyer, Parameswari Krishnamurthy, Keerthana Balasubramani
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Tamil treebanks.

Language documentation

See the language documentation page.

Tatar 1 2K Turkic, Northwestern

Tatar treebanks

NMCTT 2K

UD Tatar-NMCTT is a manually annotated corpus of the Tatar language based on the text from Tatar-Inform (tatar-inform.tatar), an online news website.

Contributors: Chihiro Taguchi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Teko 1 2K Tupian, Maweti-Guarani

Teko treebanks

TuDeT 2K

UD_Teko-TuDeT is a collection of annotated sentences in [Tekó (Emérillon)](https://glottolog.org/resource/languoid/id/emer1243). The sentences stem from the only grammatical description of the language (Rose, 2011). Sentence annotation and documentation by Uliana Vedenina and Fabrício Ferraz Gerardi.

Contributors: Uliana Vedenina, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Telugu 1 6K Dravidian

Telugu treebanks

MTG 6K

The Telugu UD treebank is based on manual UD annotations of sentences from a grammar book.

Contributors: Taraka Rama, Sowmya Vajjala
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Telugu English 1 <1K Code switching

Telugu English treebanks

TECT <1K

UD Telugu_English-TECT is a Telugu-English code-switching treebank.

Contributors: Anishka Vissamsetty
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Thai 1 22K Tai-Kadai

Thai treebanks

PUD 22K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Rattima Nitisaroj, Yanin Sawanakunanon, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Tswana 1 <1K Niger-Congo, Bantoid

Tswana treebanks

Popapolelo <1K

UD Tswana-Popapolelo is a translation of the 20 Cairo CICLing sentences (https://github.com/UniversalDependencies/cairo) annotated with XPOS, UPOS and dependency relations.

Contributors: Ansu Berg, Roald Eiselen, Tanja Gaustad, Rigardt Pretorius
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Tupinamba 1 4K Tupian, Maweti-Guarani

Tupinamba treebanks

TuDeT 4K

UD_Tupinamba-TuDeT is a collection of annotated sentences in [Tupinambá](https://glottolog.org/resource/languoid/id/tupi1273). All known sources in this language are being annotated: catechisms, letters, poems, theater plays, and grammars (sixteenth and seventeenth centuries). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Turkish 9 735K Turkic, Southwestern

Turkish treebanks

Kenet 178K

Turkish-Kenet UD Treebank is the biggest treebank of Turkish. It consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples.

Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız, Oğuzhan Kuyrukçu, Arife Betül Yenice, Bilge Nas Arıcan, Ezgi Sanıyar
Repository master dev
README
Treebank hub page
Download

Penn 183K

Turkish version of the Penn Treebank. It consists of a total of 9,560 manually annotated sentences and 87,367 tokens. (It only includes sentences up to 15 words long.)

Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Neslihan Kara, Bilge Nas Arıcan, Merve Özçelik, Deniz Baran Aslan
Repository master dev
README
Treebank hub page
Download

Tourism 91K

Turkish Tourism is a domain-specific treebank consisting of 19,750 manually annotated sentences and 92,200 tokens. These sentences were taken from the original customer reviews of a tourism company.

Contributors: Aslı Kuzgun, Neslihan Cesur, Olcay Taner Yıldız, Oğuzhan Kuyrukçu, Büşra Marşan, Bilge Nas Arıcan, Neslihan Kara, Deniz Baran Aslan, Ezgi Sanıyar, Cengiz Asmazoğlu
Repository master dev
README
Treebank hub page
Download

Atis 45K

This treebank is a translation of the English ATIS (Airline Travel Information System) corpus (see References). It consists of 5,432 sentences.

Contributors: Mehmet Köse, Olcay Taner Yıldız
Repository master dev
README
Treebank hub page
Download

GB 17K

This is a treebank annotating example sentences from a comprehensive grammar book of Turkish.

Contributors: Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

FrameNet 19K

Turkish FrameNet consists of 2,700 manually annotated example sentences and 19,221 tokens. Its data consists of the sentences taken from the Turkish FrameNet Project. The annotated sentences can be filtered according to the semantic frame category of the root of the sentence.

Contributors: Neslihan Cesur, Aslı Kuzgun, Olcay Taner Yıldız, Büşra Marşan, Oğuzhan Kuyrukçu, Bilge Nas Arıcan, Ezgi Sanıyar, Neslihan Kara, Merve Özçelik
Repository master dev
README
Treebank hub page
Download

BOUN 125K

A Turkish dependency treebank annotated in UD style. Created by the members of [TABILAB](https://tabilab.cmpe.boun.edu.tr/) from Boğaziçi University.

Contributors: Büşra Marşan, Salih Furkan Akkurt, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Gözde Berk, Seyyit Talha Bedir, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
Repository master dev
README
Treebank hub page
Download

IMST 58K

The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak & Eryiğit, 2018; Sulubacak et al., 2016).

Contributors: Utku Türk, Şaziye Betül Özateş, Büşra Marşan, Salih Furkan Akkurt, Çağrı Çöltekin, Gülşen Cebiroğlu Eryiğit, Memduh Gökırmak, Hüner Kaşıkara, Umut Sulubacak, Francis Tyers
Repository master dev
README
Treebank hub page
Download

PUD 16K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Savas Cetin, Martin Popel, Daniel Zeman, Francis Tyers, Çağrı Çöltekin, Utku Türk, Furkan Atmaca, Şaziye Betül Özateş, Abdullatif Köksal, Balkız Öztürk Başaran, Tunga Güngör, Arzucan Özgür
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Turkish treebanks.

Language documentation

See the language documentation page.

Turkish German 1 37K Code switching

Turkish German treebanks

SAGT 37K

UD Turkish-German SAGT is a Turkish-German code-switching treebank that is developed as part of the [SAGT](https://www.ims.uni-stuttgart.de/en/research/projects/sagt/) project.

Contributors: Özlem Çetinoğlu, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Ukrainian 2 174K IE, Slavic

Ukrainian treebanks

IU 122K

Gold standard Universal Dependencies corpus for Ukrainian, developed for UD originally, by [Institute for Ukrainian](https://mova.institute), NGO. [[українською](https://mova.institute/золотий_стандарт)]

Contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko
Repository master dev
README
Treebank hub page
Download

ParlaMint 51K

UD_Ukrainian-ParlaMint is a collection of Ukrainian parliamentary transcripts annotated in Universal Dependencies. The texts are published on the official website of the Ukrainian parliament (https://www.rada.gov.ua/documents/Stenbul_pz/) and are taken for UD_Ukrainian-ParlaMint from the Ukrainian section of the ParlaMint project (https://www.clarin.eu/parlamint).

Contributors: Maria Shvedova, Arsenii Lukashevskyi
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ukrainian treebanks.

Language documentation

See the language documentation page.

Umbrian 1 <1K IE, Italic

Umbrian treebanks

IKUVINA <1K

UD_Umbrian-IKUVINA is a dependency treebank rendering of the Iguvine tablets ([Wikipedia](https://en.wikipedia.org/wiki/Iguvine_Tablets)). The seven bronze tablets describe religious ceremonies performed by the Umbrian people in Italy before the rise of the Roman empire. The corpus will eventually contain all the tablets, but as of May 2022 only tablet I has been released, with partial morphological analysis and partial lemmatisation (POS tagging and dependency trees are complete).

Contributors: Mathieu Dehouck
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Upper Sorbian 1 11K IE, Slavic

Upper Sorbian treebanks

UFAL 11K

A small treebank of Upper Sorbian based mostly on Wikipedia.

Contributors: Daniel Zeman, Anna Nedoluzhko
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Urdu 1 138K IE, Indic

Urdu treebanks

UDTB 138K

The Urdu Universal Dependency Treebank was automatically converted from the Urdu Dependency Treebank (UDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu.

Contributors: Riyaz Ahmad Bhat, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Uyghur 1 40K Turkic, Southeastern

Uyghur treebanks

UDT 40K

The Uyghur UD treebank is based on the Uyghur Dependency Treebank (UDT), created at the Xinjiang University in Ürümqi, China.

Contributors: Marhaba Eli, Daniel Zeman, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Uzbek 1 5K Turkic, Southeastern

Uzbek treebanks

UT 5K

This is the first Uzbek UD treebank.

Contributors: Arofat Akhundjanova
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Veps 1 1K Uralic, Finnic

Veps treebanks

VWT 1K

UD Veps-VWT is a manually annotated corpus of Veps in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts written in the Central Veps dialect.

Contributors: Käbi Laan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Vietnamese 2 59K Austro-Asiatic, Viet-Muong

Vietnamese treebanks

VTB 58K

The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).

Contributors: Lương Nguyễn Thị, Linh Hà Mỹ, Phương Lê Hồng, Huyền Nguyễn Thị Minh
Repository master dev
README
Treebank hub page
Download

TueCL 1K

This treebank includes a set of sentences from [OPUS](https://opus.nlpl.eu/), sourced from subtitles, talks, and educational videos.

Contributors: Hoa Do, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Vietnamese treebanks.

Language documentation

See the language documentation page.

Warlpiri 1 <1K Pama-Nyungan, Western

Warlpiri treebanks

UFAL <1K

A small treebank of grammatical examples in Warlpiri, taken from linguistic literature.

Contributors: Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Welsh 1 52K IE, Celtic

Welsh treebanks

CCG 52K

UD Welsh-CCG (Corpws Cystrawennol y Gymraeg) is a treebank of Welsh, annotated according to the Universal Dependencies guidelines.

Contributors: Johannes Heinecke, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Western Armenian 1 122K IE, Armenian

Western Armenian treebanks

ArmTDP 122K

A Universal Dependencies treebank for Western Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Western S.P. Nahuatl 1 10K Uto-Aztecan

Western Sierra Puebla Nahuatl treebanks

ITML 10K

UD Western Sierra Puebla Nahuatl-IU is a treebank consisting of sentences from written fiction and non-fiction, spontaneous speech, and grammar examples.

Contributors: Robert Pugh, Marivel Huerta Mendez, Mitsuya Sasaki, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Wolof 1 44K Niger-Congo, Northern Atlantic

Wolof treebanks

WTB 44K

UD_Wolof-WTB is a natively, manually developed treebank for Wolof. Sentences were collected from encyclopedic, fictional, biographical, and religious texts and from news.

Contributors: Bamba Dione
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Xavante 1 1K Macro-Je

Xavante treebanks

XDT 1K

UD_Xavante-XDT is a collection of annotated sentences in [Xavante](https://glottolog.org/resource/languoid/id/xava1240). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](http://languagestructure.github.io/) and Ivan Roksandic.

Contributors: Ivan Roksandic, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Xibe 1 15K Tungusic

Xibe treebanks

XDT 15K

The UD Xibe Treebank is a corpus of the Xibe language (ISO 639-3: *sjo*) containing manually annotated syntactic trees in the Universal Dependencies scheme. Sentences come from three sources: grammar book examples, a newspaper (Cabcal News), and Xibe textbooks.

Contributors: He Zhou, Juyeon Chung, Elena Klyachko, Francis Tyers, Sandra Kübler
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yakut 1 1K Turkic, Northeastern

Yakut treebanks

YKTDT 1K

UD_Yakut-YKTDT is a collection of Yakut ([Sakha](https://glottolog.org/resource/languoid/id/yaku1245)) sentences. The project is a work in progress and the treebank is updated on a regular basis.

Contributors: Tatiana Merzhevich, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yoruba 1 8K Niger-Congo, Defoid

Yoruba treebanks

YTB 8K

Parts of the Yoruba Bible and of the Yoruba edition of Wikipedia, hand-annotated natively in Universal Dependencies.

Contributors: Adédayọ̀ Olúòkun, Daniel Zeman, Seyi Williams, Ọlájídé Ishola
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Yupik 1 2K Eskimo-Aleut

Yupik treebanks

SLI 2K

UD_Yupik-SLI is a treebank of St. Lawrence Island Yupik (ISO 639-3: ess) that has been manually annotated at the morpheme level, based on a finite-state morphological analyzer by [Chen et al., 2020](https://www.aclweb.org/anthology/2020.lrec-1.326). The word-level annotation, merging multiword expressions, is provided in not-to-release/ess_sli-ud-test.merged.conllu. More information about the treebank can be found in our publication (AmericasNLP, 2021).

Contributors: Hyunji Hayley Park, Lane Schwartz, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Zaar 1 17K Afro-Asiatic, West Chadic

Zaar treebanks

Autogramm 17K

A Universal Dependencies corpus for Zaar (aka Sayanci), a member of the Chadic branch of the Afro-Asiatic phylum. The language is mainly spoken by about 200,000 speakers in the Bogoro and Tafawa Balewa local governments of Bauchi State, Nigeria.

Contributors: Sylvain Kahane, Bruno Guillaume, Bernard Caron, Katharine Jiang
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Possible Future Extensions

People have expressed interest in providing annotated data for the following languages but no valid data has been provided so far.

Akkadian 1 117K Afro-Asiatic, Semitic

Akkadian treebanks

MCONG 117K

UD_Akkadian-MCONG is a treebank of normalized Akkadian sentences drawn mostly from Neo-Assyrian corpora lemmatized on [Oracc](http://oracc.museum.upenn.edu/). Sentences are annotated for lemma, syntactic dependencies, and morphological features. The treebank contains approximately 112,000 words.

Contributors: Matthew Ong
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Akkadian treebanks.

Language documentation

See the language documentation page.

Amharic 2 - Afro-Asiatic, Semitic

Amharic treebanks

ADT -

Contributors: Dawit J. Tilahun
Repository master dev
README
Treebank hub page
Download

Inku -

Contributors: Josiah Solomon
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Archaic Irish 1 - IE, Celtic

Archaic Irish treebanks

OGAM -

Contributors: Adrian Ó Dubhghaill
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Assamese 1 - IE, Indic

Assamese treebanks

AsTB -

Contributors: Shikhar Sarma
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Balatipone 1 - Bororoan

Balatipone treebanks

BDT -

Contributors: Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Bengali 3 - IE, Indic

Bengali treebanks

CMUPAN -

Contributors: Aditi Chaudhary
Repository master dev
README
Treebank hub page
Download

PUD -

This is a part of the Parallel Universal Dependencies (PUD) treebanks originally created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

Contributors: Pritha Majumdar, Deepak Alok, Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Sabdakosh -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Andrew Thomas Dyer, Riffat Sharmin
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bhojpuri 1 - IE, Indic

Bhojpuri treebanks

BhEn -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Classical Nahuatl 1 - Uto-Aztecan

Classical Nahuatl treebanks

FloCo -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Robert Pugh, Marivel Huerta Mendez, Mitsuya Sasaki, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Cuicatec 1 - Oto-Manguean

Cuicatec treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Cusco Quechua 1 - Quechuan

Cusco Quechua treebanks

Squoia -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Annette Rios, Francis Tyers, Trey Jagiella, Josephine Douglas
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Czech 1 1,191K IE, Slavic

Czech treebanks

PCEDT 1,191K

The Czech-PCEDT UD treebank is based on the Prague Czech-English Dependency Treebank 2.0 (PCEDT), created at the Charles University in Prague.

Contributors: Anna Nedoluzhko, Michal Novák, Silvie Cinková, Marie Mikulová, Jiří Mírovský, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Danish 1 - IE, Germanic

Danish treebanks

DTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Natalie Schluter
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Dargwa 1 - Nakh-Daghestanian, Lak-Dargwa

Dargwa treebanks

Mehweb -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sasha Kozhukhar, Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

English 1 - IE, Germanic

English treebanks

BhEn -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

French 1 - IE, Romance

French treebanks

CrapBank -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Djamé Seddah
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Frisian 1 - IE, Germanic

Frisian treebanks

Frysk -

The UD Frisian-FA-RuG treebank is a West Frisian treebank.

Contributors: Wilbert Heeringa, Gosse Bouma, Hans Van de Velde
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Gedeo 1 <1K Afro-Asiatic, Cushitic

Gedeo treebanks

GDT <1K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Dawit J. Tilahun
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Georgian 1 1K Kartvelian

Georgian treebanks

GEOWIKI 1K

**UD_Georgian-GEOWIKI** is a general corpus derived from randomly selected texts from Georgian Wikipedia. It contains 385 sentences that were automatically tokenized and tagged using the TreeTagger tool, available at [TreeTagger for Georgian](https://github.com/SophikoComp/TreeTagger-for-Georgian). The tagged data was then semi-automatically converted to the Universal Dependencies (UD) format, followed by manual corrections to ensure accuracy.

Contributors: Mate Didebashvili
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Greek 3 - IE, Greek

Greek treebanks

Cretan -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Stella Markantonatou, Socrates Vak
Repository master dev
README
Treebank hub page
Download

Lesbian -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Stavros Bompolas, Stella Markantonatou, Antonios Anastasopoulos, Vivian Stamou
Repository master dev
README
Treebank hub page
Download

Messinian -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Stella Markantonatou, Katerina Mouzou
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Greek treebanks.

Language documentation

See the language documentation page.

Hiligaynon 1 <1K Austronesian, Central Philippine

Hiligaynon treebanks

HTB <1K

UD Hiligaynon-HTB is a UD treebank containing manually annotated sentences from the [PALI Language Texts](https://www.hawaiiopen.org/bookseries/pali-language-texts-philippines/) grammar books, made available by the University of Hawaii Press.

Contributors: Mary Ann C. Tan
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hindi 1 4K IE, Indic

Hindi treebanks

Convers 4K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Riyaz Ahmad Bhat
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hindi treebanks.

Language documentation

See the language documentation page.

Huave 1 - Huavean

Huave treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Italian 1 - IE, Romance

Italian treebanks

KIPARLA -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Ludovica Pannitto
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Italian treebanks.

Language documentation

See the language documentation page.

Japanese 2 - Japanese

Japanese treebanks

JDD -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Reina Akama, Mai Omura, Masayuki Asahara
Repository master dev
README
Treebank hub page
Download

JDDLUW -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Reina Akama, Mai Omura, Masayuki Asahara
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Kabyle 1 23K Afro-Asiatic, Berber

Kabyle treebanks

ADPT 23K

UD_Kabyle-ADPT (Association pour le Développement et la Promotion de Tamazight) is a treebank of Berber (Kabyle variant), annotated according to the Universal Dependencies guidelines.

Contributors: Lakhdar Aliane
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kadiweu 1 - Guaicuruan

Kadiweu treebanks

Unicamp -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Filomena Sandalo
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kannada 1 - Dravidian

Kannada treebanks

CMUPAN -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Aditi Chaudhary
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Khoekhoe 1 - Khoe-Kwadi

Khoekhoe treebanks

KDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Michael Hahn, Levi Namaseb
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kiga 1 - Niger-Congo, Bantoid

Kiga treebanks

EKigaTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: David Bamutura
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Komi 1 <1K Uralic, Permic

Komi treebanks

OldPermic <1K

This is a Universal Dependencies treebank of Old Permic. The treebank is currently in progress and will be published in full in the next Universal Dependencies release (v2.14), which is scheduled for May 15, 2024 (data freeze on May 1).

Contributors: Niko Partanen, Jack Rueter, Rogier Blokland
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Korean 2 - Korean

Korean treebanks

Penn -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Jinho Choi, Narae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

Sejong -

Please add a summary section to the treebank readme file

Contributors: Jaemin Cho
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Korean treebanks.

Language documentation

See the language documentation page.

Ladino 1 - IE, Romance

Ladino treebanks

BOUN -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Utku Türk
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Laz 1 2K Kartvelian

Laz treebanks

BOUN 2K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Utku Türk, Kaan Bayar, Ayşegül Dilara Özercan, Görkem Yiğit Öztürk, Betül Bilgin
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Magahi 2 7K IE, Indic

Magahi treebanks

MGTB 7K

The [Magahi](https://en.wikipedia.org/wiki/Magahi_language) UD Treebank (MGTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.

Contributors: Mohit Raj, Deepak Alok, Ritesh Kumar, Atul Kr. Ojha, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD -

Contributors: Deepak Alok, Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Mandyali 1 <1K IE, Indic

Mandyali treebanks

MDTB <1K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Shweta Chauhan
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Mansi 1 - Uralic, Ugric

Mansi treebanks

CoWS -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Csilla Horváth
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Marathi 1 205K IE, Indic

Marathi treebanks

CMUPAN 205K

UD_Marathi-CMUPAN is a semi-automatically created treebank based on the treebanks released by KCIS, IIIT-Hyderabad (https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/).

Contributors: Aditi Chaudhary
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Middle Irish 2 <1K IE, Celtic

Middle Irish treebanks

CritMITB <1K

Annotation of the classic Scela Mucce Meic Dathó ("The tale of Mac Dathó's pig").

Contributors: Ben Rozonoyer, Erik Andersen
Repository master dev
README
Treebank hub page
Download

DipMITB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Adrian Ó Dubhghaill
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Mongolian 1 - Mongolic

Mongolian treebanks

MTLR -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Siqin Bai
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ndengeleko 1 - Niger-Congo, Bantoid

Ndengeleko treebanks

NTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Mariel Aquino
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Nenets 1 - Uralic, Samoyedic

Nenets treebanks

Tundra -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Nikolett Mus
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Nepali 1 - IE, Indic

Nepali treebanks

NTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Kiran Dhakal
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Nkore 1 - Niger-Congo, Bantoid

Nkore treebanks

ENkoreTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: David Bamutura
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Occitan 2 - IE, Romance

Occitan treebanks

OCOR -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Maarten Janssen
Repository master dev
README
Treebank hub page
Download

TTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Aleksandra Haddad
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Odia 1 <1K IE, Indic

Odia treebanks

ODTB <1K

The Odia UD Treebank (ODTB) is a part of the Universal Dependency treebank project.

Contributors: Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Satya Ranjan Dash, Bijayalaxmi Dash
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Old English 2 - IE, Germanic

Old English treebanks

OEDT -

This is a 25,000-word UD treebank of Old English. The text has been retrieved from Martín Arista, Javier (ed.), et al. 2023. ParCorOEv3 [www.nerthusproject.com]. The treebank is a revised version of the dataset of Domínguez Barragán, S. 2024. Universal Dependencies of Old English. PhD Dissertation, University of La Rioja.

Contributors: Javier Martín Arista, Dario Metola
Repository master dev
README
Treebank hub page
Download

TueCL -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Fanyi Meng, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Old Japanese 1 3K Japanese

Old Japanese treebanks

LMJ 3K

UD_Old_Japanese-LMJ is a collection of annotated texts in Late Middle Japanese, starting with Book 9 from the celebrated gunki monogatari (war tale) *Heike Monogatari*.

Contributors: Stanislav Reichert, Fabrício Ferraz Gerardi
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Palenquero 1 - Creole

Palenquero treebanks

COL -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Daniel Casas
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Papiamento 2 - Creole

Papiamento treebanks

AW -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Urso Wieske
Repository master dev
README
Treebank hub page
Download

CW -

If you can read this sentence, then we are still working on our first release.

Contributors: Urso Wieske
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Pashto 1 - IE, Iranian

Pashto treebanks

Zuhra -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Fatima Tuz Zuhra
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Persian 1 - IE, Iranian

Persian treebanks

IPerUDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Roya Kabiri
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Persian treebanks.

Language documentation

See the language documentation page.

Pnar 1 - Austro-Asiatic, Khasian

Pnar treebanks

PTB -

UD Pnar-PTB is a conversion from the Ring (2017) dataset ([doi:10.21979/N9/KVFGBZ](http://dx.doi.org/10.21979/N9/KVFGBZ)) that underpins a grammatical description of the Pnar language (Ring 2015, [http://hdl.handle.net/10356/62519](http://hdl.handle.net/10356/62519)). The corpus consists of folktales and interviews transcribed, translated, and interlinearized.

Contributors: Hiram Ring
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Pontic 1 - IE, Greek

Pontic treebanks

BOUN -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Utku Türk
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Portuguese 1 - IE, Romance

Portuguese treebanks

DHBB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Alexandre Rademaker
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Portuguese treebanks.

Language documentation

See the language documentation page.

Prakrit 1 <1K IE, Indic

Prakrit treebanks

DIPI <1K

**UD Prakrit-DIPI** (*Digitising Imperial Prakrit Inscriptions*) is a UD-annotated corpus of the Ashokan Prakrit inscriptions and edicts (parallel texts written in various dialects) representing an early stage of Middle Indo-Aryan. This corpus aims to facilitate comparative work on the Ashokan dialects with the help of new computational methods.

Contributors: Aryaman Arora, Adam Farris
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Punjabi 1 6K IE, Indic

Punjabi treebanks

PunTB 6K

**PunTB** (a very imaginative acronym for **Pun**jabi **T**ree**b**ank) is an in-progress treebank of Punjabi in the Gurmukhi script, aiming to cover a wide range of genres and formats.

Contributors: Aryaman Arora
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Puno Quechua 1 - Quechuan

Puno Quechua treebanks

UIBK -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Elwin Huaman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Romanian 1 - IE, Romance

Romanian treebanks

Moldovan -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Olesea Caftanatov
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Romanian treebanks.

Language documentation

See the language documentation page.

Romansh 2 - IE, Romance

Romansh treebanks

Rumgr -

Please add a summary section to the treebank readme file

Contributors: Sascha Brawer, Martin Cantieni
Repository master dev
README
Treebank hub page
Download

Sursilv -

Please add a summary section to the treebank readme file

Contributors: Sascha Brawer, Martin Cantieni
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Seri 1 - Hokan, Seri

Seri treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Shipibo Konibo 1 - Pano-Tacanan

Shipibo Konibo treebanks

PUCP -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Ronald Ahmed Cárdenas Acosta
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sindhi 2 6K IE, Indic

Sindhi treebanks

Isra -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Mutee-u Rahman
Repository master dev
README
Treebank hub page
Download

MazharDootio 6K

The Sindhi Universal Dependency Treebank was automatically converted from the Sindhi Dependency Treebank (SDTB), which is part of an ongoing effort to create multi-layered treebanks for Sindhi.

Contributors: Mazhar Dootio
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sinhala 1 - IE, Indic

Sinhala treebanks

IITS -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Aryaman Arora, Adam Farris
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Somali 1 - Afro-Asiatic, Cushitic

Somali treebanks

STB -

Please add a summary section to the treebank readme file

Contributors: Morgan Nilsson
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sorani 1 - IE, Iranian

Sorani treebanks

MG -

Please add a summary section to the treebank readme file

Contributors: Memduh Gökırmak
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Spanish 1 <1K IE, Romance

Spanish treebanks

SVarT <1K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Johnatan Bonilla
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Spanish English 1 - Code switching

Spanish English treebanks

Miami -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Robert Pugh, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Swahili 1 - Niger-Congo, Bantoid

Swahili treebanks

OPUSGV -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Kenneth Steimel
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tagabawa 1 <1K Austronesian, Central Philippine

Tagabawa treebanks

GJA <1K

UD_Tagabawa_GJA is a collection of annotated Bagobo-Tagabawa sentences taken from different sources. It is currently under development.

Contributors: Glyd Aranes
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tagalog 1 340K Austronesian, Central Philippine

Tagalog treebanks

NewsCrawl 340K

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Elsie Marie Or, Angelina Aquino, Lester James Miranda
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Tagalog treebanks.

Language documentation

See the language documentation page.

Tetun 1 - Austronesian, Malayo-Polynesian

Tetun treebanks

TUDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Gabriel de Jesus
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Thai 1 - Tai-Kadai

Thai treebanks

TDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Siriluck Rattananiyomkul, Sylvain Kahane, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Tigrinya 1 - Afro-Asiatic, Semitic

Tigrinya treebanks

Keren -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Josiah Solomon
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tunisian Arabic 1 - Afro-Asiatic, Semitic

Tunisian Arabic treebanks

NAxLAT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Rayan Ziane
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Turkish 1 142K Turkic, Southwestern

Turkish treebanks

ULU 142K

The UD_Turkish-ULU Treebank is an automatic conversion of the ULU Treebank.

Contributors: Metin Bilgin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Turkish treebanks.

Language documentation

See the language documentation page.

Tuwari 1 - Sepik

Tuwari treebanks

Autogramm -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sylvain Loiseau
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Uspanteko 1 - Mayan

Uspanteko treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Uyghur 1 - Turkic, Southeastern

Uyghur treebanks

LDS -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Faruk Mardan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Uzbek 1 - Turkic, Southeastern

Uzbek treebanks

UzUDT -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sanatbek Matlatipov, Elmurod Kuriyozov
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Western S.P. Nahuatl 1 - Uto-Aztecan

Western Sierra Puebla Nahuatl treebanks

MesoTree -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Retired Treebanks

The following treebanks were part of one or more past UD releases, but they are no longer maintained and have been excluded from the most recent release.

English 1 97K IE, Germanic

English treebanks

ESL 97K

UD English-ESL / Treebank of Learner English (TLE) contains manual POS tag and dependency annotations for 5,124 English as a Second Language (ESL) sentences drawn from the Cambridge Learner Corpus First Certificate in English (FCE) dataset.

Contributors: Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, Boris Katz, Margarita Misirpashayeva
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

French 1 573K IE, Romance

French treebanks

FTB 573K

The Universal Dependency version of the French Treebank (Abeillé et al., 2003), hereafter UD_French-FTB, is a treebank of sentences from the newspaper Le Monde, initially manually annotated with morphological information and phrase-structure and then converted to the Universal Dependencies annotation scheme.

Contributors: Marie Candito, Bruno Guillaume, Teresa Lynn, Héctor Martínez Alonso, Benoît Sagot, Djamé Seddah, Eric Villemonte de la Clergerie
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Hindi English 1 26K Code switching

Hindi English treebanks

HIENCS 26K

The Hindi-English Code-switching treebank is based on code-switching tweets of Hindi and English multilingual speakers (mostly Indian) on Twitter. The treebank is manually annotated using the UD scheme. The training and evaluation sets were separately annotated by different annotators using UD v2 and v1 guidelines, respectively. The evaluation sets were automatically converted from UD v1 to v2.

Contributors: Riyaz Ahmad Bhat, Irshad Ahmad Bhat
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Japanese 2 204K Japanese

Japanese treebanks

Modern 14K

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from the 'Corpus of Historical Japanese' (CHJ).

Contributors: Mai Omura, Masayuki Asahara, Yuta Takahashi
Repository master dev
README
Treebank hub page
Download

KTC 189K

Please add a summary section to the treebank readme file

Contributors: Masayuki Asahara, Hiroshi Kanayama, Yuji Matsumoto, Yusuke Miyao, Shunsuke Mori, Takaaki Tanaka, Sumire Uematsu
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

See the language documentation page.

Norwegian 1 55K IE, Germanic

Norwegian treebanks

NynorskLIA 55K

This Norwegian treebank is based on the LIA treebank of transcribed spoken Norwegian dialects. The treebank has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Andre Kaasen
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Norwegian treebanks.

Language documentation

See the language documentation page.

Download

The data is released through LINDAT/CLARIN.

The next release (v2.16) is scheduled for May 15, 2025 (data freeze on May 1).
Version 2.15 treebanks are available at http://hdl.handle.net/11234/1-5787. 296 treebanks, 168 languages, released November 15, 2024.
Version 2.14 treebanks are archived at http://hdl.handle.net/11234/1-5502. 283 treebanks, 161 languages, released May 15, 2024.
Version 2.13 treebanks are archived at http://hdl.handle.net/11234/1-5287. 259 treebanks, 148 languages, released November 15, 2023.
Version 2.12 treebanks are archived at http://hdl.handle.net/11234/1-5150. 245 treebanks, 141 languages, released May 15, 2023.
Version 2.11 treebanks are archived at http://hdl.handle.net/11234/1-4923. 243 treebanks, 138 languages, released November 15, 2022.
Version 2.10 treebanks are archived at http://hdl.handle.net/11234/1-4758. 228 treebanks, 130 languages, released May 15, 2022.
Version 2.9 treebanks are archived at http://hdl.handle.net/11234/1-4611. 217 treebanks, 122 languages, released November 15, 2021.
Version 2.8 treebanks are archived at http://hdl.handle.net/11234/1-3687. 202 treebanks, 114 languages, released May 15, 2021.
Version 2.7 treebanks are archived at http://hdl.handle.net/11234/1-3424. 183 treebanks, 104 languages, released November 15, 2020.
Version 2.6 treebanks are archived at http://hdl.handle.net/11234/1-3226. 163 treebanks, 92 languages, released May 15, 2020.
Version 2.5 treebanks are archived at http://hdl.handle.net/11234/1-3105. 157 treebanks, 90 languages, released November 15, 2019.
Version 2.4 treebanks are archived at http://hdl.handle.net/11234/1-2988. 146 treebanks, 83 languages, released May 15, 2019.
Version 2.3 treebanks are archived at http://hdl.handle.net/11234/1-2895. 129 treebanks, 76 languages, released November 15, 2018.
Version 2.2 treebanks are archived at http://hdl.handle.net/11234/1-2837. 122 treebanks, 71 languages, released July 1, 2018.
Version 2.1 treebanks are archived at http://hdl.handle.net/11234/1-2515. 102 treebanks, 60 languages, released November 15, 2017.
Version 2.0 treebanks are archived at http://hdl.handle.net/11234/1-1983. 70 treebanks, 50 languages, released March 1, 2017.
- Test data 2.0 are archived at http://hdl.handle.net/11234/1-2184. 81 treebanks, 49 languages, released May 18, 2017.
Version 1.4 treebanks are archived at http://hdl.handle.net/11234/1-1827. 64 treebanks, 47 languages, released November 15, 2016.
Version 1.3 treebanks are archived at http://hdl.handle.net/11234/1-1699. 54 treebanks, 40 languages, released May 15, 2016.
Version 1.2 treebanks are archived at http://hdl.handle.net/11234/1-1548. 37 treebanks, 33 languages, released November 15, 2015.
Version 1.1 treebanks are archived at http://hdl.handle.net/11234/LRT-1478. 19 treebanks, 18 languages, released May 15, 2015.
Version 1.0 treebanks are archived at http://hdl.handle.net/11234/1-1464. 10 treebanks, 10 languages, released January 15, 2015.
In general, we intend to have regular treebank releases every six months. The v2.0 and v2.2 releases were brought forward because of their usage in the CoNLL 2017 and 2018 Multilingual Parsing Shared Tasks.
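
The release entries above follow a regular pattern (version, handle URL, treebank count, language count, release date), so they can be turned into structured records, for instance to track how the collection has grown. The following is a minimal sketch only, not an official UD tool: the function name `parse_release` and the regular expression are assumptions made for illustration, and the sample lines are copied from the list above.

```python
import re
from datetime import datetime

# Sample entries copied verbatim from the release list on this page.
RELEASE_LINES = [
    "Version 2.15 treebanks are available at http://hdl.handle.net/11234/1-5787. "
    "296 treebanks, 168 languages, released November 15, 2024.",
    "Version 2.14 treebanks are archived at http://hdl.handle.net/11234/1-5502. "
    "283 treebanks, 161 languages, released May 15, 2024.",
    "Version 1.1 treebanks are archived at http://hdl.handle.net/11234/LRT-1478. "
    "19 treebanks, 18 languages, released May 15, 2015.",
]

# Hypothetical pattern: it assumes the exact wording used in the list above.
PATTERN = re.compile(
    r"Version (?P<version>\d+\.\d+) treebanks are (?:available|archived) at "
    r"(?P<url>http://hdl\.handle\.net/\S+?)\. "
    r"(?P<treebanks>\d+) treebanks, (?P<languages>\d+) languages, "
    r"released (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\."
)


def parse_release(line):
    """Return a dict describing one release entry, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "version": match.group("version"),
        "url": match.group("url"),
        "treebanks": int(match.group("treebanks")),
        "languages": int(match.group("languages")),
        # Month names are parsed as English (the default C locale is assumed).
        "released": datetime.strptime(match.group("date"), "%B %d, %Y").date(),
    }


releases = [parse_release(line) for line in RELEASE_LINES]
for release in releases:
    print(release["version"], release["released"].isoformat(),
          release["treebanks"], release["languages"])
```

Lines that do not follow this wording, such as the scheduling note for the next release, simply yield `None` and can be skipped.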