2022
Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
Tamás Váradi | Bence Nyéki | Svetla Koeva | Marko Tadić | Vanja Štefanec | Maciej Ogrodniczuk | Bartłomiej Nitoń | Piotr Pęzik | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Dan Tufiș | Radovan Garabík | Simon Krek | Andraž Repar
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and their named entities are annotated. The annotations are provided uniformly for each language-specific corpus, while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
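As an illustration of the file layout described in the abstract, the following is a minimal sketch of reading a CoNLL-U Plus file in Python. It relies on the CoNLL-U Plus convention that the first line declares the column names in a `# global.columns = ...` comment. The three extra column names used here (`X:NE`, `X:NP`, `X:TERM`) and the sample tokens are hypothetical placeholders, not the names or data used in the CURLICAT corpora.

```python
from typing import Dict, Iterable, Iterator, List

Token = Dict[str, str]


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of {column name: value} dictionaries."""
    columns = None
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header names every column, standard and extra alike.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, text, ...)
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError("row width does not match the header")
            sentence.append(dict(zip(columns, values)))
    if sentence:
        yield sentence


# Hypothetical sample: ten CoNLL-U columns plus three assumed extra columns.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC X:NE X:NP X:TERM")
ROWS = [
    ["1", "European", "European", "ADJ", "_", "_", "2", "amod", "_", "_",
     "B-ORG", "B-NP", "_"],
    ["2", "Commission", "Commission", "PROPN", "_", "_", "0", "root", "_", "_",
     "I-ORG", "I-NP", "_"],
]
SAMPLE = "\n".join([HEADER, "# sent_id = 1"] + ["\t".join(r) for r in ROWS] + [""])

sentences = list(read_conllu_plus(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["X:NE"])
```

Reading the column names from the header rather than hard-coding them means the same function would handle files with a different number of extra columns, provided the header is present.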
DiaBiz – an Annotated Corpus of Polish Call Center Dialogs
Piotr Pęzik | Gosia Krawentek | Sylwia Karasińska | Paweł Wilk | Paulina Rybińska | Anna Cichosz | Angelika Peljak-Łapińska | Mikołaj Deckert | Michał Adamczyk
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper introduces DiaBiz, a large, annotated, multimodal corpus of Polish telephone conversations conducted in varied business settings, comprising 4036 call centre interactions from nine different domains: banking, energy services, telecommunications, insurance, medical care, debt collection, tourism, retail and car rental. The corpus was developed to boost the development of third-party speech recognition engines, dialog systems and conversational intelligence tools for Polish. Its current size amounts to nearly 410 hours of recordings and over 3 million words of transcribed speech. We present the structure of the corpus, the data collection and transcription procedures, the challenges of punctuating and truecasing speech transcripts, and the dialog structure annotation, and we discuss some of the ecological validity considerations involved in the development of such resources.
2020
The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraž Repar | Matjaž Rihtar | Janez Brank
Proceedings of the Twelfth Language Resources and Evaluation Conference
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
2018
Increasing the Accessibility of Time-Aligned Speech Corpora with Spokes Mix
Piotr Pęzik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2014
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiș | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact at the regional, national and international levels, mainly with regard to politics and the funding situation for language technology (LT) topics. The paper documents the initiative's work throughout Europe to boost progress and innovation in our field.
2012
Towards a comprehensive open repository of Polish language resources
Maciej Ogrodniczuk | Piotr Pęzik | Adam Przepiórkowski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland site, which contains an exhaustive collection of Polish LRTs. Current work focuses on the creation of new LRTs and, especially, the enhancement of existing LRTs, such as parallel corpora, annotated corpora of written and spoken Polish, and morphological dictionaries, to be made available via the META-SHARE repository.
2010
Recent Developments in the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Marek Łaziński | Piotr Pęzik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The aim of the paper is to present recent (as of March 2010) developments in the construction of the National Corpus of Polish (NKJP). The NKJP project was launched at the very end of 2007 and aims to compile a large, linguistically annotated corpus of contemporary Polish by the end of 2010. Out of the total pool of 1 billion words of text data collected in the project, a 300-million-word balanced corpus will be selected to match a set of predefined representativeness criteria. The present paper outlines a number of recent developments in the NKJP project, including: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools. As the work on NKJP progresses, it becomes clear that this project serves as an important testbed for linguistic annotation and interoperability standards. We believe that our recent experiences will prove relevant to future large-scale language resource compilation efforts.