Ana Castro Salgado

Papers

Iberian Academy Dictionaries as Lexical Resources

Portuguese Language Resources

UIDB/00749/2020 UIDP/00749/2020. This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web standards. The results obtained are useful for the discussion within the community.

The advent of a new lexicographical Portuguese project

UID/LIN/03213/2013. MORDigital is a newly funded Portuguese lexicographic project that aims to produce high-quality and searchable digital versions of the first three editions (1789; 1813; 1823) of the Diccionario da Lingua Portugueza by António de Morais Silva, preserving and making accessible this important work of European heritage. This paper will describe the current state of the art, the project, its objectives and the methodology proposed, the latter of which is based on a rigorous linguistic analysis and will also include steps necessary for the ontologisation of knowledge contained in and relating to the text. A section will be dedicated to the various investigation domains of the project description. The output of the project will be made available via a dedicated platform.

football terms encoded in TEI Lex-0

UIDB/00749/2020 UIDP/00749/2020. Terms are a significant part of lexicographical nomenclatures in general language dictionaries. In this paper, we focus on how football terms are treated in three Academy Dictionaries – Portuguese, French, and Spanish – and draw some conclusions about the lexicographical decisions taken in the three languages. After identifying every position football players can have on the field, we verify whether the dictionaries above include these terms. We propose the TEI encoding of the term “defesa” (defence), which designates a position occupied by football players on the field. Bearing in mind concepts such as reusability and interoperability, we intend to present: 1) a comparison of football terms in the three dictionaries; 2) TEI Lex-0 dictionary encoding, a streamlined standard to facilitate interoperability; 3) a consistent TEI modelling and description of the microstructural elements of lexicographical entries. In the end, we draw some conclusions.

Designing the ELEXIS Parallel Sense-Annotated Dataset in 10 European Languages

The Grande Dicionário Houaiss da Língua Portuguesa Dictionary as a Use Case

UIDB/00749/2020 UIDP/00749/2020. In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.

Comprender el mundo para mejorar un diccionario: las marcas temáticas en el Diccionario de la lengua española de la Real Academia Española

The aim of this paper is to analyse the list of domains in the Diccionario de la lengua española (DLE) in order to rethink the theoretical and methodological assumptions of the lexicographic tradition regarding domain labelling. After describing and analysing the labels that identify specialised vocabulary and comparing them across the Iberian academy dictionaries, we will compare the DLE's list of labels with other classification systems such as EUROVOC, the UNESCO Thesaurus, and the WordNet Domains Hierarchy.

Modelling Etymology in LMF/TEI: The 'Grande Dicionário Houaiss da Língua Portuguesa' Dictionary as a Use Case

In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.

O projeto Edição Digital dos Vocabulários da Academia das Ciências de Lisboa: o VOLP 1940

Presentation of the project Edição Digital dos Vocabulários da Academia das Ciências de Lisboa: o VOL...

A good TACTIC for lexicographical work: football terms encoded in TEI Lex-0

In this presentation, we focus on how football terms are treated in three Academy Dictionaries – ...

TEI Lex-0: a good fit for the encoding of the Portuguese Academy Dictionary?

In this presentation, we report on the encoding of the Portuguese Academy Dictionary using TEI Lex-0. We demonstrate how we applied this new baseline format for lexical data to mark up 'special entries' in the dictionary: part-of-speech homonyms (antepassado1, antepassado2, antepassado3), etymological homonyms (cota1, cota2), homographs (lobo1 /ó/, lobo2 /ô/), spelling variants (ouro, oiro), trademarks (donut), entries that have a different meaning in the plural (antepassados), and lexical variants (missanga, miçanga).
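
As a rough illustration of what marking up a spelling variant such as ouro/oiro could look like, here is a minimal Python sketch that builds a TEI-style entry with one lemma form and one variant form. The helper name `build_variant_entry`, the identifier scheme, and the choice of `type="lemma"` and `type="variant"` on `<form>` are assumptions made for this example; it is not the encoding actually used for the Portuguese Academy Dictionary.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
# Serialise TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def build_variant_entry(entry_id, lemma, variant, lang="pt"):
    """Build an <entry> holding a lemma form and one spelling variant.

    Hypothetical helper: the attribute values are illustrative assumptions.
    """
    entry = ET.Element(f"{{{TEI_NS}}}entry")
    entry.set(f"{{{XML_NS}}}id", entry_id)
    entry.set(f"{{{XML_NS}}}lang", lang)
    for form_type, text in (("lemma", lemma), ("variant", variant)):
        form = ET.SubElement(entry, f"{{{TEI_NS}}}form", {"type": form_type})
        orth = ET.SubElement(form, f"{{{TEI_NS}}}orth")
        orth.text = text
    return entry


entry = build_variant_entry("ouro", "ouro", "oiro")
print(ET.tostring(entry, encoding="unicode"))
```

Keeping both spellings inside a single entry is only one possible modelling choice; the presentation itself may treat variants differently.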

TEI Lex-0 In Action: Improving the Encoding of the Dictionary of the Academia das Ciências de Lisboa

This paper describes some experiments made while encoding the first complete dictionary of the Academia das Ciências de Lisboa (DACL) in the context of TEI Lex-0, a community-based interchange format for lexical data aimed at facilitating the interoperability and reusability of lexical resources. Even though the original encoding of the DACL was based on TEI, we decided to switch to TEI Lex-0 because it allowed us to streamline our encoding. Our experiments show that even though TEI Lex-0 is stricter than TEI itself (allowing fewer elements and imposing certain constraints that are not present in plain TEI), it is fully capable of representing the complexities of the entry structure of the DACL. In the paper, we discuss the TEI Lex-0 encoding of the DACL, as well as the conversion methodology and the tools used for the automatic conversion from the original encoding. We are currently focusing on the macrostructural level, more precisely on the types of lexical units and on the writt...

LeXmart: A Smart Tool for Lexicographers

The digital era has brought some challenges to lexicographers, but it has also brought new opportunities as part of the rise of information technology and, more recently, the emergence of digital humanities. This paper provides a description of LeXmart, the framework that supports the digital development of the Portuguese Academy of Sciences Dictionary. LeXmart is a smart tool framework to support lexicographers' work that offers different types of tools, ranging from a structural editor to a set of validation tools. Given that the dictionary is stored in eXist-DB, LeXmart is developed on top of its ecosystem, using W3C standard languages and exposing the default functionalities of eXist-DB, namely a RESTful API.

Challenges of Word Sense Alignment: Portuguese Language Resources

This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web technologies. The results obtained are useful for the discussion within the community.

Building a Dictionary using XML Technology

In this article we describe the workflow implemented to convert a dictionary saved as a PDF file into an XML document, its subsequent import into an XML-aware database, and the process for editing, adding, and deleting entries. The conversion process was challenging given the format of the PDF file and the fine-grained detail of the XML schema that was used. For that, an iterative filtering approach was used. To store the dictionary we decided to use an XML-aware database (eXist-DB) that stores each dictionary entry as a separate resource. It can be queried using a web interface developed using XQuery. The lexicographers can edit entries using the oXygen XML editor, reading and storing them directly in the database. In order to guarantee incremental backups, a mechanism was defined to import the XML database into a GIT repository. Finally, a couple of programs were created in order to prepare regular reports on the dictionary revision process, as well as to back it up in a GIT repos...
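
The idea of storing each dictionary entry as a separate resource can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element and attribute names (`dictionary`, `entry`, `id`, `form`), the sample data, and the helper `split_entries` are hypothetical and do not reproduce the schema or tooling described in the article, and the sketch does not talk to eXist-DB at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dictionary; the real schema is far more detailed.
SAMPLE = """<dictionary>
  <entry id="ouro"><form>ouro</form></entry>
  <entry id="lobo"><form>lobo</form></entry>
</dictionary>"""


def split_entries(xml_text, entry_tag="entry", id_attr="id"):
    """Return {resource_name: xml_string}, one item per dictionary entry."""
    root = ET.fromstring(xml_text)
    resources = {}
    for entry in root.iter(entry_tag):
        # Drop whitespace that follows the entry in the source document.
        entry.tail = None
        name = f"{entry.get(id_attr)}.xml"
        resources[name] = ET.tostring(entry, encoding="unicode")
    return resources


resources = split_entries(SAMPLE)
for name in sorted(resources):
    print(name, resources[name])
```

One resource per entry keeps edits and version-control diffs local to the entry that changed, which fits the incremental backup goal mentioned above.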

Encoding polylexical units with TEI Lex-0: A case study

Slovenščina 2.0: empirical, applied and interdisciplinary research, 2020

The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered adequately and in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital representation of textual resources in the scholarly research community. In this paper, we use the Dictionary of the Portuguese Academy of Sciences as a case study for presenting our ongoing work on encoding polylexical units using TEI Lex-0, an initiative aimed at simplifying and streamlining the encoding of lexical data with TEI in order to improve interoperability. We introduce the notion of macro- and microstructural relevance to differentiate between polylexicals that serve as headwords for their own independent dictionary entries and those which appear inside entries for different headwords. We develop the notion of lexicographic transparency to distinguish between those units which ...

Marcas temáticas en los diccionarios académicos ibéricos: estudio comparativo

RILEX. Revista sobre investigaciones léxicas, 2019

The current digital revolution is opening new paths in the production and compilation of lexicographic resources, specifically general-language dictionaries, which are now adapted, in both form and content, to the new needs of society at large and of their users in particular. Alongside general vocabulary, these works record, describe, and define specialised vocabulary from different areas of knowledge. The number of terminological units included in the nomenclature of these resources tends to grow, given the technological boom, the evolution of society, and the phenomena of globalisation, since these units are privileged sources of lexical renewal and enrichment for linguistic systems. Accordingly, the domain labels that tag specialised vocabulary in monolingual dictionaries are the object of study of the present work, whose purpose is to contribute...

Improving the consistency of usage labelling in dictionaries with TEI Lex-0

Lexicography, 2019

A platform designed with lexicographical data in mind

UIDB/03213/2020 UIDP/03213/2020. LeXmart is an open-source web platform used to support the lexicographer’s work through editing, control, validation, management, and publication of lexical resources. This tool was specifically developed to facilitate the compilation of general monolingual dictionaries in which data is encoded according to the Text Encoding Initiative (TEI) schema (chapter 9). Here, we will describe the challenges of adapting LeXmart to deal with TEI Lex-0 and distinct types of lexical resources, namely Dicionário da Língua Portuguesa (DLP) and Vocabulário Ortográfico da Língua Portuguesa, lexicographic works from Academia das Ciências de Lisboa, and Dicionário Aberto, the retro-digitised version of the Cândido de Figueiredo dictionary. This article describes the steps taken to update the LeXmart platform to deal with the TEI Lex-0 schema and describes the challenges of properly encoding these three projects while allowing the lexicographical team to work continuously. Th...

O projeto 'Edição Digital dos Vocabulários da Academia das Ciências': o VOLP-1940

Revista da Associação Portuguesa de Linguística, 2020

This paper presents the Digital Edition of the Vocabularies of the Academy of Sciences project, which aims to digitise the spelling vocabularies of the Lisbon Academy of Sciences (ACL) in order to create a digital lexicographic corpus bringing together the printed versions of all these lexicographical reference works – the 1940, 1947, 1970, and finally the 2012 editions. The first stage started with the Vocabulário Ortográfico da Língua Portuguesa [Orthographic Vocabulary of the Portuguese Language] (VOLP-1940), our case study. After digitising this vocabulary, the work described here focuses on the linguistic annotation of VOLP-1940 using eXtensible Markup Language (XML), an annotation metalanguage, and following the annotation directives of the Text Encoding Initiative (TEI), more specifically the application of TEI Lex-0, a new TEI sub-format. We aim to highlight the need for rigorous linguistic data processing in the creation of new lexical resources to increase the quality of t...
