Livy Real

Textual similarity is the task of determining how similar two pieces of text are, in terms of either lexical (surface form) or semantic (meaning) closeness. In this paper we applied word embeddings to measure e-commerce product title similarity in Brazilian Portuguese. We generated domain-specific word embeddings (using Word2Vec, FastText and GloVe) and compared them with general-domain models (word embeddings and BERT models). We concluded that cosine similarity computed over the domain-specific word embeddings was a good approach for distinguishing between similar and non-similar products, but the pre-trained multilingual BERT model performed best.
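As a rough illustration of the cosine-similarity comparison described in this abstract, the sketch below scores two product titles by averaging their word vectors and taking the cosine of the two averages. The averaging step, the function names and the tiny `TOY_EMBEDDINGS` table are assumptions made for this example only; the abstract does not say how title vectors were composed, and real experiments would load trained Word2Vec, FastText or GloVe vectors instead.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "celular": [0.9, 0.1, 0.0],
    "smartphone": [0.8, 0.2, 0.0],
    "capa": [0.1, 0.9, 0.0],
    "geladeira": [0.0, 0.1, 0.9],
}


def average_vector(tokens, embeddings):
    """Average the vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known token, so no title vector can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def title_similarity(title_a, title_b, embeddings):
    """Score two titles by the cosine of their averaged word vectors."""
    vec_a = average_vector(title_a.lower().split(), embeddings)
    vec_b = average_vector(title_b.lower().split(), embeddings)
    if vec_a is None or vec_b is None:
        return 0.0  # assumption: titles with no known words score zero
    return cosine_similarity(vec_a, vec_b)
```

With these toy vectors, `title_similarity("Capa celular", "Capa smartphone", TOY_EMBEDDINGS)` scores higher than `title_similarity("Capa celular", "Geladeira", TOY_EMBEDDINGS)`; in practice a threshold on the score would have to be tuned on labelled title pairs.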
We discuss nominalizations in Portuguese formed by the suffix -ura. We carried out a corpus-based description of the behavior of these nominals and proposed a type ontology to categorize them. To offer a rich description, we also tested all words formed with -ura in co-predication contexts to check whether their types could be co-predicated. Although our main goal was to produce a corpus-based description of these nominals, we found that the frequency of use of a given word may play a special role in the acceptability of co-predication between different senses of a nominalization.
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This note describes OpenWordnet-PT, an automatically created, manually curated wordnet for Portuguese and introduces the newly developed web interface we are using to speed up its manual curation. OpenWordNet-PT is part of a collection of wordnets for various languages, jointly described and distributed through the Open MultiLingual WordNet and the Global WordNet Association. OpenWordnet-PT has been primarily distributed, from the beginning, as RDF files along with its model description in OWL, and it is freely available for download. We contend the creation of such large, distributed and linkable lexical resources is on the cusp of revolutionizing multilingual language processing to the next truly semantic level. But to get there, there is a need for user interfaces that allow ordinary users and (not only computational) linguists to help in the checking and cleaning up of the quality of the resource. We present our suggestion of one such web interface and describe its features supp...
Multimodal learning aims to exploit the characteristics of different modalities (text, image, audio) to build computational models. In e-commerce, given the wide variety of product characteristics and the absence or inconsistency of information, combining information from different modalities is particularly well suited. This work presents experiments on the multimodal (text and image) classification of products (adult products) that cannot be sold on the partner company's marketplace. In these experiments, neural networks were used to train unimodal and multimodal classifiers. The multimodal classifier reached an F1 of 99%, against 98% for the text-only model and 94% for the image-only model.
This paper describes work on incorporating the morphosemantic links of Princeton’s WordNet into the fabric of the Portuguese OpenWordNet-PT. Morphosemantic links are semantically typed relations between verbs and derivationally related nouns (for example, tune and tuner, in Portuguese “afinar” and “afinador”, connected by an “agent” link). Morphosemantic links have been discussed for Princeton’s WordNet for a while, but have not been added to the official database. Because these links are very useful for improving our Portuguese wordnet, we discuss their integration into our base and the issues we encountered along the way.
This release contains errors in several files. Please use http://hdl.handle.net/11234/1-1983 instead.
This work describes how we used AnCora-Nom, a Spanish nominalization lexicon, to extend NomLex-PT, a lexical resource for Portuguese originally based on the English NomLex lexicon and fully integrated into OpenWordNet-PT, our freely available Portuguese wordnet. The complete Spanish lexicon, which contains 1,655 entries, was translated into Portuguese and then compared with our previous data. Further comparison between the different kinds of nominal classification in AnCora-Nom and NomLex-PT is underway.
This paper presents the first effort towards a Portuguese corpus annotated with wordnet senses. We manually annotated the nouns of 30 sentences, using OpenWordNet-PT as the lexicon, and then compared the results with an automatic annotation. In addition to the system’s evaluation, the results provided valuable insights into how to approach this ambitious task.
This paper explores how Natural Language Processing techniques can be integrated to solve real-world problems in the e-commerce scenario. We address the issue of offering customers high-quality product information on a marketplace platform made up of thousands of sellers who produce original content in multiple languages and follow different SEO and cultural assumptions. We propose an NLP pipeline to generate high-quality product titles in Portuguese.
This paper presents OpenWordNet-PT, a freely available open-source wordnet for Portuguese, with its latest developments and practical uses. We provide a detailed description of the RDF representation developed for OpenWordnet-PT. We highlight our efforts to extend the coverage of our resource and add nominalization relations connecting nouns and verbs. Finally, we present several real-world applications where OpenWordnet-PT was put to use, including a large-scale high-throughput sentiment analysis system.
This paper describes our ongoing work to create a temporally annotated open Portuguese corpus. We discuss how this task helped to improve and evaluate linked open lexical resources in Portuguese, namely OpenWordNet-PT and TempoWordNet. We use Linguateca’s Bosque corpus, which we annotated with Universal Dependencies (UD 2.0) and with HeidelTime, a state-of-the-art open-source temporal tagger, to build Bosque-T, our proposed temporal corpus.
This paper presents NomLex-PT, a lexical resource describing Portuguese nominalizations. NomLex-PT connects verbs to their nominalizations, thereby enabling NLP systems to observe the potential semantic relationships between the two words when analysing a text. NomLex-PT is freely available and encoded in RDF for easy integration with other resources. Most notably, we have integrated NomLex-PT with OpenWordNet-PT, an open Portuguese Wordnet.
This paper presents NomLex-BR, a lexical resource describing Brazilian Portuguese nominalizations, and its integration with OpenWordnet-PT. We first describe the original English NOMLEX lexical resource and how we used it to bootstrap a Portuguese version. Subsequently, we describe how this lexicon can be embedded into OpenWordnet-PT, which facilitates its use and helps spot-checking both the bigger integrated resource and the original lexicon. Lastly, we outline some of the other, more substantial work that we plan to engage for the project of using linguistic insights for knowledge representation in Portuguese.
Previous research has demonstrated the benefits of using linguistic resources to analyze a user’s social media profiles in order to learn information about that user. However, numerous linguistic resources exist, raising the question of choosing the appropriate resource. This paper compares Extended WordNet Domains with DBpedia. The comparison takes the form of an investigation of the relationship between users’ descriptions of their knowledge and background on LinkedIn with their description of the same characteristics on Twitter. The analysis applied in this study consists of four parts. First, information a user has shared on each service is mined for keywords. These keywords are then linked with terms in DBpedia/Extended WordNet Domains. These terms are ranked in order to generate separate representations of the user’s interests and knowledge for LinkedIn and Twitter. Finally, the relationship between these separate representations is examined. In a user study with eight partici...
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies. Due to the shared task the test data was held hidden and not released together with the training and development data of UD 2.0. Therefore this release complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983) to a full release of UD treebanks. In addition, the present release contains 18 new parallel test sets and 4 test sets in surprise languages. The present r...
Semantic relations between words are key to building systems that aim to understand and manipulate language. For English, the “de facto” standard for representing this kind of knowledge is Princeton’s WordNet. Here, we describe the wordnet-like resources currently available for Portuguese: their origins, methods of creation, sizes, and usage restrictions. We start tackling the problem of comparing them, but only in quantitative terms. Finally, we sketch ideas for potential collaboration between some of the projects that produce Portuguese wordnets.
This paper describes the creation of a Portuguese corpus following the guidelines of the Universal Dependencies Framework. Instead of starting from scratch, we invested in a conversion process from the existing Portuguese corpus, called Bosque. The conversion was done by applying a context-sensitive set of Constraint Grammar rules to its original deep linguistic analysis, which was carried out by the parser PALAVRAS, with some additional manual corrections. Universal Dependencies offer the promise of greater parallelism between languages, a plus for researchers in many areas. We report the challenges of dealing with Portuguese, a Romance language, hoping that our experience will help others.
This paper is about the automated analysis of the lexical and compositional semantics of nominalisation, in particular of felicitous co-predications (infelicitous ones are rejected). We focus on Brazilian Portuguese nominalisation introduced by the suffix -ura, but our discussion applies to other nominalisations and languages as well. Much of the theoretical work on deverbals, including ours, concluded that deverbal senses (process, result, location and so on) are rather unforeseeable from the verb and the suffix. The (in)felicity of co-predication, that is, the possible conjunction of two predicates which apply to different senses, is also difficult to predict, and in the case of deverbals it is assumed to be (almost) impossible. We present here a study of the CHAVE corpus and show that CHAVE does actually contain some of the supposedly non-existent copredications. We explain our formalisation of sense variation and copredications in our logico-computational framework, Montagovean...
Part-of-Speech (POS) tagging consists of labeling every token of a text with its correct morphosyntactic category and is considered by many a solved task in NLP. However, there are many tag systems in use, tags are not easy to compare, and there is no official gold standard, so comparing the performance of different systems is a nightmare, even for English, and much more so for less-resourced languages. Recently a collective of researchers decided to tackle this issue through a new initiative, the Universal Dependencies project, which is developing cross-linguistically consistent treebanks and annotations for many languages. We look at how the coarse categories of POS tags defined by the Universal Dependencies project would work for Portuguese and describe the issues of aligning them with the POS tags produced by FreeLing, the open-source NLP system we use.
Not many years ago it was usual to comment on the lack of an open lexical-semantic knowledge base, following the lines of Princeton WordNet, but for Portuguese. Today, the landscape has changed significantly, and researchers that need access to this specific kind of resource have not one, but several alternatives to choose from. The present article describes the wordnet-like resources currently available for Portuguese. It provides some context on their origin, creation approach, size and license for utilization. Apart from being an obvious starting point for those looking for a computational resource with information on the meaning of Portuguese words, this article describes the resources available, compares them and lists some plans for future work, sketching ideas for potential collaboration between the projects described.
This paper describes a manual investigation of the SICK corpus, which is the proposed test set for a new system for natural language inference. The system provides conceptual semantics for sentences, so that entailment, contradiction and neutrality relations between sentences can be identified. Investigating the SICK corpus was necessary to check the quality of the test data that is to be used as a gold standard for the new system. This check also provides crucial insights for the implementation of the system's components. The investigation showed that the human judgements used in building the SICK corpus can be erroneous, which lowers the quality of an otherwise useful resource. We also show that detecting the relationship between some pairs in the SICK corpus requires more than just lexical semantics, which provides us with guidelines and intuitions for our further implementation.
Experiment in group translation of a passage of Metamorphoses with a Portuguese dactylic hexameter as proposed by Carlos Alberto Nunes in his Homer and Virgil.
We propose a lexical account of action nominals, in particular of deverbal nominalisations, whose meaning is related to the event expressed by their base verb. The literature on nominalisations often assumes that the semantics of the base verb completely defines the structure of action nominals. We argue that the information in the base verb is not sufficient to completely determine the semantics of action nominals. We present data from different languages, especially Romance languages, which show that nominalisations focus on some aspects of the verb semantics. The selected aspects, however, seem to be idiosyncratic and do not automatically result from the internal structure of the verb or from its interaction with the morphological suffix. We therefore propose a partially lexicalist view of deverbal nouns. It is made precise and computable by using the Montagovian Generative Lexicon, a type-theoretical framework introduced by Bassac, Mery and Retoré in this journal in 2010. This extension of Montague semantics with a richer type system easily incorporates lexical phenomena such as the semantics of action nominals, in particular deverbals, including their polysemy and (in)felicitous copredications.
PhD thesis, in Portuguese. An overview of the behavior of nominalizations in Brazilian Portuguese and an in-depth discussion considering three different lexical frameworks: Grimshaw (1990), Pustejovsky (1995) and Bassac, Mery & Retoré (2007).
This preliminary account of our work on improving the verb lexicon of OpenWordNet-PT describes some of the issues that one faces when manually cleaning up a semi-automatically constructed lexical resource and some of the lessons we learned while doing it.
This work points out a semantic approach to the morphological level, strongly based on Bayer (1997) and Hoeksema (1985). Based on Categorial Grammar, a syntactic semantic tool commonly used by semanticists and computational linguists, I look at the morphological level through the same syntactic rules adopted by the model. In this way, it is possible to gain a theoretical improvement since I look at morphological, semantic and syntactic levels through the same formal structure. I will briefly introduce Categorial Grammar showing its application in Brazilian Portuguese. I have chosen a nominalizer suffix, -ura, present in such words as abertura ‘opening’ and assadura ‘baking/rash’.