The paper focuses on the manipulation of a WordNet-based knowledge graph by adding, changing and combining various semantic relations. This is done in the context of measuring similarity and relatedness between words, based on word embedding representations trained on a pseudo corpus generated from the knowledge graph. The UKB tool is used for generating pseudo corpora that are then used for learning word embeddings. The results from the performed experiments show that the addition of more relations generally improves performance along both dimensions – similarity and relatedness. In line with previous research, our survey confirms that paradigmatic relations predominantly improve similarity, while syntagmatic relations benefit relatedness scores.
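The pseudo-corpus idea above can be illustrated with a minimal sketch: random walks over a small relation graph produce "sentences" that any word2vec-style trainer can consume. This is not the actual UKB implementation; the toy graph and node names are invented for illustration.

```python
import random

def random_walk_corpus(graph, walks_per_node=10, walk_len=8, seed=42):
    """Generate a pseudo corpus of 'sentences' by random walks over a graph.

    graph: dict mapping a node (e.g. a WordNet lemma) to a list of
    neighbours reachable via semantic relations. Node names below are
    illustrative, not taken from the real WordNet graph.
    """
    rng = random.Random(seed)
    corpus = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = graph.get(walk[-1])
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            corpus.append(walk)
    return corpus

# Toy graph with hypernym/synonym-style edges (hypothetical example).
toy = {
    "dog":    ["canine", "pet"],
    "canine": ["dog", "animal"],
    "pet":    ["dog", "cat"],
    "cat":    ["pet", "animal"],
    "animal": ["canine", "cat"],
}
corpus = random_walk_corpus(toy)
# Each walk is one pseudo sentence; feeding many such walks to a
# word-embedding trainer yields vectors shaped by the chosen relations.
```

Adding or removing relation types changes the graph's edge set, and hence which words co-occur in the walks, which is exactly the knob the experiments above turn.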
One of the most successful approaches to Word Sense Disambiguation (WSD) in the last decade has been the knowledge-based approach, which exploits lexical knowledge sources such as Wordnets, ontologies, etc. The knowledge encoded in them is typically used as a sense inventory and as a relations bank. However, this type of information is rather sparse in terms of senses and the relations among them. In this paper we present a strategy for the enrichment of WSD knowledge bases with data-driven relations from a gold standard corpus (annotated with word senses, syntactic analyses, etc.). We focus on English as a use case, but our approach is scalable to other languages. The results show that the addition of new knowledge improves the accuracy of the WSD task.
This paper presents a linguistic processing pipeline for Bulgarian, including morphological analysis, lemmatization and syntactic analysis of Bulgarian texts. The morphological analysis is performed by three modules: two statistical and one rule-based. The combination of these modules achieves the best result for morphological tagging of Bulgarian over a rich tagset (680 tags). The lemmatization is based on rules generated from a large morphological lexicon of Bulgarian. The syntactic analysis is implemented via MaltParser. The two statistical morphological taggers and MaltParser are trained on datasets constructed within the BulTreeBank project. The processing pipeline also includes a sentence splitter and a tokenizer. All tools in the pipeline are packaged as modules that can also be run separately. The whole pipeline is designed to serve as the back-end of a web-service-oriented interface, but it also supports user tasks through a command-line interface. The processin...
Current developments in the area report numerous applications of recurrent neural networks to Word Sense Disambiguation that have increased prediction accuracy even in situations with sparse knowledge, thanks to their generalization properties. Since the traditionally used LSTM networks demand enormous computational power and time to train, the aim of the present work is to investigate the applicability of a recently proposed fast-trainable RNN, namely Echo State Networks (ESNs). The preliminary results reported here demonstrate the applicability of ESNs to WSD.
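The core ESN update can be sketched in a few lines: the input and reservoir weights are random and fixed, and only a linear readout (not shown) is ever trained, which is why ESNs are cheap compared to LSTMs. This is a minimal, pure-Python illustration with toy sizes, not the configuration used in the experiments.

```python
import math
import random

def make_reservoir(n_in, n_res, scale=0.9, seed=0):
    """Build fixed random input and reservoir weight matrices.

    The 'scale / n_res' factor is a crude stand-in for proper spectral
    radius control; real ESN setups rescale W_res by its spectral radius.
    """
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_res)]
    w_res = [[rng.uniform(-1.0, 1.0) * scale / n_res for _ in range(n_res)]
             for _ in range(n_res)]
    return w_in, w_res

def step(state, x, w_in, w_res):
    """One reservoir update: state' = tanh(W_in x + W_res state)."""
    new_state = []
    for i in range(len(state)):
        s = sum(w_in[i][j] * x[j] for j in range(len(x)))
        s += sum(w_res[i][j] * state[j] for j in range(len(state)))
        new_state.append(math.tanh(s))
    return new_state

w_in, w_res = make_reservoir(n_in=3, n_res=5)
state = [0.0] * 5
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):  # toy input vectors
    state = step(state, x, w_in, w_res)
# 'state' now summarizes the input sequence; a trained linear readout
# over such states would produce the sense predictions.
```

The bidirectional variant (BiESN) simply runs a second reservoir over the sequence in reverse and concatenates the two states before the readout.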
ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (from November 1st 2019) or as "reference" (before that date). The corpora have extensive metadata, covering aspects of the parliament and of the speakers (name, gender, MP status, party affiliation, party coalition/opposition status); they are structured into time-stamped terms, sessions and meetings, with speeches marked by speaker and role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter, ParlaMint schemas. This entry contains the linguistically marked-up version of the corpus, while the text version is available at http://hdl.handle.net/11356/1432. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech tags, morphological features and syntactic dependencies, as well as 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tags or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools.
The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpus; the derived corpus in CoNLL-U format with TSV speech metadata; and the vertical files (with a registry file), suitable for use with CQP-based concordancers such as CWB, noSketch Engine or KonText. Also included is the 2.1 release of the data and scripts available at the GitHub repository of the ParlaMint project. As opposed to the previous version 2.0, this version corrects some errors in various corpora and adds information on the upper/lower house for bicameral parliaments. The vertical files have also been changed to make them easier to use in the concordancers.
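For working with the CoNLL-U distribution mentioned above, a minimal reader can be sketched as follows. The column layout follows the standard CoNLL-U specification; the sample line is invented for illustration and does not come from the corpus.

```python
def read_conllu(text):
    """Minimal CoNLL-U reader: sentences as lists of (form, lemma, upos).

    Comment lines (# ...) carry sentence/speech metadata and are skipped
    here; multiword-token range lines (IDs like '1-2') are skipped too.
    """
    sents, cur = [], []
    for line in text.splitlines():
        if not line.strip():
            if cur:
                sents.append(cur)
                cur = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # plain token IDs only
                cur.append((cols[1], cols[2], cols[3]))
    if cur:
        sents.append(cur)
    return sents

# Hypothetical one-token sentence in CoNLL-U form.
sample = ("# sent_id = s1\n"
          "1\tParliament\tparliament\tNOUN\t_\t_\t0\troot\t_\t_\n")
parsed = read_conllu(sample)
```

A full-featured pipeline would of course use an existing CoNLL-U library and keep the remaining columns (features, heads, dependency relations) as well.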
This paper presents the ParlaMint corpora containing transcriptions of the sessions of 17 European national parliaments with half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.
Currently the BulTreeBank comprises 214 000 tokens, a little more than 15 000 sentences. Each token is annotated with morphosyntactic information. Additionally, named entities are annotated with ontological classes such as person, organization, location, and other. Based on HPSG theory, the annotation scheme defines a number of phrase types which reflect both the constituent structure and the head-dependent relation. Thus we have phrase labels that make the dependent types explicit, such as VPC (verbal head-complement phrase), VPS (verbal head-subject phrase), VPA (verbal head-adjunct phrase), NPA (nominal head-adjunct phrase), etc. Beyond the constituent structures and the head-dependent relations, the treebank also represents phenomena like coordination, ellipsis, pro-drop, word order, secondary predication and control – see (Simov and Osenova 2003). We will focus on some of them in this demo presentation. The treebank is encoded in XML.
This paper presents a corpus survey of the metaphor types in a contemporary Bulgarian corpus of Parliamentary Sessions. The corpus partially spans the pandemic period (Nov. 2019 - July 2020). The following processes are described: the extraction of the contexts of the selected key words 'virus, coronavirus, pandemia, epidemia, COVID-19, COVID, corona'; the classification method employed for dividing the detected contexts into the relevant metaphor frames; the language expressions associated with the corresponding metaphor type; and the frequency of the used key words and metaphor frames. Some problems are outlined with respect to this kind of study. The most frequent metaphor frames in our data turned out to be CONTROL and WAR.
The paper continues investigations on the application of bidirectional echo state networks (BiESN) to the task of word sense disambiguation (WSD). Motivated by observations that the quality of the embedding vectors used to train the models significantly influences their accuracy, here we propose the application of a single ESN reservoir to generate new, potentially better embedding vectors with different dimensions. BiESN models for WSD with various reservoir sizes were trained using various combinations of new and original embedding models for the input and/or output steps; the achieved accuracy is reported here. The results demonstrate increased WSD accuracy for several of the newly derived embedding sets.
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages. Detailed evaluation information is available on the shared task web page.
The notion of catena was introduced originally to represent the syntactic structure of multiword expressions with idiosyncratic semantics and non-constituent structure. Later on, several other phenomena (such as ellipsis, verbal complexes, etc.) were formalized as catenae. This naturally led to the suggestion that a catena can be considered a basic unit of syntax. In this paper we present a formalization of catenae and the main operations over them for modelling the combinatorial potential of units in dependency grammar.
By means of an online survey, we have investigated the ways in which various types of multiword expressions are annotated in existing treebanks. The results indicate considerable variation in treatments across treebanks and thereby also, to some extent, across languages and theoretical frameworks. The comparison focuses on the annotation of light verb constructions and verbal idioms. The survey shows that light verb constructions either receive special annotations as such or are treated as ordinary verbs, while VP idioms are handled through different strategies. Based on insights from our investigation, we propose some general guidelines for annotating multiword expressions in treebanks. The recommendations address the following application-based needs: distinguishing MWEs from similar but compositional constructions; searching for distinct types of MWEs in treebanks; awareness of literal and non-literal meanings; and normalization of the MWE representation. The cros...
The paper introduces the Political Speech Corpus of Bulgarian. First, its current state is discussed with respect to size, coverage, genre specification and related online services. Then the focus shifts to the annotation details. On the one hand, the layers of linguistic annotation are presented; on the other hand, the compatibility with the CLARIN technical infrastructure is explained. Also, some user-based scenarios are mentioned to demonstrate the corpus services and applicability.
The paper reports on the use of deep learning methods for improving a Named Entity Recognition (NER) training corpus and for predicting and annotating new types in a test corpus. We show how the annotations in a type-based corpus of named entities (NE) were populated as occurrences within it, thus ensuring the density of the training information. A deep learning model was adopted for discovering inconsistencies in the initial annotation and for learning new NE types. The evaluation results improve after data curation, randomization and deduplication.
This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is common to all parallel corpora, with translations in five languages, namely Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, as well as annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.
We explore lexical choice in Natural Language Generation (NLG) by implementing a model that uses both context and frequency information. Our model chooses a lemma given a WordNet synset in the abstract representations that are the input for generation. In order to find the correct lemma in its context, we map underspecified dependency trees to Hidden Markov Trees that take into account the probability of a lemma given its governing lemma, as well as the probability of a word sense given a lemma. A tree-modified Viterbi algorithm is then used to find the most probable hidden tree containing the most appropriate lemmas in the given context. Further processing ensures that the correct morphological realization for the given lemma is produced. We evaluate our model by comparing it to a statistical transfer component in a Machine Translation system for English to Dutch. In this set-up, the word senses are determined during English analysis, and then our model is used to select th...
This paper proposes a combined model for POS tagging, dependency parsing and co-reference resolution for Bulgarian — a pro-drop Slavic language with rich morphosyntax. We formulate an extension of the MSTParser algorithm that allows the simultaneous handling of the three tasks in a way that makes it possible for each task to benefit from the information available to the others, and conduct a set of experiments against a treebank of the Bulgarian language. The results indicate that the proposed joint model achieves state-of-the-art performance for the POS tagging task, and outperforms the current pipeline solution.
In this paper we present a system for experimenting with combinations of dependency parsers. The system supports initial training of different parsing models, creation of parsebank(s) with these models, and different strategies for the construction of ensemble models aimed at improving the output of the individual models by voting. The system employs two algorithms for the construction of dependency trees from several parses of the same sentence and several ways of ranking the arcs in the resulting trees. We have performed experiments with state-of-the-art dependency parsers including MaltParser, MSTParser, TurboParser, and MATEParser, on the data from the Bulgarian treebank -- BulTreeBank. Our best result from these experiments is slightly better than the best result reported in the literature for this language.
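The voting idea behind such ensembles can be sketched as a per-token majority vote over the head predictions of several parsers. This is only an illustration of the arc-ranking step, not the system's actual algorithms: a real ensemble must also guarantee that the voted arcs form a tree (e.g. with a maximum-spanning-tree repair), and the parser outputs below are invented.

```python
from collections import Counter

def vote_heads(parses):
    """Majority vote over per-token head predictions.

    parses: list of head sequences, one per parser; parses[p][i] is the
    head index proposed for token i (0 = root). Ties are broken by the
    order in which heads were first proposed (Counter insertion order).
    """
    n = len(parses[0])
    voted = []
    for i in range(n):
        votes = Counter(p[i] for p in parses)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Three hypothetical parsers on a 4-token sentence.
p1 = [2, 0, 2, 3]
p2 = [2, 0, 2, 2]
p3 = [3, 0, 2, 3]
ensemble = vote_heads([p1, p2, p3])  # [2, 0, 2, 3]
```

Weighting each parser's vote by its accuracy on a development set is a common refinement of this plain majority scheme.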
Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.
@Book{AdaptLRTtoND:2009,
  editor    = {Núria Bel, Pompeu Fabra University and Erhard Hinrichs, Tuebingen University and Petya Osenova, Bulgarian Academy of Sciences and Sofia University and Kiril Simov, Bulgarian Academy of Sciences},
  title     = {Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains},
  month     = {September},
  year      = {2009},
  address   = {Borovets, Bulgaria},
  publisher = {Association for Computational Linguistics},
  url       = {http://www.aclweb.org/anthology/W09-4100}
}
@InProceedings ...
In this paper we present an approach for the enrichment of WSD knowledge bases with data-driven relations from a gold standard corpus (annotated with word senses, valency information, syntactic analyses, etc.). We focus on Bulgarian as a use case, but our approach is scalable to other languages as well. For the purpose of exploring such methods, the Personalized PageRank algorithm was used. The reported results show that the addition of new knowledge improves the accuracy of WSD by approximately 10.5%.
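The Personalized PageRank step can be sketched with a small power iteration: context words act as teleport seeds, and the sense nodes that accumulate the most mass win. The toy graph and node names below are invented for illustration; real knowledge-based WSD runs this over the full WordNet graph (e.g. via UKB).

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """Power-iteration Personalized PageRank over a directed toy graph.

    graph: dict node -> list of neighbours; seeds: context nodes whose
    mass is restored on every teleport step.
    """
    nodes = list(graph)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    seed_mass = 1.0 / len(seeds)
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        for v in nodes:
            if graph[v]:
                share = rank[v] / len(graph[v])
                for u in graph[v]:
                    new[u] += damping * share
        for s in seeds:
            new[s] += (1.0 - damping) * seed_mass
        rank = new
    return rank

# Hypothetical sense graph: two senses of 'bank' plus context words.
toy = {
    "bank#finance": ["money", "river"],
    "bank#river":   ["river"],
    "money":        ["bank#finance"],
    "river":        ["bank#finance", "bank#river"],
}
rank = personalized_pagerank(toy, seeds=["money"])
# With 'money' as context, the finance sense outranks the river sense.
```

Enriching the knowledge base, as proposed above, amounts to adding edges to this graph, which redirects where the random surfer's mass flows.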
The paper focuses on the modelling of multiword expressions (MWE) in Bulgarian-English parallel news corpora (SETimes; CSLI dataset and PennTreebank dataset). Observations were made on alignments in which at least one multiword expression was used per language. The multiword expressions were classified with respect to the PARSEME lexicon-based (WG1) and treebank-based (WG4) classifications. The non-MWE counterparts of MWEs are also considered. Our approach is data-driven because the data of this study was retrieved from parallel corpora and not from bilingual dictionaries. The survey shows that the predominant translation relation between Bulgarian and English is MWE-to-word, and that this relation does not exclude other translation options. To formalize our observations, a catenae-based modelling of the parallel pairs is proposed.
The paper presents the strategies and principles for converting BulTreeBank into the Universal Dependencies annotation scheme. The mappings are discussed from a linguistic and a technical point of view. The mapping from the original resource to the new one has been done on the morphological and syntactic levels. The first release of the treebank was issued in May 2015. It contains 125 000 tokens, which cover roughly half of the corpus data.
In this paper, we report the results of two constituency parsers trained on BulTreeBank, an HPSG-based treebank for Bulgarian. To reduce the data sparsity problem, we propose using Brown word clustering to perform off-line clustering and map the words in the treebank, creating a class-based treebank. The observations show that when the classes outnumber the POS tags, the results are better. Since this approach adds another dimension of abstraction (in comparison to the lemma), its coarse-grained representation can be used further for training statistical parsers.
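The class-based treebank construction reduces to a token-level substitution once the clustering has been run: every word form is replaced by its cluster id. A minimal sketch, assuming a hypothetical hand-made word-to-class mapping (real Brown cluster ids are binary-path strings produced by the clustering tool):

```python
def to_class_treebank(sentences, word2class, unk="C_UNK"):
    """Map every token in a treebank to its Brown-cluster class id.

    word2class: the word -> class mapping from a Brown clustering run
    (invented here for illustration). Unknown words fall back to a
    single UNK class, which is how data sparsity is reduced.
    """
    return [[word2class.get(w, unk) for w in sent] for sent in sentences]

# Toy mapping and sentences (Bulgarian words, hypothetical classes).
word2class = {"котка": "0110", "куче": "0110", "тича": "1011"}
sents = [["котка", "тича"], ["куче", "спи"]]
class_sents = to_class_treebank(sents, word2class)
# [['0110', '1011'], ['0110', 'C_UNK']]
```

A parser trained on `class_sents` sees far fewer distinct symbols than one trained on raw word forms, which is the source of the sparsity reduction discussed above.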
(with regard to the automatic processing of natural language)
This report is an extension of a paper published at the RANLP 2001 conference. The paper improves on the work done on POS disambiguation for Bulgarian via neural networks (Vlasseva 1999). Our improvements are in several directions: (1) we extended the range of grammatical features predicted by the system to cover almost all paradigmatic members of Bulgarian words; (2) we changed the encoding schemata for grammatical features in order to minimize the computation and to make more extensive use of the context layer of the network; (3) we changed the evaluation of the network output in order to minimize the side effects of evaluating cases that are not relevant in a particular instance of ambiguity. Besides the improvements in using neural networks, we also improved the choice of the training corpus and added a rule-based preprocessing component in order to disambiguate the cases for which there are rules ensuring 100% correct results.