Skip to main content

Vincent Vandeghinste

KU Leuven, Department of Linguistics, Post-Doc

Followers

43

Following

25

Co-authors

16

Public Views

Interests

Uploads

Papers by Vincent Vandeghinste

A Comparison of Different Punctuation Prediction Approaches in a Translation Context

We test a series of techniques to predict punctuation and its effect on machine translation (MT) ... more We test a series of techniques to predict punctuation and its effect on machine translation (MT) quality. Several techniques for punctuation prediction are compared: language modeling techniques, such as n-grams and long short-term memories (LSTM), sequence labeling LSTMs (unidirectional and bidirectional), and monolingual phrase-based, hierarchical and neural MT. For actual translation, phrase-based, hierarchical and neural MT are investigated. We observe that for punctuation prediction, phrase-based statistical MT and neural MT reach similar results, and are best used as a preprocessing step which is followed by neural MT to perform the actual translation. Implicit punctuation insertion by a dedicated neural MT system, trained on unpunctuated source and punctuated target, yields similar results.

Treebank querying with GrETEL 3 : bigger, faster, stronger

We describe the new version of GrETEL (http://gretel.ccl.kuleuven.be/gretel3), an online tool whi... more We describe the new version of GrETEL (http://gretel.ccl.kuleuven.be/gretel3), an online tool which allows users to query treebanks by means of a natural language example (example-based search) or via a formal query (XPath search). The new release comprises an update to the interface and considerable improvements in the back-end search mechanism. The update of the front-end is based on user suggestions. In addition to an overall design update, major changes include a more intuitive query builder in the example-based search mode and a visualizer for syntax trees that is compatible with all modern browsers. Moreover, the results are presented to the user as soon as they are found, so users can browse the matching sentences before the treebank search is completed. We will demonstrate that those changes considerably improve the query procedure. The update of the back-end mainly includes optimizing the search algorithm for querying the (very) large SoNaR treebank. Querying this 500-milli...

Finding your way through the woods with GrETEL

E-Inclusion of Functionally Illiterate Users by the use of Language Technology

Social media websites have radically changed the way in which we access and share information. Ho... more Social media websites have radically changed the way in which we access and share information. However, people with Intellectual Disabilities (ID) have very limited access to the currently available technological tools, such as email clients or Facebook. We describe how the Able to Include project is changing this situation, using various Natural Language Processing (NLP) technologies within the framework of a contextaware Accessibility Layer. More particularly, in this paper, we will focus on the set of tools that translate written text into pictographs and vice versa. Additionally, we will explain how the different pilot studies that are conducted within the project guide us in improving our technologies.No ISSNstatus: publishe

Automating lexical simplification in Dutch

We discuss the design, development and evaluation of an automated lexical simplification tool for... more We discuss the design, development and evaluation of an automated lexical simplification tool for Dutch. A basic pipeline approach is used to perform both text adaptation and annotation. First, sentences are preprocessed and word sense disambiguation is performed. Then, the difficulty of each token is estimated by looking at their average age of acquisition and frequency in a corpus of simplified Dutch. We use Cornetto to find synonyms of words that have been identified as difficult and the SONAR500 corpus to perform reverse lemmatisation. Finally, we rely on a largescale language model to verify whether the selected replacement word fits the local context. In addition, the text is augmented with information from Wikipedia (word definitions and links). We tune and evaluate the system with sentences taken from the Flemish newspaper De Standaard. The results show that the system’s adaptation component has low coverage, since it only correctly simplifies around one in five ‘difficult’ ...

Improving Text-to-Pictograph Translation Through Word Sense Disambiguation

Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, 2016

Deyi Xiong and Min Zhang: Linguistically motivated statistical machine translation: models and algorithms

Machine Translation, 2015

Linguistically Motivated Statistical Machine Translation, written by Deyi Xiong and Min Zhang is ... more Linguistically Motivated Statistical Machine Translation, written by Deyi Xiong and Min Zhang is an overview of (mostly) already publishedwork by the same researchers, rewritten into a coherent book that explains how several different research aspects fit into one research paradigm. The end of each chapter contains an additional readings section referring to related work on the mentioned topics. The book is clearly not a beginner’s guide to machine translation, as it spends no time on traditional rule-based machine translation and only about 5 pages on the standard models in statistical machine translation, including the hierarchical models. It feels like a sequel to Koehn’s (2010) Statistical Machine Translation, and I would suggest to first read that book before starting on this one. The book sets out with a number of challenges that current phrase-based SMTmodels are facing and that cannot be properly solved by just adding more data or raising the n-gram order of the models. The first such challenge is the lexical challenge, in which a translation pattern is triggered by specific linguistic items at the lexicon level, such as lexical selection and lexical reordering. An example of the former is the translation of the word bank, which can have two entirely different meanings; an example of the latter is the fact that the Chinese particle de often triggers a swapping where the modifier to its left is moved towards its right after translation. The second challenge is the syntactic challenge in which syntactic categories and constituent order diverge between source and target language. For instance a noun phrase as a verbal object in the source may translate into a prepositional phrase (PP) in the target, or a Chinese verb phrase (VP) consisting of a PP followed by a VP are frequently translated as a VP consisting of a VP followed by a PP. The third challenge is the semantic challenge,

CLARIN Flanders: new prospects

We describe the creation of CLARIN Belgium (CLARIN-BE) and, associated with that, the plans<br... more

RU Groningen KU Leuven KU Leuven Background

The Dutch Language Corpus Initiative (D-Coi) is one of the projects funded within the current STE... more The Dutch Language Corpus Initiative (D-Coi) is one of the projects funded within the current STEVIN programme. 1 The construction of a 500-millionword reference corpus of written Dutch has been identified as one of the priorities in the programme. In D-Coi, a 50-million-word pilot corpus is being compiled, parts of which will be enriched with (verified) linguistic annotations. In particular, syntactic annotation of a representative subcorpus of 200.000 words is envisaged. The focus is on written language in order to complement the Spoken Dutch Corpus (CGN). CGN contains a subcorpus of 1 million words with syntactic annotations. During the construction of this corpus, no syntactically annotated corpus of Dutch was available to train a statistical parser on, nor an adequate parser for Dutch (requirements: wide-coverage, theory-neutral output, access to both functional and categorial information). This situation has changed considerably since then. Over the last few years, Alpino was ...

Pictograph-to-Text Translation for Augmented and Alternative Communication

Automated Spelling Correction for Dutch Internet Users with Intellectual Disabilities

Assessing linguistically aware fuzzy matching in translation memories

The concept of fuzzy matching in translation memories can take place using linguistically aware o... more The concept of fuzzy matching in translation memories can take place using linguistically aware or unaware methods, or a combination of both. We designed a flexible and time-efficient framework which applies and combines linguistically unaware or aware metrics in the source and target language. We measure the correlation of fuzzy matching metric scores with the evaluation score of the suggested translation to find out how well the usefulness of a suggestion can be predicted, and we measure the difference in recall between fuzzy matching metrics by looking at the improvements in mean TER as the match score decreases. We found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics we have tested.

M3TRA: integrating TM and MT for professional translators

Translation memories (TM) and machine translation (MT) both are potentially useful resources for ... more Translation memories (TM) and machine translation (MT) both are potentially useful resources for professional translators, but they are often still used independently in translation workflows. As translators tend to have a higher confidence in fuzzy matches than in MT, we investigate how to combine the benefits of TM retrieval with those of MT, by integrating the results of both. We develop a flexible TM-MT integration approach based on various techniques combining the use of TM and MT, such as fuzzy repair, span pretranslation and exploiting multiple matches. Results for ten language pairs using the DGT-TM dataset indicate almost consistently better BLEU, METEOR and TER scores compared to the MT, TM and NMT baselines.

Improved treebank querying: a facelift for GrETEL

We describe the improvements to the interface of GrETEL, an online tool for querying treebanks. W... more

Less is more: A rule-based syntactic simplification module for improved text-to-pictograph translation

Data & Knowledge Engineering, 2018

E-Including the Illiterate

IEEE Potentials, 2017

In present-day society, we communicate over the Internet in several media forms. We put videos an... more In present-day society, we communicate over the Internet in several media forms. We put videos and images online, listen to music made by famous bands or by our friends, and read and write a lot of text. Never in the history of mankind have we produced more text than at this present moment, so being able to read and write is an important way of taking part in our society. We tend to forget that, even in our educated communities, not all people can read or write and there exist several degrees of literateness. People with reduced cognitive capacities and those migrating from cultures with a different language, or even a completely different writing system, are excluded from fully taking part in written online communication: they are e-excluded.

Number agreement in copular constructions: A treebank-based investigation

Lingua, 2016

This paper has both a theoretical and a methodological objective. The theoretical one concerns th... more This paper has both a theoretical and a methodological objective. The theoretical one concerns the modeling of number agreement in copular constructions. For that purpose it adopts the distinction, familiar from Head-driven Phrase Structure Grammar, between morpho-syntactic agreement (also known as concord) and index agreement. The methodological objective concerns the demonstration of how treebanks can be exploited in order to guide the formulation of relevant generalizations. For that purpose we crucially rely on tools and resources that have recently been developed in the framework of the Dutch-Flemish stevin program (2004–2011) and the European clarin infrastructure.

Improving the Precision of Synset Links Between Cornetto and Princeton WordNet

Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing, 2014

Automatic transcription and normalisation of speech

Semantics-based pretranslation for SMT using fuzzy matches

Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2015

A Comparison of Different Punctuation Prediction Approaches in a Translation Context

We test a series of techniques to predict punctuation and its effect on machine translation (MT) ... more We test a series of techniques to predict punctuation and its effect on machine translation (MT) quality. Several techniques for punctuation prediction are compared: language modeling techniques, such as n-grams and long short-term memories (LSTM), sequence labeling LSTMs (unidirectional and bidirectional), and monolingual phrase-based, hierarchical and neural MT. For actual translation, phrase-based, hierarchical and neural MT are investigated. We observe that for punctuation prediction, phrase-based statistical MT and neural MT reach similar results, and are best used as a preprocessing step which is followed by neural MT to perform the actual translation. Implicit punctuation insertion by a dedicated neural MT system, trained on unpunctuated source and punctuated target, yields similar results.

Treebank querying with GrETEL 3 : bigger, faster, stronger

We describe the new version of GrETEL (http://gretel.ccl.kuleuven.be/gretel3), an online tool whi... more We describe the new version of GrETEL (http://gretel.ccl.kuleuven.be/gretel3), an online tool which allows users to query treebanks by means of a natural language example (example-based search) or via a formal query (XPath search). The new release comprises an update to the interface and considerable improvements in the back-end search mechanism. The update of the front-end is based on user suggestions. In addition to an overall design update, major changes include a more intuitive query builder in the example-based search mode and a visualizer for syntax trees that is compatible with all modern browsers. Moreover, the results are presented to the user as soon as they are found, so users can browse the matching sentences before the treebank search is completed. We will demonstrate that those changes considerably improve the query procedure. The update of the back-end mainly includes optimizing the search algorithm for querying the (very) large SoNaR treebank. Querying this 500-milli...

Finding your way through the woods with GrETEL

E-Inclusion of Functionally Illiterate Users by the use of Language Technology

Social media websites have radically changed the way in which we access and share information. Ho... more Social media websites have radically changed the way in which we access and share information. However, people with Intellectual Disabilities (ID) have very limited access to the currently available technological tools, such as email clients or Facebook. We describe how the Able to Include project is changing this situation, using various Natural Language Processing (NLP) technologies within the framework of a contextaware Accessibility Layer. More particularly, in this paper, we will focus on the set of tools that translate written text into pictographs and vice versa. Additionally, we will explain how the different pilot studies that are conducted within the project guide us in improving our technologies.No ISSNstatus: publishe

Automating lexical simplification in Dutch

We discuss the design, development and evaluation of an automated lexical simplification tool for... more We discuss the design, development and evaluation of an automated lexical simplification tool for Dutch. A basic pipeline approach is used to perform both text adaptation and annotation. First, sentences are preprocessed and word sense disambiguation is performed. Then, the difficulty of each token is estimated by looking at their average age of acquisition and frequency in a corpus of simplified Dutch. We use Cornetto to find synonyms of words that have been identified as difficult and the SONAR500 corpus to perform reverse lemmatisation. Finally, we rely on a largescale language model to verify whether the selected replacement word fits the local context. In addition, the text is augmented with information from Wikipedia (word definitions and links). We tune and evaluate the system with sentences taken from the Flemish newspaper De Standaard. The results show that the system’s adaptation component has low coverage, since it only correctly simplifies around one in five ‘difficult’ ...

Improving Text-to-Pictograph Translation Through Word Sense Disambiguation

Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, 2016

Deyi Xiong and Min Zhang: Linguistically motivated statistical machine translation: models and algorithms

Machine Translation, 2015

Linguistically Motivated Statistical Machine Translation, written by Deyi Xiong and Min Zhang is ... more Linguistically Motivated Statistical Machine Translation, written by Deyi Xiong and Min Zhang is an overview of (mostly) already publishedwork by the same researchers, rewritten into a coherent book that explains how several different research aspects fit into one research paradigm. The end of each chapter contains an additional readings section referring to related work on the mentioned topics. The book is clearly not a beginner’s guide to machine translation, as it spends no time on traditional rule-based machine translation and only about 5 pages on the standard models in statistical machine translation, including the hierarchical models. It feels like a sequel to Koehn’s (2010) Statistical Machine Translation, and I would suggest to first read that book before starting on this one. The book sets out with a number of challenges that current phrase-based SMTmodels are facing and that cannot be properly solved by just adding more data or raising the n-gram order of the models. The first such challenge is the lexical challenge, in which a translation pattern is triggered by specific linguistic items at the lexicon level, such as lexical selection and lexical reordering. An example of the former is the translation of the word bank, which can have two entirely different meanings; an example of the latter is the fact that the Chinese particle de often triggers a swapping where the modifier to its left is moved towards its right after translation. The second challenge is the syntactic challenge in which syntactic categories and constituent order diverge between source and target language. For instance a noun phrase as a verbal object in the source may translate into a prepositional phrase (PP) in the target, or a Chinese verb phrase (VP) consisting of a PP followed by a VP are frequently translated as a VP consisting of a VP followed by a PP. The third challenge is the semantic challenge,

CLARIN Flanders: new prospects

We describe the creation of CLARIN Belgium (CLARIN-BE) and, associated with that, the plans<br... more

RU Groningen KU Leuven KU Leuven Background

The Dutch Language Corpus Initiative (D-Coi) is one of the projects funded within the current STE... more The Dutch Language Corpus Initiative (D-Coi) is one of the projects funded within the current STEVIN programme. 1 The construction of a 500-millionword reference corpus of written Dutch has been identified as one of the priorities in the programme. In D-Coi, a 50-million-word pilot corpus is being compiled, parts of which will be enriched with (verified) linguistic annotations. In particular, syntactic annotation of a representative subcorpus of 200.000 words is envisaged. The focus is on written language in order to complement the Spoken Dutch Corpus (CGN). CGN contains a subcorpus of 1 million words with syntactic annotations. During the construction of this corpus, no syntactically annotated corpus of Dutch was available to train a statistical parser on, nor an adequate parser for Dutch (requirements: wide-coverage, theory-neutral output, access to both functional and categorial information). This situation has changed considerably since then. Over the last few years, Alpino was ...

Pictograph-to-Text Translation for Augmented and Alternative Communication

Automated Spelling Correction for Dutch Internet Users with Intellectual Disabilities

Assessing linguistically aware fuzzy matching in translation memories

The concept of fuzzy matching in translation memories can take place using linguistically aware o... more The concept of fuzzy matching in translation memories can take place using linguistically aware or unaware methods, or a combination of both. We designed a flexible and time-efficient framework which applies and combines linguistically unaware or aware metrics in the source and target language. We measure the correlation of fuzzy matching metric scores with the evaluation score of the suggested translation to find out how well the usefulness of a suggestion can be predicted, and we measure the difference in recall between fuzzy matching metrics by looking at the improvements in mean TER as the match score decreases. We found that combinations of fuzzy matching metrics outperform single metrics and that the best-scoring combination is a non-linear combination of the different metrics we have tested.

M3TRA: integrating TM and MT for professional translators

Translation memories (TM) and machine translation (MT) both are potentially useful resources for ... more Translation memories (TM) and machine translation (MT) both are potentially useful resources for professional translators, but they are often still used independently in translation workflows. As translators tend to have a higher confidence in fuzzy matches than in MT, we investigate how to combine the benefits of TM retrieval with those of MT, by integrating the results of both. We develop a flexible TM-MT integration approach based on various techniques combining the use of TM and MT, such as fuzzy repair, span pretranslation and exploiting multiple matches. Results for ten language pairs using the DGT-TM dataset indicate almost consistently better BLEU, METEOR and TER scores compared to the MT, TM and NMT baselines.

Improved treebank querying: a facelift for GrETEL

We describe the improvements to the interface of GrETEL, an online tool for querying treebanks. W... more

Less is more: A rule-based syntactic simplification module for improved text-to-pictograph translation

Data & Knowledge Engineering, 2018

E-Including the Illiterate

IEEE Potentials, 2017

In present-day society, we communicate over the Internet in several media forms. We put videos an... more In present-day society, we communicate over the Internet in several media forms. We put videos and images online, listen to music made by famous bands or by our friends, and read and write a lot of text. Never in the history of mankind have we produced more text than at this present moment, so being able to read and write is an important way of taking part in our society. We tend to forget that, even in our educated communities, not all people can read or write and there exist several degrees of literateness. People with reduced cognitive capacities and those migrating from cultures with a different language, or even a completely different writing system, are excluded from fully taking part in written online communication: they are e-excluded.

Number agreement in copular constructions: A treebank-based investigation

Lingua, 2016

This paper has both a theoretical and a methodological objective. The theoretical one concerns th... more This paper has both a theoretical and a methodological objective. The theoretical one concerns the modeling of number agreement in copular constructions. For that purpose it adopts the distinction, familiar from Head-driven Phrase Structure Grammar, between morpho-syntactic agreement (also known as concord) and index agreement. The methodological objective concerns the demonstration of how treebanks can be exploited in order to guide the formulation of relevant generalizations. For that purpose we crucially rely on tools and resources that have recently been developed in the framework of the Dutch-Flemish stevin program (2004–2011) and the European clarin infrastructure.

Improving the Precision of Synset Links Between Cornetto and Princeton WordNet

Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing, 2014

Automatic transcription and normalisation of speech

Semantics-based pretranslation for SMT using fuzzy matches

Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2015

Challenges with Sign Language Datasets for Sign Language Recognition and Translation

by Vincent Vandeghinste and Mirella De Sisto

LREC2022 Proceedings, 2022

Sign Languages (SLs) are the primary means of communication for at least half a million people in... more Sign Languages (SLs) are the primary means of communication for at least half a million people in Europe alone. However, the
development of SL recognition and translation tools is slowed down by a series of obstacles concerning resource scarcity and
standardization issues in the available data. The former challenge relates to the volume of data available for machine learning as well
as the time required to collect and process new data. The latter obstacle is linked to the variety of the data, i.e., annotation formats are
not unified and vary amongst different resources. The available data formats are often not suitable for machine learning, obstructing
the provision of automatic tools based on neural models. In the present paper, we give an overview of these challenges by comparing
various SL corpora and SL machine learning datasets. Furthermore, we propose a framework to address the lack of standardization at
format level, unify the available resources and facilitate SL research for different languages. Our framework takes ELAN files as inputs
and returns textual and visual data ready to train SL recognition and translation models. We present a proof of concept, training neural
translation models on the data produced by the proposed framework.