The Factored Language Model (FLM) is a flexible framework for incorporating various information sources, such as morphology and part-of-speech, into language modeling. FLMs have so far been successfully applied to tasks such as speech recognition and machine translation; they have the potential to be used in a wide variety of problems in estimating probability tables from sparse data. This tutorial serves as a comprehensive description of FLMs and related algorithms.
An illustrative frustration processing system modifies the operation of a target system to improve its performance. In one case, the frustration processing system receives express indications that a user is frustrated in the course of interacting with the target system. The frustration processing system responds to these indications by modifying the operation of the target system to reduce the likelihood that the user will be frustrated in the future.
Wikis have enabled Web users to author and edit documents in a collaborative manner. In many cases, such as Wikipedia and Wikibooks, they have been used to host a set of parallel or comparable documents written in different languages. While a wiki provides an environment in which editors can work together efficiently, maintaining a set of multilingual documents is still a very demanding task for the editors.
Japanese sentences have completely different word orders from corresponding English sentences. Typical phrase-based statistical machine translation (SMT) systems such as Moses search for the best word permutation within a given distance limit (distortion limit). For English-to-Japanese translation, we need a large distance limit to obtain acceptable translations, and the number of translation candidates is extremely large. Therefore, SMT systems often fail to find acceptable translations within a limited time.
We propose a framework to assist Wikipedia editors in transferring information among different languages. First, with the help of machine translation tools, we analyse the texts in two different language editions of an article and identify information that is only available in one edition. Next, we propose an algorithm to look for the most probable position in the other edition where the new information can be inserted. We show that our method can accurately suggest positions for new information.
English is a typical SVO (Subject-Verb-Object) language, while Japanese is a typical SOV language. Conventional Statistical Machine Translation (SMT) systems work well within each of these language families. However, SMT-based translation from an SVO language to an SOV language does not work well because their word orders are completely different. Recently, a few groups have proposed rule-based preprocessing methods to mitigate this problem (Xu et al., 2009; Hong et al., 2009).
Natural language processing technology for the dialects of Arabic is still in its infancy, due to the problem of obtaining large amounts of text data for spoken Arabic. In this paper we describe the development of a part-of-speech (POS) tagger for Egyptian Colloquial Arabic. We adopt a minimally supervised approach that only requires raw text data from several varieties of Arabic and a morphological analyzer for Modern Standard Arabic. No dialect-specific tools are used.
We present new statistical models for jointly labeling multiple sequences and apply them to the combined task of part-of-speech tagging and noun phrase chunking. The model is based on the Factorial Hidden Markov Model (FHMM) with distributed hidden states representing part-of-speech and noun phrase sequences. We demonstrate that this joint labeling approach, by enabling information sharing between tagging/chunking subtasks, outperforms the traditional method of tagging and chunking in succession.
This paper describes the NAIST statistical machine translation system for the IWSLT2012 Evaluation Campaign. We participated in all TED Talk tasks, for a total of 11 language pairs. For all tasks, we use the Moses phrase-based decoder and its experiment management system as a common base for building translation systems.
This paper proposes a novel method for long-distance, clause-level reordering in statistical machine translation (SMT). The proposed method separately translates clauses in the source sentence and reconstructs the target sentence using the clause translations with non-terminals. The non-terminals are placeholders for embedded clauses, by which we reduce complicated clause-level reordering into simple word-level reordering.
We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g., BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality.
Papers by Kevin Duh