

WORKSHOP PROGRAMME
Monday 17 May 2010
9:15-9:30 Welcome and Introduction
Khalid Choukri, Owen Rambow, Bente Maegaard, and Ibrahim A. Al-Kharashi
Oral Session 1: Syntax, Semantics, and Parsing
9:30-9:50 Structures and Procedures in Arabic Language
André Jaccarini (1), Christian Gaubert (2), Claude Audebert (1)
(1)Maison méditerranéenne des sciences de l’homme (MMSH), France
(2)Institut français d’archéologie orientale du Caire (IFAO), Cairo, Egypt
9:50-10:10 Developing and Evaluating an Arabic Statistical Parser
Ibrahim Zaghloul (1) and Ahmed Rafea (2)
(1) Central Lab for Agricultural Expert Systems, Agricultural Research Center, Ministry of
Agriculture and Land Reclamation.
(2) Computer Science and Engineering Dept., American University in Cairo
10:10-10:30 A Dependency Grammar for Amharic
Michael Gasser
Indiana University, USA
10:30-11:00 Coffee break

11:00-12:20 Poster Session 1: Morphology & NLP Applications I


A syllable-based approach to Semitic verbal morphology
Lynne Cahill,
University of Brighton, United Kingdom
Using the Yago ontology as a resource for the enrichment of Named Entities in Arabic
WordNet
Lahsen Abouenour (1), Karim Bouzoubaa (1) and Paolo Rosso (2)
(1) Mohammadia School of Engineers, Med V University Rabat, Morocco
(2)Natural Language Engineering Lab. - ELiRF, Universidad Politécnica Valencia, Spain
Light Morphology Processing for Amazighe Language
Fadoua Ataa Allah and Siham Boulaknadel
CEISIC, IRCAM, Madinat Al Irfane, Rabat, Morocco
Using Mechanical Turk to Create a Corpus of Arabic Summaries
Mahmoud EL-Haj, Udo Kruschwitz and Chris Fox
School of Computer Science and Electronic Engineering, University of Essex, United Kingdom
DefArabicQA: Arabic Definition Question Answering System
Omar Trigui (1), Lamia Hadrich Belguith (1) and Paolo Rosso (2)
(1) ANLP Research Group- MIRACL Laboratory, University of Sfax, Tunisia
(2) Natural Language Engineering Lab. – EliRF, Universidad Politécnica Valencia, Spain

12:20-13:50 Lunch break

13:50-15:10 Poster Session 2: Morphology & NLP Applications and NLP Tools
Techniques for Arabic Morphological Detokenization and Orthographic Denormalization
Ahmed El Kholy and Nizar Habash
Center for Computational Learning Systems, Columbia University, USA
Tagging Amazigh with AncoraPipe
Mohamed Outahajala (1), Lahbib Zenkouar (2), Paolo Rosso (3) and Antònia Martí (4)
(1) IRCAM,
(2) Mohammadia School of Engineers, Med V University Rabat, Morocco,
(3) Natural Language Engineering Lab. - ELiRF, Universidad Politécnica Valencia, Spain,
(4) CLiC - Centre de Llenguatge i Computació, Universitat de Barcelona, Barcelona, Spain

Verb Morphology of Hebrew and Maltese - Towards an Open Source Type Theoretical
Resource Grammar in GF
Dana Dannélls (1) and John J. Camilleri (2)
(1) Department of Swedish Language, University of Gothenburg, Sweden;
(2) Department of Intelligent Computer Systems, University of Malta, Malta
Syllable Based Transcription of English Words into Perso-Arabic Writing System
Jalal Maleki
Dept. of Computer and Information Science, Linköping University, Sweden
COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Al Tantawy and Yassine Benajiba
Center for Computational Learning Systems, Columbia University, USA
A Linguistic Search Tool for Semitic Languages
Alon Itai
Knowledge Center for Processing Hebrew, Computer Science Department, Technion, Haifa,
Israel

13:50-15:10 Poster Session 3: Speech & Related resources


Algerian Arabic Speech database Project (ALGASD): Description and Research
Applications
Ghania Droua-Hamdani (1), Sid Ahmed Selouani (2) and Malika Boudraa (3)
(1) Speech Processing Laboratory (TAP), CRSTDLA, Algiers, Algeria;
(2) LARIHS Laboratory, University of Moncton, Canada;
(3) Speech Communication Laboratory, USTHB, Algiers, Algeria.
Integrating Annotated Spoken Maltese Data into Corpora of Written Maltese
Alexandra Vella (1,2), Flavia Chetcuti (1), Sarah Grech (1) and Michael Spagnol (3)
(1)University of Malta, Malta
(2)University of Cologne, Germany
(3) University of Konstanz, Germany
A Web Application for Dialectal Arabic Text Annotation
Yassine Benajiba and Mona Diab
Center for Computational Learning Systems, Columbia University, USA
Towards a Psycholinguistic Database for Modern Standard Arabic
Sami Boudelaa and William David Marslen-Wilson
MRC-Cognition & Brain Sciences Unit, Cambridge, United Kingdom

Oral Session 2 : Resources and tools for Machine Translation


15:10-15:30 Creating Arabic-English Parallel Word-Aligned Treebank Corpora
Stephen Grimes, Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma and Stephanie Strassel
Linguistic Data Consortium, USA
15:30-15:50 Using English as a Pivot Language to Enhance Danish-Arabic Statistical Machine
Translation
Mossab Al-Hunaity, Bente Maegaard and Dorte Hansen
Center for Language Technology, University of Copenhagen, Denmark
15:50-16:10 Using a Hybrid Word Alignment Approach for Automatic Construction and Updating of
Arabic to French Lexicons
Nasredine Semmar
CEA LIST, France
16:10-16:30 Coffee break
16:30-17:20 General Discussion
Cooperation roadmap for building sustainable Human Language Technologies for the Arabic language within and outside the Arab world.
17:20-17:30 Concluding remarks and Closing

Editors & Workshop Chairs
Workshop general chair:
Khalid Choukri, ELRA/ELDA, Paris, France
Workshop co-chairs:
Owen Rambow, Columbia University, New York, USA
Bente Maegaard, University of Copenhagen, Denmark
Ibrahim A. Al-Kharashi, Computer and Electronics Research Institute, King Abdulaziz City for
Science and Technology, Saudi Arabia

Table of Contents
Structures and Procedures in Arabic Language
André Jaccarini , Christian Gaubert , Claude Audebert 1
Developing and Evaluating an Arabic Statistical Parser
Ibrahim Zaghloul and Ahmed Rafea 7
A Dependency Grammar for Amharic
Michael Gasser 12
A syllable-based approach to Semitic verbal morphology
Lynne Cahill 19
Using the Yago ontology as a resource for the enrichment of Named Entities in Arabic WordNet
Lahsen Abouenour , Karim Bouzoubaa and Paolo Rosso 27
Light Morphology Processing for Amazighe Language
Fadoua Ataa Allah and Siham Boulaknadel 32
Using Mechanical Turk to Create a Corpus of Arabic Summaries
Mahmoud EL-Haj, Udo Kruschwitz and Chris Fox 36
DefArabicQA: Arabic Definition Question Answering System
Omar Trigui , Lamia Hadrich Belguith and Paolo Rosso 40
Techniques for Arabic Morphological Detokenization and Orthographic Denormalization
Ahmed El Kholy and Nizar Habash 45
Tagging Amazigh with AncoraPipe
Mohamed Outahajala , Lahbib Zenkouar , Paolo Rosso and Antònia Martí 52
Verb Morphology of Hebrew and Maltese - Towards an Open Source Type Theoretical Resource Grammar in GF
Dana Dannélls and John J. Camilleri 57
Syllable Based Transcription of English Words into Perso-Arabic Writing System
Jalal Maleki 62
COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Al Tantawy and Yassine Benajiba 66
A Linguistic Search Tool for Semitic Languages
Alon Itai 75

Algerian Arabic Speech database Project (ALGASD): Description and Research Applications
Ghania Droua-Hamdani , Sid Ahmed Selouani and Malika Boudraa 79
Integrating Annotated Spoken Maltese Data into Corpora of Written Maltese
Alexandra Vella, Flavia Chetcuti , Sarah Grech and Michael Spagnol 83
A Web Application for Dialectal Arabic Text Annotation
Yassine Benajiba and Mona Diab 91
Towards a Psycholinguistic Database for Modern Standard Arabic
Sami Boudelaa and William David Marslen-Wilson 99
Creating Arabic-English Parallel Word-Aligned Treebank Corpora
Stephen Grimes, Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma and Stephanie Strassel 102
Using English as a Pivot Language to Enhance Danish-Arabic Statistical Machine Translation
Mossab Al-Hunaity, Bente Maegaard and Dorte Hansen 108
Using a Hybrid Word Alignment Approach for Automatic Construction and Updating of Arabic to French Lexicons
Nasredine Semmar 114

Author Index
Lahsen Abouenour 27
Mohamed Al Tantawy 66
Mossab Al-Hunaity 108
Fadoua Ataa Allah 32
Claude Audebert 1
Yassine Benajiba 66,91
Ann Bies 102
Sami Boudelaa 99
Malika Boudraa 79
Siham Boulaknadel 32
Karim Bouzoubaa 27
Lynne Cahill 19
John J. Camilleri 57
Flavia Chetcuti 83
Dana Dannélls 57
Mona Diab 66,91
Ghania Droua-Hamdani 79
Ahmed El Kholy 45
Mahmoud EL-Haj 36
Chris Fox 36
Michael Gasser 12
Christian Gaubert 1
Sarah Grech 83
Stephen Grimes 102
Nizar Habash 45,66
Lamia Hadrich Belguith 40
Dorte Hansen 108
Alon Itai 75
André Jaccarini 1
Udo Kruschwitz 36
Seth Kulick 102
Xuansong Li 102
Xiaoyi Ma 102
Bente Maegaard 108
Jalal Maleki 62
William David Marslen-Wilson 99
Antònia Martí 52
Mohamed Outahajala 52
Ahmed Rafea 7
Owen Rambow 66
Paolo Rosso 27,40,52
Sid Ahmed Selouani 79
Nasredine Semmar 114
Michael Spagnol 83
Stephanie Strassel 102
Omar Trigui 40
Alexandra Vella 83
Ibrahim Zaghloul 7
Lahbib Zenkouar 52

Structures and Procedures in Arabic Language
André Jaccarini, Christian Gaubert*, Claude Audebert
Maison méditerranéenne des sciences de l'homme (MMSH)
5 rue du Château de l'Horloge, BP 647, 13094 Aix-en-Provence, France
* Institut français d'archéologie orientale du Caire (IFAO)
37 el Cheikh Aly Yousef Str., Cairo, Egypt
E-mail: jaccarini@mmsh.univ-aix.fr, cgaubert@ifao.egnet.net, claude.audebert@gmail.com

Abstract
In order to demonstrate the efficiency of the feedback method for the construction of finite machines (automata and transducers) applied to the Arabic language, and to exhibit the algebraic characterization of this language through the mathematical theory of the Schützenberger school, we have chosen applications linked to several domains: morphological analysis (with or without a lexicon), syntactic analysis, and the construction of operators for information retrieval and noise filtering.
A data bank of finite machines can only be efficient if it is integrated in a computational environment that allows the extraction of these operators (which are fragments, pieces of operational grammars) so that they can be combined in order to synthesise new operators according to need.
We have developed a software tool called Sarfiyya for the manipulation of Arabic automata.
We constructed an extractor of quotations and reported discourse. The evaluation of this automaton will be available online. Sarfiyya was entirely written in Java, which allowed the creation of a Web-based application called Kawâkib, offering, among other functions, root extraction and tool-word detection.
We are now heading towards content analysis and text characterization.
1. General presentation¹
One of the important ideas of our work is that Arabic, and Semitic languages in general, have a particularly high degree of surface "algorithmicity/grammaticalness". We should, however, clarify for this the relation "procedure/structure" (see §2 below). The structural characteristics of the Arabic language are thus also algorithmic. This duality is easily translatable within the framework of the algebraic theory of automata and offers extremely interesting and easily specifiable applicative prospects (the correspondence works in both directions; see §2 below).
A certain deficit of mathematical specification is, indeed, one of the characteristics of the current state of the automatic treatment of Arabic. The theoretical unification made possible by the algebraic theory of automata seems to us particularly interesting for firmly placing Arabic studies in the universe of knowledge which is ours today, namely that of the Turing machines. We thus seek to situate the automatic treatment of Arabic in the algebraic tradition which, in data processing, was especially initiated by M. P. Schützenberger and his school.
In order to support our assumption that this strong algorithmicity/grammaticalness is an important specificity of Arabic, we have shown in our preceding studies (Audebert, Jaccarini 1994; Jaccarini 1997; Gaubert 2001) that the construction of parsers requiring only minimal recourse to the lexicon, and even, in certain cases, completely avoiding the lexicon, did not cause explosions of ambiguities. It is noted indeed that this passage to the limit does not present a significant difference, with regard to the ambiguity coefficients of the considered forms, compared to parsers which resort systematically to the lexicon (see for instance the DIINAR project of the University of Lyon). The fact that context, i.e. syntax, is much more interesting at this level has led us to study in a formal way the passage of information between syntax and morphology.
The minimal recourse to the lexicon, or even the reduction of all the lexemes to their simple patterns (principle of the empty dictionary), is compensated by the headlight role which we confer on the tokens (tool words), which we regard as true operators defined by minimal finite machines (automata and transducers). These elements, which are the most constraining, precisely coincide with the fixed elements (the morphological atoms, which do not have roots: for example inna, limādha, etc.). This coincidence has a simple explanation: the cardinality of their class of syntactic congruence² is very limited (often equal to one), contrary to those of the other lexemes, which can "commutate" with other elements belonging to the same "category" (or class of syntactic congruence) without calling into question the "grammaticality" of the sentence, nor its type (as for the relationship between syntactic congruence and the congruence induced by the patterns, refer to "Algorithmic Approach of Arab Grammar", first chapter "Morphological System and Syntactic Monoïd", to be published; a summary of the chapter is available on the theoretical website associated with this article, henceforth automatesarabes). An algebraic characterization of the tokens is given there: they are the "lexical" invariants of the projection of the language on its skeleton (in other words, invariants of the projection L → L/RAC).

¹ The authors thank Eva Saenz Diez for having read this text. Without her active participation, this work would not have been completed.
² A syntactic class of congruence consists of all the words that can permute in a sentence without questioning its grammaticality.

The term "token" was selected in reference to what designers of programming languages indicate by this term: symbols which it is necessary to regard as fixed (for example BEGIN, GOTO, LOOP, etc.), and which naturally induce particular "waitings" (expectations). Thus this approach was initially presented as a "grammar of waitings" ("grammaire des attentes"; see Audebert, Jaccarini, 1986); this name seems perfectly appropriate if one is situated at the level of didactics and of cognition, and we thus use it in all the contexts where it is not likely to produce any ambiguity.
The "tokens" can be considered, to some extent, like the "lexemes" of the quotient language. By choosing this term our first concern was simply to draw attention to the analogy with formal languages. The object L/RAC is a semi-formal language obtained, by projection, from a natural language. It presents the particularity of having a very limited lexicon. It is indeed this fact that seems to us the most characteristic of the Semitic system (strong grammaticalness/algorithmicity), of which the Arabic language is the richest current representative, the best known and especially the most spoken.
The assumption of the construction of a syntactic monitor, whose theoretical goal is the study and modeling of the interactions between syntax and morphology and whose practical finality is the short-circuiting of superfluous operations when locating the structure of the sentence, remains the long-term objective which will guide the continuation of this research.
Grammars not being static, but being regarded as particular points of view on the language, they can be seen as derived by transformation from a core which is itself not fixed. These points of view, i.e. these grammars, can be connected to each other. The question of their adequacy with respect to a given objective then arises: the grammar of an orthographic checker is not the same as that of a program for learning Arabic or of an information extractor.
The morpho-syntactic analysis of Arabic which we propose (see bibliography) constitutes a reflection and a general tool answering a whole set of applications. The development of this approach passes through a thorough study of the tokens or tool words. This study results in the constitution of morpho-syntactic grammars, conceived as operational bricks of grammars, in order to synthesize search procedures.
Such grammars built from automata and finite transducers can also be used to detect various types of sentences, for example conditional sentences, relations of causality or other discursive relations, or to extract reported speech. These topics are fundamental for information extraction: coupled with the search for collocations (non-fortuitous co-occurrences), in interaction with the morpho-syntactic analyzer, they are the source of many applications. We presented at MEDAR 09 (Audebert, Gaubert, Jaccarini, 2009) examples of constructions of such operational grammars, at the level of morphological and syntactic analysis or even at the level of information extraction (IR). As a simple illustrative example, we introduced an operator for the extraction of reported speech, built from finite machines (automata and transducers). This machine, which we made deterministic by a computation requiring several hours but carried out only once and whose result is then stored in memory once and for all (the final automaton contains nearly 10,000 states!), allows the extraction, in a reasonable period of time (which will be improved later on), of all the quotations from a text of almost 30,000 words with a success rate of almost 80%, if one however sets aside the silences and noises especially due to anomalies of punctuation and a lack of standardization, which it will be more or less possible to mitigate in the future. The success rate and running times show that the feasibility operation was a success (see the table reproduced on automatesarabes). An evaluation is thus provided, which shows that our formal, even algebraic, considerations are indeed necessary to make the theoretical framework coherent, without which one is likely to fall into a pragmatism which ends up becoming sterile. At the present time, we are very conscious of the disadvantages of ad hoc programming.
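To make the determinization step concrete, a minimal sketch of the classical subset construction is given below. It is written in plain Python purely for illustration: it is not the Sarfiyya code, and the toy alphabet and transitions are invented. The potential blow-up in the number of states is the reason such a computation is run once, offline, and its result stored.

    # Toy subset construction: determinize a small NFA.
    # Purely illustrative; not the authors' implementation.
    from itertools import chain

    def determinize(nfa, start, alphabet):
        """nfa maps (state, symbol) -> set of successor states."""
        start_set = frozenset([start])
        dfa, seen, todo = {}, {start_set}, [start_set]
        while todo:
            current = todo.pop()
            for sym in alphabet:
                nxt = frozenset(chain.from_iterable(
                    nfa.get((q, sym), ()) for q in current))
                dfa[(current, sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return dfa, seen

    # A 3-state NFA over {'a', 'b'}, invented for the example.
    nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
    dfa, states = determinize(nfa, 0, 'ab')
    print(len(states), "deterministic states")  # can grow exponentially in general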
2. Advantages of the representation of Arabic by finite machines

2.1 The transfer from structures to procedures (Arabic language <--> automata)
This transfer is carried out in both directions. It is indeed possible, on the theoretical level, to establish a formal link between the structural and algebraic properties by which we characterized the Arabic system, on the one hand, and the automata and transducers on which we based our algorithmic approach to Arabic, on the other hand. These automata and transducers constitute a class of machines equivalent to those of Turing, on which we nevertheless introduced restrictions and conditions of minimality related to a rigorous hierarchy.
The Arabic structural properties that we proposed relate to the commutation of the operators of "categorization" and "projection". "Categorization" amounts to building a quotient set made up of the "classes of syntactic congruence" on the free monoïd over the symbols of the initial vocabulary (the formal words, in the case of syntax), i.e. the "syntactic categories". The implied congruence relation formalizes the distributional principle of the linguists (Bloomfield): it expresses that two words can be regarded as equivalent if and only if they can commutate without affecting the grammaticalness of the sentence. Projection amounts to reducing the word to its pattern, which also induces a partition into classes of congruence: the concatenations (and, reciprocally, the segmentations) remain invariant even when the roots are changed. This last congruence is compatible with the preceding one in the sense that the syntactic congruence relation is the coarsest (the least fine) that can be defined on the monoïd built from the Arabic graphemes and that saturates the set constituted by the Arabic graphic words, which has as a corollary that any pattern, considered here as a given class of words and not as an operator generating them (duality structure-procedure), is necessarily located inside a syntactic category within the framework of the set of Arabic graphic words.

We placed the modeling of Arabic morpho-syntax within the general framework of a principle of invariance deriving from the previous Arabic morphographic property, which is obvious (invariance by change of root within the framework of the morphography), by generalizing it to syntax, namely:
1. to establish syntactic classes which are then partitioned into patterns,
2. or, on the contrary, to establish patterns which can then be gathered into syntactic classes.
The result is the same.
Let us suppose that Π and SC respectively indicate the canonical homomorphisms associated with the congruences considered above, i.e. the syntactic congruence and the one associated with the patterns (categorization and projection); then the principle of invariance can be expressed in an even more concise way, namely as the commutation of these two "operators":

Π·SC = SC·Π

Following this principle, we will thus categorize so that the construction of the grammar is not affected by the operation which consists in reducing the language to its paradigms only (patterns + tokens). The possibility of building a computer program functioning without a lexicon is only one consequence of the above-mentioned property, according to which it should be indifferent whether one first categorizes and then projects, or vice versa.
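As a concrete, deliberately tiny illustration of the distributional definition used above (two words are congruent when they can replace one another in every context without affecting membership in the language), the sketch below groups the words of a finite toy language by their sets of attested contexts. The vocabulary and the four schematic "sentences" are invented transliterations chosen only for the example; restricting attention to contexts observed in the toy language is a crude approximation of the full syntactic congruence and is not the authors' construction.

    # Group words of a finite toy language by their attested contexts.
    # Toy data; illustration of the distributional principle only.
    LANG = {
        ("inna", "zayd", "karim"),
        ("inna", "amr", "karim"),
        ("zayd", "karim"),
        ("amr", "karim"),
    }
    VOCAB = sorted({w for sent in LANG for w in sent})

    def contexts(word):
        """All (prefix, suffix) pairs in which the word occurs in LANG."""
        ctx = set()
        for sent in LANG:
            for i, w in enumerate(sent):
                if w == word:
                    ctx.add((sent[:i], sent[i + 1:]))
        return frozenset(ctx)

    classes = {}
    for w in VOCAB:
        classes.setdefault(contexts(w), []).append(w)
    for members in classes.values():
        print(members)   # ['amr', 'zayd'] group together; 'inna' stays alone

On this toy data the two "root" words fall into one class while the tool word inna stays in a class of its own, which mirrors the observation above that tokens have classes of syntactic congruence of very limited cardinality.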
In addition, the definition of the automata, like that of the Turing machines, can appear somewhat contingent; yet it is quite astonishing that such a rough mechanism can represent all the calculations that can be made on a machine, and even that it can (theorem of universal enumeration) simulate any computer or set of computers (including the Internet!). The automata with two stacks of memory (we do not need a larger number of them, which is itself a remarkable property) are equivalent to these machines. These automata are founded on those of lesser complexity, without stacks of memory: the finite-state machines, whose definition can cause the same feeling of "uneasiness" mentioned above (while talking about the Turing machines) and at the same time amazement, due to the fact that such an elementary mechanism can generate such complex configurations.
The adoption of a more abstract, algebraic viewpoint allows us at the same time
1. to avoid this uneasiness of contingency,
2. to give a meaning to the extension of the principle of invariance from the linguistic level to the data-processing level, and thus to unify the theoretical framework while offering extremely interesting practical prospects.
Indeed, calculating the transition monoïd M(L) of the language L amounts to building the minimal deterministic automaton directly accepting this language. One will find on the automatesarabes website the development of an example of syntax with this type of calculation (taken from "Linguistic Modeling and Automata Theory"; see automatesarabes). This illustration offers a theoretical interest (reducing a possibly infinite set of sentences to a finite number of configurations) as well as a practical one (the "automatic" construction of the corresponding minimal deterministic automaton).
The automaton corresponding to the study of David Cohen (Cohen, 1970) will be rebuilt using this same method (which leads to the constitution of an automaton of 13 states and 102 transitions) while following an "entirely automated", or rather "automatable", chain, if we may say so.
Any sequence of a language can indeed be regarded as a mapping of an initial segment of N into itself, and saying that a language is recognizable by a finite-state automaton is in fact equivalent to defining a congruence on this language whose set of classes is finite.
The theorems which explicitly establish the links between the concepts of syntactic monoïd, congruence and the traditional concept of automaton, as we use them for our analysis of Arabic, also appear on automatesarabes.
In conclusion, the syntactic monoïd, with which is associated a minimal deterministic automaton able to recognize this language, can be produced thanks to a transducer³. This transition (= syntactic) monoïd can be obtained automatically.

2.2 Automatic vocalisation and transduction
This second point deserves to be isolated, given its importance. The standard writing of Arabic is a shorthand. The short vowels are not noted, which has as a natural consequence a considerable increase in ambiguity and in the difficulty of reading. Moreover, cases are often marked by short vowels, if they are singular, and their "calculation" is not always easy⁴.

³ We had programmed it in Lisp; the question of its restoration is posed today in terms of opportunity, utility, calendar, working time, etc. For the moment this task is not a priority. It is also possible to enhance this (minimal) transducer in order to determine the basic relations which, associated with the generators, define the transition monoïd (isomorphic with the syntactic monoïd). It can indeed be interesting to have the possibility of defining an infinite language (determined or undetermined nominal group, conditionals, etc.) by a small set of equalities relating to this language and of limited length, rather than by rewriting rules. For example, in the example evoked on the site (a small subset of the determined nominal group), in order to check that an unspecified sequence belongs to this language, it is enough to examine its sub-sequences of length 3.
⁴ The great grammarian Sībawayh quotes an irreducible example of ambiguity, very much like a joke, which also has the merit of drawing attention to the fact that in literary Arabic the place of the words in the sentence is relatively free, this freedom being compensated by a richer morpho-casual "marking". The example is the following: Akala (ate) IssA (Jesus) MoussA (Moses); one cannot know whether it is Jesus who ate Moses or the reverse, given the phonological incompatibility of the mark of the direct case (short vowel u) with the last phoneme of IssA or MoussA. The ambiguity naturally remains the same under permutation of IssA and MoussA (the mark of the indirect case being short a).

Our purpose here is not to discuss the relevance of this written form but to note the phenomenon while trying to measure its consequences in terms of ambiguity, and to provide objective arguments for the supporters of the two camps: on the one hand the "Arabists", whose systems of transliteration, which they use in general, leave no room for error (short vowels having the same status as the full consonants or the long vowels, of which there are only three); on the other hand the usual system used by Arabs, which leaves a certain latitude⁵ and which seems to suggest (a fact corroborated by outside experiments, but which remains to be examined more deeply) that reading without diacritics makes it possible, in a first pass, to perceive the overall meaning of the sentence, while a finer syntactic analysis, implying backtracking, allows the ambiguity to be resolved in a second pass. Nevertheless these assumptions must be evaluated. The system of transduction based on underlying automata, with which we can associate the "semantic attributes" of Knuth grammars (see automatesarabes), or "syntax-directed schemes" (Aho, Sethi and Ullman, 1986), associated with "synthesized" and "inherited" attributes, is particularly well adapted for this task (a dominant linear flow "blocked", nevertheless, by "backtracking", which can present, in the most dramatic cases, vicious circles (deadlocks), i.e. impossibilities of vocalization (irreducible ambiguities, which are cases to be studied for themselves)). The synthesized attributes are values that are propagated from the bottom to the top of the tree representing the structure of the sentence (it is said that one decorates the tree, or even that one associates with it a "semantic" tree), and the inherited attributes are those which are propagated from the top downwards. Transposed to the reading flow, this means that there exist values (here we are interested in the short vowels) which "are synthesized" progressively as the reading head advances, whereas certain lexemes can only acquire their final value (vocalization, meaning, ...) by a retroactive effect, once the complete reading of the sentence has been accomplished. Knuth studied the cases of vicious circles and developed (1968) an algorithm to avoid them. In the case of impossibility, one finds oneself in the case well known to the data-processing specialist, the "deadlock", which occurs when two processes are each waiting for the other. It is an intrinsic ambiguity⁶.
In "Algorithmic Approach of the Arabic Grammar" (see automatesarabes), we have presented an ambiguous morphological transducer functioning word by word (vowels depending on the (linguistic) case are not taken into account, since the connection with syntax was not implemented⁷). Coefficients of ambiguity vary from 1 (in a significant number of cases⁸) up to 12.
It is obvious that a connection with syntax is necessary not only to lower the level of ambiguity but also to be able to vocalize the ends of the words.
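A word-by-word, lexicon-free vocalizer of the kind described here can be pictured with the following toy sketch, which proposes vocalizations for an unvocalized consonantal skeleton by pouring it into a handful of patterns. The pattern inventory, the transliteration and the resulting candidate counts are invented for the illustration; the transducer presented by the authors works letter by letter over real Arabic graphemes with a much richer pattern system.

    # Toy word-level vocalizer: propose vowel patterns for a triliteral skeleton.
    # Pattern inventory and transliteration are invented; illustration only.
    PATTERNS = [
        "C1aC2aC3a",   # a "kataba"-like perfective shape
        "C1uC2iC3a",   # a "kutiba"-like passive shape
        "C1aC2C3",     # a bare nominal/imperative-like shape
        "C1iC2aC3",    # a "kitab"-like nominal shape
    ]

    def vocalize(skeleton):
        """Return every candidate obtained by filling each pattern with the root."""
        if len(skeleton) != 3:          # toy restriction to triliteral roots
            return []
        c1, c2, c3 = skeleton
        return [p.replace("C1", c1).replace("C2", c2).replace("C3", c3)
                for p in PATTERNS]

    candidates = vocalize("ktb")
    print(len(candidates), candidates)  # the ambiguity coefficient of this form

The length of the candidate list plays the role of the ambiguity coefficient discussed above; connecting such a component to syntax is precisely what prunes the list and fixes the case endings.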
Such tools, which are to be reprogrammed, can already have extremely interesting applications. The writer of an Arabic text can be informed in real time of the level of ambiguity of the form being entered and can be offered a certain number of (total or partial) solutions to reduce the ambiguity according to the level of the user (tutorial), simply by clicking. The fundamental technology already exists; all the rest is only a question of ergonomics and interface, which in this field is fundamental. It goes without saying that it would be a tool that could be improved and made to evolve by the introduction of syntax but also by training.
The conceptual tool (the interactive vocalisation transducer) would obviously be of greater interest for answering the question asked at the beginning of this paragraph, namely to try to measure, or rather to discuss scientifically, the relevance of the two viewpoints: respectively the Arabists' one and the Arabs' conception, to put it concisely.
It would have been difficult to discuss this question of relevance scientifically if one had not had recourse to transducers functioning letter by letter and interacting with the highest-level "director" automata: the syntactic automata.

2.3 Transparency of the analyzers
The transparency of the analyzers, which can be entirely specified mathematically, offers essential advantages that we will only mention here: the possibility of providing proofs of programs as well as measurements of complexity and, last but not least, the possibility of establishing relevant similarities with the natural process of apprehension (cognitive-algorithmic parallelism).

⁵ This record of the use of standard writing by the Arabs for centuries, as well as the organization of their dictionaries, makes us naturally think that they perceived (and continue to perceive) the consonants as elements of a "skeleton" which is the principal "support" of the meaning (more specifically the radical consonants, the others being able to belong to a pattern, inevitably discontinuous if we take into account the short vowels which inevitably intervene there and which can never be radical; the pattern in its entirety, which is a non-concatenative form, being alone (with the root) likely to carry one or more semantic values). In this remark we stay within the framework of so-called sound morphology. We state only facts and we voluntarily keep away from the problem of lexical "solidification" (fixation).
⁶ This question also arises for the "syntactic monitor" which is supposed to optimize the morphological analysis, where we must consider the extreme case where both the morphological and the syntactic processors are waiting for each other (irreducible ambiguity).
⁷ Some results will be available on the automatesarabes site before the publication of the book.
⁸ However, no statistics were drawn up; it was a feasibility study.

3. Coming back to tokens: from syntax to semantics
Study of the syntactic operators and grammar of syntactic waitings.
1. The pivot of this research is the study, made easier by an advanced version of the tool Sarfiyya, of the grammatical waitings of all the tokens (tool words of Arabic), whose location is already solved. For example, the operator inna implies the presence of structures having precise grammatical functions (topic, predicate) that are recognizable by the machine. On the other hand, prepositions (ʿalā, ʿinda) have a more reduced range but can possibly combine with high-level tokens: a hierarchy is established between families of operators. It is necessary to formalize the syntactic behaviors and their local and global implications.
This research was started on a corpus and remains to be undertaken. It is essential for the study of the syntax of Arabic and, although outlined, it has to be taken up once again. The number of tokens amounts to approximately 300; they pose problems of location dealt with according to a certain methodology and raise, by definition, questions concerning syntax, whose modeling must be taken into account.
2. This study will be coupled with that of the linguistic markers of certain discursive relations. This work consists in creating a base of the most elementary possible automata (or transducers), so that their combinations can allow the synthesis of new information-retrieval functionalities. A first demonstration of the effectiveness of this method was provided (MEDAR 09). The progressive refinement of the filters and the reduction of the noise were obtained according to a precise experimental method, consisting in retroacting on the initial grammar according to the result provided by the machine. This feedback method (a continual coming and going between theoretical modeling and implementation) naturally supposes a work of evaluation of the grammars.
However, there exist several manners of assigning a value to a grammar, according to the standard selected, which varies according to the required application. The standard allows a value to be assigned to the grammar starting from fixed criteria. A criterion can be essential for a given application but not very relevant for another (for example the non-ambiguous extraction of the root is of only little interest if the objective is simple spellchecking). The data of the standard makes it possible to privilege, according to one's needs, certain criteria among others and thus induces a hierarchy.
Inheriting its code from Sarfiyya with some enhancements for collaborative work, the web-based application Kawâkib and its latest version Kawâkib Pro (fig. 1) are the tools we use for now to collect linguistic data connected with tool words, to parse pieces of corpus with automata and to perform measurements in this regard. It also includes tools for root searches, frequency reports, etc.
A detailed evaluation of the extractor of quotations in journalistic texts (which is extremely encouraging) will be found on automatesarabes. This experiment constitutes a primer of the feedback pump announced at MEDAR 09.

4. References
Automatesarabes: http://automatesarabes.net.
Aho, Sethi, Ullman (1986). Compilateurs. Principes, techniques et outils. French edition 1991. InterEditions.
Audebert C., Jaccarini A. (1986). À la recherche du Ḫabar, outils en vue de l'établissement d'un programme d'enseignement assisté par ordinateur. Annales islamologiques 22, Institut français d'archéologie orientale du Caire.
Audebert C., Jaccarini A. (1988). De la reconnaissance des mots outils et des tokens. Annales islamologiques 24, Institut français d'archéologie orientale du Caire.
Audebert C., Jaccarini A. (1994). Méthode de variation de la grammaire et algorithme morphologique. Bulletin d'études orientales XLVI, Damascus.
Audebert, Gaubert, Jaccarini (2009). Minimal Resources for Arabic Parsing: an Interactive Method for the Construction of Evolutive Automata. MEDAR 09. (http://www.elda.org/medar-conference/summaries/37.html)
Audebert (2010). Quelques réflexions sur la fréquence et la distribution des mots outils ou tokens dans les textes arabes en vue de leur caractérisation dans le cadre de l'extraction d'information. Annales islamologiques 43, Institut français d'archéologie orientale du Caire.
Beesley, Kenneth R. (1996). Arabic Finite-State Morphological Analysis and Generation. COLING.
Cohen, D. (1970). Essai d'une analyse automatique de l'arabe. In: David Cohen, Études de linguistique sémitique et arabe. Paris: Mouton, pp. 49-78.
Gaubert Chr. (2001). Stratégies et règles pour un traitement automatique minimal de l'arabe. Thèse de doctorat, Département d'arabe, Université d'Aix-en-Provence.
Gaubert (2010). Kawâkib, une application web pour le traitement automatique de textes arabes. Annales islamologiques 43, Institut français d'archéologie orientale du Caire.
Jaccarini A. (1997). Grammaires modulaires de l'arabe. Thèse de doctorat, Université de Paris-Sorbonne.
Jaccarini (2010). De l'intérêt de représenter la grammaire de l'arabe sous la forme d'une structure de machines finies. Annales islamologiques 43, Institut français d'archéologie orientale du Caire.
Koskenniemi K. (1983). Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki.

fig. 1 : The Kawâkib Pro web-based application

Developing and Evaluating an Arabic Statistical Parser

Ibrahim Zaghloul
Central Lab for Agricultural Expert Systems, Agricultural Research Center,
Ministry of Agriculture and Land Reclamation
ibrahimz@claes.sci.eg

Ahmed Rafea
Computer Science and Engineering Dept., American University in Cairo
rafea@aucegypt.edu

Abstract
This paper describes the development of an Arabic statistical parser using an Arabic Treebank and a statistical parsing engine. The different steps followed to develop and test the parser are described. We divided the LDC2005T20 Arabic Treebank into training and testing sets. 90% of the Treebank was used to train the Bikel parser package, while 10% of it was randomly selected to test the developed parser. The testing data set annotations were removed to convert it into pure text to be fed to the trained parser. The gold testing data set was prepared by mapping its tags to the tags produced by the trained parser. This mapping was necessary to evaluate the parser results using a standard evaluation tool. The metrics widely applied for parser evaluation were computed for the developed parser's results. The F-measure of the developed parser was 83.66%, which is comparable to the evaluation results of well-known English parsers.

1. Introduction
In this paper we present the steps followed to develop and evaluate an Arabic parser using the Dan Bikel multilingual parsing package (http://www.cis.upenn.edu/software.html#stat-parser) and the LDC2005T20 Arabic Treebank. The results of testing the parser are presented for sentences of different lengths.
Parsing is the task of identifying one or more tree structures for a given sequence of words (Bikel, 2004). Compared with rule-based parsers, which use hand-crafted grammars, statistical parsers increase accuracy and tend to exhibit greater robustness in dealing with unusual utterances, which would cause a more strictly rule-based parser to fail. They also have the advantage of being easier to build and to customize (Venable, 2003).
Treebank statistical parsers induce their grammar and probabilities from a hand-parsed corpus (Treebank). If it is required to have a parser that produces trees in the Treebank style for all sentences thrown at it, then parsers induced from Treebank data are currently the best (Charniak, 1997).
Creating a Treebank is a staggering task, and there are not many to choose from. Thus the variety of parsers generated by such systems is limited. At the same time, one of the positive effects of creating Treebanks is that several systems now exist to induce parsers from this data, and it is possible to make detailed comparisons of these systems (Charniak, 1997). Also, the availability of large, syntactically bracketed corpora such as the Penn Treebank afforded the opportunity to automatically build or train broad-coverage grammars (Sekine and Grishman, 1995).
Statistical parsers work by assigning probabilities to the possible parses of a sentence, locating the most probable parse, and then presenting that parse as the answer (Charniak, 1997). The probability of each candidate tree is calculated as a product of terms, each term corresponding to some sub-tree within the tree (Collins, 1999); a toy illustration of this scoring appears after the list below.
In general, to construct a statistical parser one must figure out how to:
a. Train the parser to construct the grammar rules and their probabilities.
b. Find possible parses for new sentences.
c. Assign probabilities to these new sentences.
d. Pull out the most probable parse for each sentence (Charniak, 1997).
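As promised above, here is a minimal product-of-terms scorer for a toy probabilistic context-free grammar. The grammar, the probabilities and the Buckwalter-style tokens are all invented for the illustration, and real lexicalized models such as Collins's or Bikel's condition each term on head words as well, so this is only the bare skeleton of the idea.

    # Toy PCFG: score a tree as the product of the probabilities of its rules.
    # Grammar, probabilities and tokens are invented; lexicalized models
    # used in practice condition these terms on head words too.
    RULE_PROB = {
        ("S", ("NP", "VP")): 0.9,
        ("NP", ("Altlmy*",)): 0.3,
        ("NP", ("ktAb",)): 0.2,
        ("VP", ("V", "NP")): 0.5,
        ("V", ("qrA",)): 0.1,
    }

    def tree_prob(tree):
        """tree = (label, [children]); leaves are plain strings."""
        label, children = tree
        kids = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = RULE_PROB[(label, kids)]
        for child in children:
            if not isinstance(child, str):
                p *= tree_prob(child)
        return p

    t = ("S", [("NP", ["Altlmy*"]),
               ("VP", [("V", ["qrA"]), ("NP", ["ktAb"])])])
    print(tree_prob(t))   # 0.9 * 0.3 * 0.5 * 0.1 * 0.2 = 0.0027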
Applications that potentially benefit from syntactic parsing include corpus analysis, question answering, natural-language command execution, rule-based automatic translation, and summarization (Venable, 2003).
In our work we used the Dan Bikel multilingual parsing engine. The Bikel parser is the only parsing engine we found that considers Arabic. It contains some customizations of the general features, which Bikel calls a 'Language Package', to fit the Arabic language.
The motivation behind this work was the need for an Arabic parser in many applications like machine translation, text summarization, and others.
The objective of the work presented in this paper was to develop a statistical Arabic parser using a Treebank and a parsing engine, and to evaluate the performance of the developed parser.
Section 2 reviews related work in the statistical parsing area. In section 3, the Arabic parser development steps are described. In section 4, the evaluation methodology is explained. In section 5, the results of parser testing are shown and discussed.

2. Related work
A lot of work has been done in the statistical parsing area. Most of the work concentrated on parsing English as the main language, paying little or no attention to other languages. The following subsections summarize statistical parsers developed for English.

2.1 Apple Pie Parser
Apple Pie (Sekine and Grishman, 1995) extracts a grammar from Penn Treebank (PTB) v.2. The rules extracted from the PTB have S or NP on the left-hand side and a flat structure on the right-hand side. The parser is a chart parser. The parser model is simple, but it cannot handle sentences over 40 words. This parser gave 43.71% Labeled Precision, 44.29% Labeled Recall, and 90.26% Tagging accuracy when tested on section 23 of the Wall Street Journal (WSJ) Treebank.

2.2 Charniak's Parser
Charniak presents a parser based on probabilities gathered from the WSJ part of the PTB (Charniak, 1997). It extracts the grammar and probabilities and, with a standard context-free chart-parsing mechanism, generates a set of possible parses for each sentence, retaining the one with the highest probability. The probabilities of an entire tree are computed bottom-up.
In (Charniak, 2000), he proposed a generative model based on a Markov grammar. It uses a standard bottom-up, best-first probabilistic parser to first generate possible parses before ranking them with a probabilistic model. This parser gave 84.35% Labeled Precision, 88.28% Labeled Recall, and 92.58% Tagging accuracy when tested on section 23 of the WSJ Treebank.

2.3 Collins's Parser
Collins's statistical parser (Collins, 1996; Collins, 1997) is based on the probabilities between head-words in parse trees. Collins defines a mapping from parse trees to sets of dependencies, on which he defines his statistical model. A set of rules defines a head-child for each node in the tree. The parser is a CYK-style dynamic programming chart parser. This parser gave 84.97% Labeled Precision, 87.3% Labeled Recall, and 93.24% Tagging accuracy when tested on section 23 of the WSJ Treebank.

2.4 Bikel Parser
Bikel based his parser on Collins model 2 (Collins, 1999) with some additional improvements and features in the parsing engine, such as layers of abstraction and encapsulation for quickly extending the engine to different languages and/or Treebank annotation styles, "plug-'n'-play" probability structures, a flexible constrained-parsing facility, and multithreading for use in a multiprocessor and/or multihost environment.

2.5 Stanford Parser
The Stanford Parser is an unlexicalized parser (it does not use lexical information) which rivals state-of-the-art lexicalized ones (Klein and Manning, 2003). It uses a context-free grammar with state splits; the parsing algorithm is simpler and the grammar is smaller. It uses a CKY chart parser which exhaustively generates all possible parses for a sentence before it selects the highest-probability tree. This parser gave 84.41% Labeled Precision, 87% Labeled Recall, and 95.05% Tagging accuracy when tested on section 23 of the WSJ Treebank (Hempelmann et al., 2005).

3. Arabic Statistical Parser Development
This section describes the steps for generating the Arabic probabilistic grammar from an Arabic treebank. The first subsection describes the Treebank used, the second shows how we divided this Treebank into training and testing parts, and the third describes the generation of the probabilistic grammar.

3.1 Arabic Treebank
The Arabic Treebank we used is LDC2005T20. The Treebank contains 12653 parsed Arabic sentences distributed among 600 text files representing 600 stories from the An Nahar News Agency. This corpus is also referred to as ANNAHAR. The sentence length distributions in the Treebank are shown in Table (1).

Length (words)   Number of sentences
1 - 20           4046
21 - 30          2541
31 - 40          2121
41 - 50          1481
51 - 60          942
61 - 100         1257
over 100         265
Table (1): Sentence length (in words) distribution in the Arabic Treebank.

3.2 Division of Treebank
The gold standard testing set size was selected to be 10% of the Treebank size, which is approximately 1200 sentences; the remaining sentences were left for training. The complete description of the selection of the gold standard set is as follows (a sketch of the split appears after the list):
- We first grouped all the Treebank files into one file containing all sentences.
- Then, we used a methodology to avoid bias in the test sentence selection. The methodology was to select one sentence from every 10 sentences; that is, we traverse the Treebank and pick a sentence after counting 9 sentences. This means that the selected sample is distributed over the whole Treebank.
- The selected sentences are put in a separate gold file and all unselected sentences are put in a separate training file. After completing this step we have two files: the gold data set file and the training data set file.
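A minimal sketch of this split, assuming one bracketed tree per line in the merged file (the file names and the one-tree-per-line layout are assumptions for the illustration, not the actual layout of LDC2005T20):

    # Split a merged treebank into gold (every 10th tree) and training files.
    # Assumes one bracketed tree per line; hypothetical file names.
    def split_treebank(merged_path, gold_path, train_path):
        with open(merged_path, encoding="utf-8") as merged, \
             open(gold_path, "w", encoding="utf-8") as gold, \
             open(train_path, "w", encoding="utf-8") as train:
            for i, tree in enumerate(merged, start=1):
                # every 10th sentence goes to the gold (test) file
                (gold if i % 10 == 0 else train).write(tree)

    split_treebank("annahar_all.tree", "gold.tree", "train.tree")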

3.3 Parser Training
The training data set, which contains approximately 11400 sentences, is fed to the Bikel parsing package to generate the Arabic probabilistic grammar. This grammar is used by the parser included in the parsing package to generate the parse tree for an input sentence.

4. Evaluation Methodology
The Arabic statistical parser is evaluated following these steps:
1. Select the evaluation tool.
2. Extract the test data set (remove annotations to obtain pure text) from the gold data.
3. Prepare the test data for parsing.
4. Run the parser on the extracted test data.
5. Pre-process the gold data set to meet the requirements of the evaluation tool.
The following subsections describe each of the above-mentioned steps in some detail.

4.1 Evaluation Tool and Output Description
The evaluation tool "Evalb" (http://nlp.cs.nyu.edu/evalb/) was used to evaluate the parser output. It was written by Michael John Collins (University of Pennsylvania) to report the values of the evaluation metrics for given parsed data.
The description of the outputs of Evalb is as follows (the bracketing formulas are restated compactly after the list):
1) Number of sentences: the total number of sentences in the test set.
2) Number of valid sentences: the number of sentences that are successfully parsed.
3) Bracketing Recall: the number of correct constituents divided by the number of constituents in the gold file.
4) Bracketing Precision: the number of correct constituents divided by the number of constituents in the parsed file.
5) Bracketing F-measure: the harmonic mean of Precision and Recall, i.e. FMeasure = 2 × (Precision × Recall) / (Precision + Recall).
6) Complete Match: the percentage of sentences for which recall and precision are both 100%.
7) Average Crossing: the number of constituents crossing a gold-file constituent divided by the number of sentences.
8) No Crossing: the percentage of sentences which have zero crossing brackets.
9) 2 or less crossing: the percentage of sentences which have two or fewer crossing brackets.
10) Tagging Accuracy: the percentage of correct POS tags.
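In formula form: with C the number of correct constituents, G the number of constituents in the gold file and T the number in the parsed (test) file, Recall = C/G, Precision = C/T, and FMeasure = 2 × Precision × Recall / (Precision + Recall). The simplified sketch below computes these from labeled constituent spans; the real Evalb additionally applies the rules in its parameter file (labels to equate, punctuation to ignore, length cut-offs), so this is an approximation, not a reimplementation.

    # Simplified bracketing scores from labeled spans (label, start, end).
    # Evalb applies extra parameter-file rules; this is only a sketch.
    from collections import Counter

    def bracket_scores(gold_spans, test_spans):
        gold, test = Counter(gold_spans), Counter(test_spans)
        correct = sum((gold & test).values())      # matched constituents
        recall = correct / sum(gold.values())
        precision = correct / sum(test.values())
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    gold = [("S", 0, 8), ("PP", 0, 2), ("NP", 3, 4)]
    test = [("S", 0, 8), ("PP", 0, 2), ("NP", 3, 5)]
    print(bracket_scores(gold, test))   # roughly (0.667, 0.667, 0.667)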

4.2 Extracting the Test Data
A tool was developed to extract the test sentences from the gold standard set. This tool takes the gold data file and extracts the words only, so the output is a file containing the sentence words without any annotations, which is then given to the parser. Each sentence is processed separately by reading the tokens and extracting the word from each token, ignoring any additional annotations or characters.

4.3 Preparing the Test Data for Parsing
The test sentences have to be put in a suitable form for parsing. The Bikel parser accepts the input sentence in one of two formats:
1. (word1 word2 word3 ... wordn)
2. (word1(pos1) word2(pos2) word3(pos3) ... wordn(posn))
We put all the test file sentences in the format that allows the parser to do its own part-of-speech tagging, which is the first format; a sketch of this extraction and formatting step follows.
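A minimal sketch of the extraction and formatting step, assuming the gold trees are bracketed strings in which every terminal appears as the second element of an innermost "(TAG word)" pair; the example tree is invented, and real ATB files contain additional conventions (e.g. escaped brackets) that a production tool would have to handle.

    # Strip a bracketed gold tree down to its words and emit the parser's
    # first input format: (word1 word2 ... wordn). Sketch only.
    import re

    LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")   # innermost (TAG word)

    def tree_to_input(tree_line):
        words = [word for _tag, word in LEAF.findall(tree_line)]
        return "(" + " ".join(words) + ")"

    tree = "(S (NP-SBJ (NOUN+CASE_DEF_NOM ktAb)) (ADJP (ADJ jdyd)))"
    print(tree_to_input(tree))          # (ktAb jdyd)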
4.4 Running the Parser
The parser was run over the 1200 test sentences using the training outputs and the parameters file for the Arabic parser.
The parameter "pruneFactor", described below, was set to 2 instead of the default value 4 in order to increase the parsing speed. This change was made because the default value did not work well for Arabic, leading to effectively unbounded parsing time for long sentences.
The total parsing time for the test set was about 25 minutes on a machine with a 3 GHz processor and 8 GB of RAM.
pruneFactor is a property in the parameter file by which the parser prunes away chart entries which have low probability. The smaller the pruneFactor value, the faster the parsing.

4.5 Processing the Gold Data
The gold standard set is processed to be in the format used by the evaluation tool. The reason for this processing is that the Arabic Treebank annotation style was found to be different from the parser annotation style. In the Treebank we had, the part-of-speech tags used are the morphological Arabic tags, but in the Bikel parser output the tags come from the original Penn Treebank tag set.
The following example shows the sentence:
"fy AlsyAsp , AlAHtmAlAt kvyrp w AlHqA}q mEqdp ."
(In politics, there are many possibilities and the facts are complex.)
As represented in the LDC Treebank:

(S (S (PP (PREP fy) (NP (DET+NOUN+NSUFF_FEM_SG+CASE_DEF_GEN AlsyAsp))) (PUNC ,) (NP-SBJ (DET+NOUN+NSUFF_FEM_PL+CASE_DEF_NOM AlAHtmAlAt)) (ADJP-PRD (ADJ+NSUFF_FEM_SG+CASE_INDEF_NOM kvyrp))) (CONJ w) (S (NP-SBJ (DET+NOUN+CASE_DEF_NOM AlHqA}q)) (ADJP-PRD (ADJ+NSUFF_FEM_SG+CASE_INDEF_NOM mEqdp))) (PUNC .))

When this sentence is parsed using the Bikel parser, the following annotated sentence is produced:

(S (S (PP (IN fy) (NP (NN AlsyAsp))) (, ,) (NP (NNS AlAHtmAlAt)) (ADJP (JJ kvyrp))) (CC w) (S (NP (NN AlHqA}q)) (ADJP (JJ mEqdp))) (PUNC .))

In the Bikel parser training phase, the LDC tags are converted into the Bikel tags using the "training-metadata.lisp" file. Unfortunately this conversion is part of the grammar generation code in the Bikel package. Consequently we had to develop a separate program that converts LDC tags into Bikel tags in order to test the parser. The output of this process is the gold file that makes it possible to evaluate the output of the Bikel parser, run on the test data, against this gold file.
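That separate conversion program can be pictured as a table-driven substitution over the pre-terminals of the gold trees. The mapping table below is a small invented excerpt, consistent with the example above, rather than the contents of training-metadata.lisp; the real mapping covers the full LDC morphological tag set.

    # Map LDC morphological tags in a gold tree onto PTB-style tags, so the
    # gold file matches the Bikel parser's output tag set. Toy excerpt only.
    import re

    TAG_MAP = {   # invented excerpt; the real table is far larger
        "PREP": "IN",
        "DET+NOUN+NSUFF_FEM_SG+CASE_DEF_GEN": "NN",
        "DET+NOUN+NSUFF_FEM_PL+CASE_DEF_NOM": "NNS",
        "DET+NOUN+CASE_DEF_NOM": "NN",
        "ADJ+NSUFF_FEM_SG+CASE_INDEF_NOM": "JJ",
        "CONJ": "CC",
    }

    def convert_tags(tree_line):
        def repl(match):
            tag, word = match.group(1), match.group(2)
            return "(%s %s)" % (TAG_MAP.get(tag, tag), word)
        return re.sub(r"\(([^\s()]+)\s+([^\s()]+)\)", repl, tree_line)

    print(convert_tags("(PP (PREP fy) (NP (DET+NOUN+NSUFF_FEM_SG+CASE_DEF_GEN AlsyAsp)))"))
    # (PP (IN fy) (NP (NN AlsyAsp)))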
5. Results and Analysis
We applied the evaluation tool on the whole test set with no length restriction to test the overall quality, and then we repeated the evaluation to see how the metric values change for different sentence lengths. We examined the parser outputs, trying to analyze the reasons for the drop in accuracy in some metrics at different sentence lengths.

5.1 Results
Applying the evaluation tool on the whole test set with no length restriction produces the following results:

Metric                      Value
Number of sentences         1200
Number of valid sentences   1200
Bracketing Recall           82.74
Bracketing Precision        84.60
Bracketing FMeasure         83.66
Complete match              18.92
Average crossing            2.92
No crossing                 44.67
2 or less crossing          65.58
Tagging accuracy            99.11
Table (2): Evalb output for the whole test set.

We also show how the metric values change, up or down, for different sentence lengths. The results for sentences of at most 100, 60, 40 and 10 words are shown in Table (3).

Metric                 <=100   <=60    <=40    <=10
Number of sentences    1180    1089    888     197
Bracketing Recall      83.24   83.49   83.80   81.29
Bracketing Precision   85.07   85.15   85.44   77.31
Bracketing FMeasure    84.14   84.31   84.61   79.25
Complete match         19.24   20.75   25.00   45.69
Average crossing       2.63    2.20    1.63    0.20
No crossing            45.42   48.58   55.29   87.82
Tagging accuracy       99.10   99.02   98.81   97.25
Table (3): Evalb outputs for the different sentence lengths.

5.2 Analysis of the Results
The best accuracy of the parser appears for sentences in the "at most forty words" category, as it has the highest F-measure value.
Some metric values drop in the "at most ten words" category, such as Recall, Precision, F-measure and Tagging accuracy, but the Complete match and No crossing metrics go up for this category.
These values go down because sentences of fewer than ten words are more sensitive to any error: the accuracy for a sentence of 5 words will be 80% with one wrong bracket or tag, whereas the accuracy will be 87.5% for a sentence of 40 words with 5 wrong brackets or tags.
On the other hand, the chance of a complete match increases for shorter sentences because they have a smaller number of brackets.

6. Conclusion
The results we obtained show that the Arabic parser we built gives results comparable to those obtained for English. The best Labeled Precision of an English parser was 84.97%, obtained by the Collins parser, while the labeled precision using the Bikel parser adapted to Arabic was 84.6%. The best labeled recall of an English parser was 88.28%, obtained by Charniak, while the labeled recall using the Bikel parser adapted to Arabic was 82.74%. The best tagging accuracy of an English parser was 95.05%, obtained by the Stanford parser, while the tagging accuracy using the Bikel parser adapted to Arabic was 99.11%. It should be noted that all the English results were obtained using sections 02-21 of the WSJ part of the English Treebank for training (39,832 sentences) and section 23 of the WSJ for testing (2416 sentences), whereas in our case we ran the experiment on 1200 test sentences only.

10/119
7. References
Brian Roark and Richard Sproat. 2006. Computational Approaches to Morphology and Syntax. Oxford University Press.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st NAACL, pages 132-139, Seattle, Washington.
Eugene Charniak. 1996. Tree-bank grammars. Technical Report CS-96-02, Department of Computer Science, Brown University.
Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence, pages 598-603, Providence, RI.
Eugene Charniak. 1997. Statistical techniques for natural language parsing. AI Magazine, 18(4):33-43.
D. Klein and C. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423-430, Sapporo, Japan.
Daniel M. Bikel. 2002. Design of a multi-lingual, parallel-processing statistical parsing engine. In Proceedings of HLT 2002, San Diego, CA.
Daniel M. Bikel. 2004. On the parameter space of generative lexicalized statistical parsing models. Ph.D. thesis, University of Pennsylvania.
Christian F. Hempelmann, Vasile Rus, Arthur C. Graesser, and Danielle S. McNamara. 2005. Evaluating state-of-the-art Treebank-style parsers for Coh-Metrix and other learning technology environments. In Proceedings of the Second ACL Workshop on Building Educational Applications Using NLP, pages 69-76, Ann Arbor, Michigan.
Michael John Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Michael John Collins. 1997. Three generative lexicalized models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23.
Peter Venable. 2003. Modeling Syntax for Parsing and Translation. Ph.D. thesis, Carnegie Mellon University.
S. Sekine and R. Grishman. 1995. A corpus-based probabilistic grammar with only two non-terminals. In Proceedings of the International Workshop on Parsing Technologies.

A Dependency Grammar for Amharic
Michael Gasser
School of Informatics and Computing
Indiana University, Bloomington, Indiana USA
gasser@cs.indiana.edu

Abstract
There has been little work on computational grammars for Amharic or other Ethio-Semitic languages and their use for parsing and
generation. This paper introduces a grammar for a fragment of Amharic within the Extensible Dependency Grammar (XDG) framework
of Debusmann. A language such as Amharic presents special challenges for the design of a dependency grammar because of the complex
morphology and agreement constraints. The paper describes how a morphological analyzer for the language can be integrated into the
grammar, introduces empty nodes as a solution to the problem of null subjects and objects, and extends the agreement principle of XDG
in several ways to handle verb agreement with objects as well as subjects and the constraints governing relative clause verbs. It is shown
that XDG’s multiple dimensions lend themselves to a new approach to relative clauses in the language. The introduced extensions to
XDG are also applicable to other Ethio-Semitic languages.

1. Introduction
Within the Semitic family, a number of languages remain relatively under-resourced, including the second most spoken language in the family, Amharic. Among other gaps in the available resources, there is no computational grammar for even a sizable fragment of the language; consequently analysis of Amharic texts rarely goes beyond morphological analysis, stemming, or part-of-speech tagging.
This paper describes a dependency grammar for a fragment of Amharic syntax. The grammar is based on Extensible Dependency Grammar (XDG), developed by Ralph Debusmann and colleagues (Debusmann et al., 2004; Debusmann, 2007). XDG was selected because of its modular structure, its extensibility, and its simple, declarative format. The paper begins with an overview of XDG and a description of some relevant aspects of Amharic morphosyntax. Then we look at the extensions to XDG that were implemented to handle Amharic null subjects and objects, agreement of verbs with subjects and objects, and some of the special properties of relative clauses. Most of these extensions will also apply to other Semitic languages.

2. Extensible Dependency Grammar
As in other dependency grammar frameworks, XDG is lexical; the basic units are words and the directed, labeled dependency relations between them. In the simplest case, an analysis (“model” in XDG terms) of a sentence is a graph consisting of a set of dependency arcs connecting the nodes in the sentence such that each node other than the root node has a head and certain constraints on the dependencies are satisfied. As in some, but not all, other dependency frameworks, XDG permits analyses at multiple strata, known as dimensions, each corresponding to some level of grammatical abstraction. For example, one dimension could represent syntax, another semantics. Two dimensions may also be related by an explicit interface dimension which has no arcs itself but constrains how arcs in the related dimensions associate with one another. Debusmann includes a total of six simple dimensions and five interface dimensions in the English grammar discussed in his dissertation. In the general case, then, an analysis of a sentence is a multigraph consisting of a separate dependency graph for each dimension over a single sequence of word nodes. Figure 1 shows a possible analysis for the English sentence John edited the paper on two dimensions. The analysis follows the XDG convention of treating the end-of-sentence punctuation as the root of the sentence.

Figure 1: Two-dimensional XDG analysis of an English sentence. Arrows go from head to dependent. Words that do not participate in the semantic dimension are distinguished by delete arcs from the root node.

A grammatical analysis is one that conforms to a set of constraints, each generated by one or another principle. Each dimension has its own characteristic set of principles. Some examples:

• Principles concerned with the structure of the graph; for example, it may be constrained to be a tree or a directed acyclic graph.

• The Valency Principle, governing the labels on the arcs into and out of a given node.

• The Agreement Principle, constraining how certain features within some words must match features in other words.
• The Order Principle, concerned with the order of the words in the sentence.

As the framework is completely lexical, it is at the level of words or word classes that the principles apply. For example, the constraint that a finite present-tense verb in English must agree with its subject on the syntactic dimension could appear in the lexicon in this form:1

- gram: V_FIN_PRES
  syn:
    agree: [sbj]

1 We use YAML syntax (http://www.yaml.org/) for lexical entries.

The lexicon is organized in an inheritance hierarchy, with lexical entries inheriting attributes from their ancestor classes. For example, the verb eats would inherit the subject-verb agreement constraint from the V_FIN_PRES class.
Parsing and generation within the XDG framework take the form of constraint satisfaction. Given an input sentence to be parsed, lexicalization of the words invokes the principles that are referenced in the lexical entries for the words (or inherited from their ancestors in the lexical hierarchy). Each of these principle invocations results in the instantiation of one or more constraints, each applying to a set of variables. For example, a variable is associated with the label on the arc between two given nodes, and the domain for that variable is the set of possible arc labels that can appear on the arc. Among the constraints that apply to such a variable are those that are created by the Valency Principle. For example, for English transitive verbs, there is a valency constraint which requires that exactly one of the arcs leaving the verb must have an obj label. Constraint satisfaction returns all possible combinations of variable bindings, each corresponding to a single analysis of the input sentence.
The XDG framework has been applied to a number of languages, including a small fragment of Arabic (Odeh, 2004), but no one has yet addressed the complexities of morphosyntax that arise with Semitic languages. This paper represents a first effort.

3. Relevant Amharic Morphosyntax

3.1. Verb morphology
As in other Semitic languages, Amharic verbs are very complex (see Leslau (1995) for an overview), consisting of a stem and up to four prefixes and four suffixes. The stem in turn is composed of a root, representing the purely lexical component of the verb, and a template, consisting of slots for the root segments and for the vowels (and sometimes consonants) that are inserted around and between these segments. The template represents tense, aspect, mood, and one of a small set of derivational categories: passive-reflexive, transitive, causative, iterative, reciprocal, and causative reciprocal. For the purposes of this paper, we will consider the combination of root and derivational category to constitute the verb lexeme.
Each lexeme can appear in four different tense-aspect-mood (TAM) categories, conventionally referred to as perfect(ive), imperfect(ive), jussive/imperative, and gerund(ive). We represent verb lexemes in the lexicon in terms of the conventional citation form, the third person singular masculine perfective. For example, the verb aywededm2 ‘he is not liked’ has the lemma tewedede ‘he was liked’, which is derived from the verb root w.d.d.

2 Amharic is written using the Ge’ez script. While there is no single agreed-on standard for romanizing the language, the SERA transcription system, which represents Ge’ez graphemes using ASCII characters (Firdyiwek and Yaqob, 1997), is common in computational work on Amharic and is used in this paper. This transcription system represents the orthography directly, failing to indicate phonological features that the orthography does not encode, in particular, consonant gemination and the presence of the epenthetic vowel that breaks up consonant clusters.

Every Amharic verb must agree with its subject. As in other Semitic languages, subject agreement is expressed by suffixes alone in some TAM categories (perfective and gerundive) and by a combination of prefixes and suffixes in other TAM categories (imperfective and jussive/imperative). Amharic is a null subject language; that is, a sentence does not require an explicit subject, and personal pronouns appear as subjects only when they are being emphasized for one reason or another.
An Amharic verb may also have a suffix representing the person, number, and gender of a direct object or an indirect object that is definite.3 The corresponding suffixes in other Semitic languages are often considered to be clitics or even pronouns, but there are good reasons not to do so for Amharic. First, one or two other suffixes may follow the object suffix. Second, as with subjects, object personal pronouns may also appear but only when they are being emphasized. Thus we will consider Amharic to have optional object agreement as well as obligatory subject agreement and to be a null object as well as a null subject language.

3 In the interest of simplification, indirect objects will be mostly ignored in this paper. Most of what will be said about direct objects also applies to indirect objects.

3.2. Noun phrases
Amharic nouns without modifiers take suffixes indicating definiteness and accusative case for direct objects and prefixes representing prepositions:

hakim
doctor
‘a doctor’ (1)

hakimu
doctor-DEF
‘the doctor’ (2)

hakimun
doctor-DEF-ACC
‘the doctor (as object of a verb)’ (3)

lehakimu
to-doctor-DEF
‘to the doctor’ (4)
However, when a noun is modified by one or more adjectives or relative clauses, it is the first modifier that takes these affixes (Kramer, 2009). If a noun takes a determiner, the noun phrase needs no other indication of definiteness, but it is the determiner that takes the accusative suffix or prepositional prefix.

senefu hakim
lazy-DEF doctor
‘the lazy doctor’ (5)

lesenefu hakim
to-lazy-DEF doctor
‘to the lazy doctor’ (6)

yann senef hakim
that-ACC lazy doctor
‘that lazy doctor (as object of a verb)’ (7)

3.3. Relative clauses
Relative clauses in Amharic consist of a relative verb and zero or more arguments and modifiers of the verb, as in any clause. A relative verb is a verb in either the imperfective or perfective TAM with a prefix indicating relativization. As with a main clause verb, a relative verb must agree with its subject and may agree with its direct object if it has one. Both subjects and objects can be relativized.

yemiwedat sEt
REL-he-likes-her woman
‘the woman that he likes’ (8)

yemiwedat wend
REL-he-likes-her man
‘the man who likes her’ (9)

As noted above, when a noun is modified by a relative clause and has no preceding determiner, it is the relative clause that takes suffixes indicating definiteness or accusative case or prepositional prefixes.

yetemereqew lj wendmE new
REL-he-graduated-DEF boy my-brother is
‘The boy who graduated is my brother.’ (10)

yetemereqewn lj alawqm
REL-he-graduated-DEF-ACC boy I-don’t-know
‘I don’t know the boy who graduated.’ (11)

When a sequence of modifiers precedes a noun, it is the first one that takes the suffixes or prefixes.4

yetemereqew gWebez lj
REL-he-graduated-DEF clever boy
‘the clever boy who graduated’ (12)

4 With two adjectives, both may optionally take the affixes (Kramer, 2009). We consider this to fall within the realm of coordination, which is not handled in the current version of the grammar described in this paper.

Because the first modifier of a noun determines the syntactic role of the noun phrase in the clause as well as its definiteness, we will treat this modifier, rather than the noun, as the syntactic head of the noun phrase. There are at least two other reasons for doing this.

• The head noun of a noun phrase with an adjective or relative clause modifier is optional.

tlqun ’merTalehu
big-DEF-ACC I-choose
‘I choose the big one.’ (13)

yemiwedat alderesem
REL-he-likes-her he-didn’t-arrive
‘(He) who likes her didn’t arrive.’ (14)

Headless relative clauses are found in many languages, for example, in the English translation of sentence (14). What makes Amharic somewhat unusual is that headless relative clauses and adjectives functioning as noun phrases can be formed by simply dropping the noun.

• Relative verbs agree with the main clause verbs that contain them. For example, in example (14) above, the third person singular masculine subject in the main clause verb agrees with the third person singular masculine subject of the relative clause verb.

Therefore we interpret relative clause modifiers as syntactic heads of Amharic nouns. Because XDG offers the possibility of one or more dimensions for semantics as well as syntax, it is straightforward to make the noun the semantic head, much as auxiliary verbs function as syntactic heads while the main verbs they accompany function as semantic heads in Debusmann’s XDG grammar of English. This is discussed further below.

4. XDG for Amharic
In its current incomplete version, our Amharic grammar has a single layer for syntax and a single layer for semantics. The Syntax dimension handles word order, agreement, and syntactic valency.5 The Semantics dimension handles semantic valency.

5 Amharic word order is considerably simpler than that of a language such as English or German, and there are none of the problems of long-distance dependencies in questions and relative clauses that we find in those languages. The only non-projective structures are those in cleft sentences and sentences with right dislocation, neither of which is handled in the current version of our grammar. In a later version, we will separate a projective linear precedence layer from a non-projective immediate dominance layer, as Debusmann does for English and German (2007).

Because the grammar still does not cover some relatively common structures such as cleft sentences and complement clauses, the parser has not yet been evaluated on corpus data.
4.1. Incorporating morphology
For a language like Amharic, it is impractical to list all wordforms in the lexicon; a verb lexeme can appear in more than 100,000 wordforms. Instead we treat the lexeme/lemma as the basic unit; for nouns this is their stem,6 and for verbs, as noted above, this is the root plus any derivational morphemes.

6 Unlike in most other Semitic languages, most Amharic nouns do not lend themselves to an analysis as template+root.

In parsing a sentence, we first run a morphological parser over each of the input words. We use the HornMorpho Amharic parser, available at http://www.cs.indiana.edu/~gasser/Research/software.html and described in Gasser (2009). Given an Amharic word, this parser returns the root (for verbs only), the lemma, and a grammatical analysis in the form of a feature structure description (Carpenter, 1992; Copestake, 2002) for each possible analysis. For example, for the verb ywedatal ‘he likes her’, it returns the following (excluding features that are not relevant for this discussion):

’wedede’, {’tam’: ’impf’,
           ’rel’: False,
           ’sb’: [-p1,-p2,-plr,-fem],
           ’ob’: [-p1,-p2,-plr,+fem]}

That is, it indicates that this is a non-relative verb whose lemma is ‘wedede’ in imperfective TAM with a third person singular masculine subject and a third person singular feminine object.
It is this sequence of lemma-structure tuples rather than raw wordforms that is the input to the usual XDG lexicalization process that initiates parsing. We have not yet implemented generation, but the reverse process will occur there; that is, the output of constraint satisfaction will be a sequence of lemma-structure tuples which will then be passed to a morphological generator (also available in HornMorpho).
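As an illustration of this preprocessing step, the sketch below wraps a morphological analyzer and hands the lexicalizer (lemma, feature structure) tuples instead of raw wordforms. The function analyze_word is a hypothetical stand-in for the actual HornMorpho call, and the feature representation is simplified.

# Sketch of the preprocessing step: each surface word is mapped to a list of
# (lemma, features) analyses, which is what lexicalization then operates on.
def analyze_word(word):
    # Hypothetical stand-in for the real morphological analyzer.
    if word == "ywedatal":
        return [("wedede", {"tam": "impf", "rel": False,
                            "sb": "3sm", "ob": "3sf"})]
    return [(word, {})]

def preprocess(sentence):
    # One list of candidate analyses per input word.
    return [analyze_word(w) for w in sentence.split()]

print(preprocess("yoHans ywedatal"))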
4.3. Subject and object agreement
4.2. Null subjects and objects
In the XDG grammars described by Debusmann and other
XDG is grounded in the words occurring in a sentence, but researchers within the framework, agreement applies to two
it has to come to grips with the mismatch between nodes separate verb attributes. The agrs attribute is a list of pos-
in different dimensions. For example, we probably do not sible features for the verb form, while the agree attribute
want a strictly grammatical word such as the to correspond is a list of arc labels for daughters which must agree with
to anything at all on the semantic dimension. Debusmann the verb. For example, the following could be part of the en-
handles the deletion of surface nodes using del arcs from try for the English very eats, representing the fact that this
the sentence root; this can be seen in the semantic dimen- word has a single possibility for its agreement feature (third
sion in Figure 1. person singular) and the constraint that its subject must also
However, as far as we know, no one has addressed the re- be third person singular.
verse problem, that of nodes in some dimension which cor- - word: eats
respond to nothing on the surface. Null subjects and ob- syn:
jects in a language such as Amharic present such a problem. agrs: [3ps]
They correspond to arguments that need to be explicit at the agree: [sbj]
semantic level but are not present in the input to parsing. This limited approach to agreement fails to address the
We are also working on a synchronous version of XDG with complexity of a language such as Amharic. First, the agrs
dimensions representing syntactic analyses in different lan- attribute must distinguish subject, direct object, and indirect
guages. For a language pair such as Amharic-English, with object features. Second, the agree attribute must specify
Amharic as the input language, the nodes corresponding to which agreement feature of the mother verb agrees with the
English subject and object pronouns will have to come from daughter on the specified arc. Third, the agree attribute
must also allow for agreement with different features of the
somewhere.
daughter when the daughter is verb itself, that is when it is
We solve this problem by introducing “empty nodes” in the the verb of a relative clause. Consider the entry for transi-
syntactic dimension. Each verb creates an empty node for tive verbs (actually a combination of several entries):
its subject, and each transitive verb creates an additional
- gram: V_T
one for its object. The nodes are used only when no ex-
syn:
plicit argument fills their role. We introduce a new XDG agree: {sbj: [sbj, [ˆ,sbj,obj,iobj]],
principle to handle these cases, the Empty Node Principle. obj: [obj, [ˆ,sbj,obj,iobj]]}
When a word invoking this principle is found during lex-
icalization, a constraint is created which sanctions an arc 7
In the Amharic dependency graphs in the figures we show the
from the verb with the relevant label (sbj or obj) to ei- original Ge’ez forms that are the actual input to the parser as well
ther an explicit word or the associated empty node, but not as the transcribed forms.
15/119
This specifies that a transitive Amharic verb agrees with the words on both its outbound sbj and obj arcs, that the subject agrees with the sbj feature of the verb and the object agrees with the obj feature of the verb, and that the agreement feature of the daughter (subject or object) is either the whole word (denoted by ˆ) or, in the case of a relative verb, its sbj, obj or iobj feature.
The following sentence is an example of a transitive verb
whose subject and object features agree with nouns. The
output of the parser on the Syntax dimension for this sen-
Figure 4: Agreement of a topic with a verb’s object suffix.
tence is shown in Figure 3.
astEr yoHansn twedewalec
Aster Yohannis-ACC she-likes-him
‘Aster likes Yohannis.’ (16)

and the noun it modifies does not need to be stated sepa-


Syntax rately in the grammar. For illustration, however, we show
what this constraint would look like in the entry for object
sbj:sbj=^
relative verbs.
root
obj:obj=^
- gram: V_REL_OBJ
አስቴር ዮሐንስን ትወደዋለች syn:
.
astEr yoHansn twedewalec
agree: {obj: [obj, ˆ]}

Sentence (18) is an example of a sentence with an object


Figure 3: Simple subject-verb and object-verb agreement relative clause. The analysis of the sentence by our system
in Amharic. In addition to their arc labels, two arcs show on the Syntax dimension is shown in Figure 5. The ob-
mother and daughter features that agree. In these cases, the ject feature of the relative verb yemtTelaw ‘that she hates
arc label precedes the colon, and the mother and daughter him’ agrees with the modified noun wendlj ‘boy’; both are
features are separated by “=”. third person singular masculine. Two other agreement con-
straints are also satisfied in this sentence. The subject fea-
ture of the main verb tameme ‘he-got-sick’ agrees with the
Note that the verb agreement feature and the arc label need
object feature of the relative verb; both are third person sin-
not be the same. For example, for an important subclass of
gular masculine. The subject feature of the relative verb
Amharic verbs, the object suffix of the verb agrees with a
agrees with its subject astEr; both are third person singular
syntactic argument that we will call the “topic”, which does
feminine.
not take the accusative marker and is not the syntactic sub-
ject. In the following example, the verb’s object suffix is astEr yemtTelaw wendlj tameme
third person singular feminine, agreeing with the nomina- Aster REL-she-hates-him boy he-got-sick
tive topic astEr. ‘The boy that Aster hates got sick.’ (18)
astEr dekmWatal
Aster it-has-tired-her
‘Aster is tired.’ (17) Syntax
sbj:sbj=obj
The verb in this sentence, dekeme ‘tire’, has the following sbj:sbj=^ root
obj:obj=^
in its entry:
- lexeme: dekeme አስቴር የምትጠላው ወንድልጅ ታመመ
.
syn: astEr yemtTelaw wendlj tameme

agree: {obj: [top, [ˆ,sbj,obj,iobj]]}

Figure 4 shows the parser’s analysis of sentence (17). Figure 5: Syntactic analysis of a sentence with a relative
clause.
4.4. Relative clauses
As argued above, relative verbs are best treated as the heads
of their noun phrases. When a relative verb has a head We model the semantics of a sentence with a relative clause
noun, the verb’s subject, object, or indirect object feature as a directed acyclic graph in which the shared noun has
must agree with that noun, depending on the role it plays in multiple verb heads. The relative clause predicate is dis-
the verb’s argument structure. In our grammar, we join the tinguished from the main clause predicate by a rel rather
relative verb to its head noun in the Syntax dimension by than a root arc into it from the sentence root. Figure 6
an arc with a label specifying this role, that is, sbj, obj, shows the analysis of sentence (18) on the Semantics di-
or iobj. Since verbs are already constrained to agree with
mension.
their arguments, the agreement between the relative verb
16/119
Figure 6: Semantic analysis of a sentence with a relative clause.

Relative clauses without nouns have no overt form corresponding to the shared semantic argument, so we introduce this argument as an empty node. Sentence (19) is sentence (18) with the noun wendlj ‘boy’ dropped. The analysis of this sentence is shown in Figure 7.

astEr yemtTelaw tameme
Aster REL-she-hates-him he-got-sick
‘The one that Aster hates got sick.’ (19)

Figure 7: Analysis of a relative clause with no modified noun.

Without further constraints, however, the grammar assigns multiple analyses to some sentences and parses some ungrammatical sentences with relative clauses. Consider the following ungrammatical sentence.

*astEr yemtTelaw wendlj tamemec
Aster REL-she-hates-him boy she-got-sick
‘The boy that Aster hates (she) got sick.’ (20)

This satisfies the constraint that the subject of the main verb tamemec agree with some feature of the relative verb (its subject) and the constraint that some feature of the relative verb (its object) agree with the modified noun wendlj. To exclude sentences like this, we need a further XDG principle, which we call the Cross-Agreement Principle. This specifies a fundamental fact about relative clauses in all languages: that the same noun functions as an argument of two different verbs, the main clause verb and the relative verb. The Cross-Agreement Principle forces the same feature of the relative verb to agree with the main clause verb and the modified noun. By this principle our parser finds no analysis for sentence (20), because the feature of the relative verb yemtTelaw that agrees with the modified noun (its object) differs from the feature that agrees with the main verb (its subject). This is illustrated in Figure 8: the grammar fails to parse this sentence because the features marked with red boxes do not agree.

Figure 8: Violation of the Cross-Agreement Principle. The features in red boxes should match.
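A minimal sketch of the check imposed by this principle, with feature values again simplified to labels such as 3sm and 3sf, is the following (illustrative only):

# Cross-Agreement: one and the same feature of the relative verb must agree
# both with the modified noun and with the main-clause verb's requirement.
def cross_agreement_ok(rel_verb_feats, noun_agr, main_verb_agr):
    return any(v == noun_agr and v == main_verb_agr
               for v in rel_verb_feats.values())

rel = {"sbj": "3sf", "obj": "3sm"}            # yemtTelaw 'that she hates him'
print(cross_agreement_ok(rel, "3sm", "3sm"))  # sentence (18): True (its obj feature)
print(cross_agreement_ok(rel, "3sm", "3sf"))  # sentence (20): False, so no analysis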
5. Conclusions
This paper has described an implementation of Extensible Dependency Grammar for the Semitic language Amharic. Amharic is interesting because it suffers from a serious lack of computational resources and because its extreme morphological complexity and elaborate interactions of morphology with syntax present challenges for computational grammatical theories. Besides the strongly lexical character that it shares with other dependency grammar frameworks, XDG is attractive because of the modularity offered by separate dimensions. We have seen how this modularity permits us to handle the agreement constraints on a relative verb by treating such verbs as the heads of noun phrases on the Syntax, but not the Semantics, dimension. We have also seen that XDG requires some augmentation to deal with null subjects and objects and the intricacies of verb agreement. These complexities of Amharic are not unique. Much of what has been said in this paper also applies to other Ethio-Semitic languages such as Tigrinya. In addition to expanding the coverage of Amharic, further work on this project will be directed at developing synchronous XDG grammars to support translation between the different Semitic languages spoken in Ethiopia and Eritrea.

6. References
Bob Carpenter. 1992. The Logic of Typed Feature Structures. Cambridge University Press, Cambridge.
Ann Copestake. 2002. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, CA, USA.
Ralph Debusmann, Denys Duchier, and Geert-Jan M. Kruijff. 2004. Extensible dependency grammar: A new methodology. In Proceedings of the COLING 2004 Workshop on Recent Advances in Dependency Grammar, Geneva, Switzerland.
Ralph Debusmann. 2007. Extensible Dependency Grammar: A Modular Grammar Formalism Based On Multigraph Description. Ph.D. thesis, Universität des Saarlandes.
Kais Dukes, Eric Atwell, and Abdul-Baquee M. Sharaf. 2010. Syntactic annotation guidelines for the Quranic Arabic treebank. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valletta, Malta.
Yitna Firdyiwek and Daniel Yaqob. 1997. The system for Ethiopic representation in ASCII. URL: citeseer.ist.psu.edu/56365.html.
Michael Gasser. 2009. Semitic morphological analysis and generation using finite state transducers with feature structures. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 309-317, Athens, Greece.
Ruth Kramer. 2009. Definite Markers, Phi Features, and Agreement: a Morphosyntactic Investigation of the Amharic DP. Ph.D. thesis, University of California, Santa Cruz.
Wolf Leslau. 1995. Reference Grammar of Amharic. Harrassowitz, Wiesbaden, Germany.
Marwan Odeh. 2004. Topologische Dependenzgrammatik fürs Arabische. Technical report, Saarland University. Forschungspraktikum.

A Syllable-based approach to verbal morphology in Arabic
Lynne Cahill
University of Brighton
NLTG, Watts Building, Lewes Rd, Brighton BN2 4GJ, UK
E-mail: L.Cahill@brighton.ac.uk

Abstract
The syllable-based approach to morphological representation (Cahill, 2007) involves defining fully inflected morphological forms
according to their syllabic structure. This permits the definition, for example, of distinct vowel constituents for inflected forms where an
ablaut process operates. Cahill (2007) demonstrated that this framework was capable of defining standard Arabic templatic morphology,
without the need for different techniques. In this paper we describe a further development of this lexicon which includes a larger number
of verbs, a complete account of the agreement inflections and accounts for one of the oft-cited problems for Arabic morphology, the
weak forms. Further, we explain how the use of this particular lexical framework permits the development of lexicons for the Semitic
languages that are easily maintainable, extendable and can represent dialectal variation.

1. Introduction
The Semitic languages are linguistically interesting for a number of reasons. One of the most widely discussed aspects of these languages is the so-called templatic morphology, with the typical triliteral verbal (and nominal) roots and their vocalic inflections. In the 1980s a rash of studies emerged discussing ways of describing this morphology and associated problems such as spreading (where only two consonants are specified in the root) and the weak verbs, where one of the consonants in the root is one of the “weak” consonants or glides, waw (/w/) or yaa (/j/).

Cahill (2007) presented an alternative to these approaches which made use of a framework developed to describe European languages which is based on defining the syllabic structure for each word form. The lexicon is defined as a complex inheritance hierarchy. The fundamental assumption behind this work is that the vocalic inflections can be defined in exactly the same way as an ablaut process commonly seen in European languages. Even the less obviously similar derivations which involve “moving around” of the root consonants (for the different binyan1 derivations) can be dealt with using the same apparatus as required for consonant adaptations in European languages.

1 We use the Hebrew term “binyan” to refer to the different derived forms, also known as “measures” or “forms”.

The account in Cahill (2007) describes the basic lexical hierarchy for triliteral verbal roots in MSA, with a single verb root being used to demonstrate the ability to generate the full (potential) range of forms with the framework. The account does not cover the agreement inflections (the prefixes and suffixes), nor does it cover anything other than verbs with triliteral strong roots. In this paper we present the latest extensions to this work, which aims ultimately to provide a complete account of the verbal and nominal morphology of Modern Standard Arabic (MSA).

The key developments we report here are:

1. the addition of the agreement inflections;
2. the addition of the apparatus required for handling non-standard roots.

The first of these does not amount to anything very different from a large number of accounts of affixal morphology within an inheritance framework. The second is more interesting, but turns out to be no more challenging for the framework than various types of phonological conditioning in the morphological systems of many European languages. We illustrate our approach to the weak roots with an analysis of one particular weak root, the defective root r-m-j, “throw”, which has a weak final consonant.

Finally, we discuss the ways in which the framework presented allows for easier extension of the lexicons to enable the development of large-scale lexical resources for the Arabic languages, and how the lexicon structure will permit the definition of dialects in addition to the current account of MSA.

2. MSA verbal morphology
The verbal morphology of the Semitic languages has attracted plenty of attention in both the theoretical and computational linguistics communities. What makes it interesting, particularly from the perspective of those exposed only to European languages, is the structure of the stems, involving consonantal roots, vocalic inflections and templates or patterns defining how the consonants and vowels are ordered. Several approaches to the task have been implemented, most based to some degree on the two-level morphology of Koskenniemi (1983), although once adapted to allow for the formation of Semitic roots, it ended up being four-level morphology
(see e.g. Kiraz (2000)).

The stem formation has already been shown (Cahill, 2007) to be elegantly definable using an approach which was developed mainly for defining European languages such as English and German. We will describe this technique in the next section. However, Semitic morphology, and specifically the morphology of MSA, involves other word formation and inflection processes. One of the areas that has attracted a good deal of attention is the issue of what happens when the verb root, traditionally assumed to consist of three consonants, does not fit this pattern. The three principal situations where this happens are in the case of biliteral or quadriliteral roots, where there are either two or four consonants instead of the expected three, and the weak roots, where one of the consonants is a “weak” glide, i.e. either /w/ or /j/.

Where a root has only two consonants, one or other of those consonants is used as the third (middle) consonant, which one depending on the stem shape. Where a root has four consonants, the possible forms are restricted to forms where there are at least four consonant “slots”. Early accounts of these types of root include a range of means of “spreading”, where post-lexical processes have to be invoked to copy one or other of the consonants (see, e.g. Yip (1988)).

The issue of bi- and quadriliteral roots is relatively simply handled within the syllable-based framework, as described in section 4 below. The weak roots are slightly more complex, but nevertheless amenable to definition in a similar way to the kind of phonological conditioning seen, for example, in German final consonant devoicing, where the realisation of the final consonant of a stem depends on whether it is followed by a suffix beginning with a vowel or not. The Syllable-based Morphology framework has been developed to allow for the realisation of fully inflected forms to be determined in part by phonological characteristics of the root or stem in question. This means that, while Arabic weak roots are often cited as behaving differently morphologically, we argue that they behave entirely regularly morphologically, but their behaviour is determined by their phonology.

3. Syllable-based morphology
The theory of syllable-based morphology (SBM) can trace its roots back to the early work of Cahill (1990). The initial aim was to develop an approach to describing morphological alternation that could be used for all languages and all types of morphology. Cahill’s doctoral work included a very small indicative example of how the proposed approach could describe Arabic verbal stem formation. The basic idea behind syllable-based morphology is simply that one can use syllable structure to define all types of stem alternation, including simple vowel alternations such as ablauts. All stems are defined by default as consisting of a string of tree-structured syllables. Each syllable consists of an onset and a rhyme, and each rhyme of a peak and a coda2. The simplest situation is where all wordform stems of a particular lexeme are the same. In this case, we can simply specify the onsets, peaks and codas for all of the syllables. For example, the English word “pit” has the root /pIt/ and this is also its stem for all forms (singular, plural and possessive). The phonological structure of this word in an SBM lexicon would therefore be defined as follows3:

<phn syl1 onset> == p
<phn syl1 peak> == I
<phn syl1 coda> == t

2 The term “peak” is used to refer to the vowel portion of the syllable, rather than the sometimes used “nucleus”. The syllable structure is relatively uncontroversial, having been first proposed by Pike and Pike (1947).
3 We use the lexical representation language DATR (Evans and Gazdar, 1996) to represent the inheritance network and use SAMPA (Wells, 1989) to represent phonemic representations.

This example is monosyllabic, but polysyllabic roots involve identifying individual syllables by counting from either the left or right of a root. For suffixing languages, the root’s syllables are counted from the right, while for prefixing languages, they are counted from the left. For Arabic, although both pre- and suffixing processes occur, the decision has been made to count from the right, as there is more suffixation. However, as the roots in Arabic, to all intents and purposes, always have the same number of syllables, it is not important whether we choose to call the initial syllable syl1 or syl2.

In the case of simple stem alternations such as ablaut, the peak of a specified syllable is defined as distinct for the different wordforms. That is, the realisation of the peak is determined by the morphosyntactic features of the form. To use a simple example, for an English word man, which has the plural men, we can specify in its lexical entry:

<phn syl1 peak sing> == a
<phn syl1 peak plur> == E.

As the individual consonants and vowels are defined separately for any stem, the situation for Arabic is actually quite straightforward. For each verb form, inflected or derived, the consonants and vowels are defined, not in terms of their position in a string or template, but in terms of their position in the syllable trees. Thus, Cahill (2007) describes how the three consonants can be positioned as the onset or coda of different syllables. The vowels are defined in terms of tense/aspect.

Figure 1 shows how the (underspecified) root structure for the root katab looks. This is defined in DATR as follows4:
<phn syl2 onset> == Qpath:<c1>
<phn syl1 onset> == Qpath:<c2>
<phn syl1 coda> == Qpath:<c3>

4 This is specified at the node for verbs, which defines all of the information that is shared, by default, by all verbs in Arabic.

Figure 1: the structure of /katab/

These equations simply say that (by default) the onset of the initial syllable is filled by the first consonant (c1), the onset of the second syllable is filled by the second consonant (c2) and the coda of the second syllable is filled by the third consonant (c3). The precise position of the consonants depends not only on the binyan, but also on tense. By default, the past tense has the structure in figure 1, but the present tense has that in figure 2.

Figure 2: the structure of /aktub/
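For illustration, the effect of these positional definitions can be rendered procedurally as follows. The stem shapes and slot names below are an illustrative simplification of the DATR hierarchy, not part of the lexicon itself.

# Each stem shape is a sequence of (onset, peak, coda) slots; a slot is either
# a fixed vowel or a pointer to one of the root consonants c1, c2, c3.
ROOT = {"c1": "k", "c2": "t", "c3": "b"}

STEM_SHAPES = {
    "past":    [("c1", "a", None), ("c2", "a", "c3")],   # ka.tab (figure 1)
    "present": [(None, "a", "c1"), ("c2", "u", "c3")],   # ak.tub (figure 2)
}

def realise(root, tense):
    segments = []
    for syllable in STEM_SHAPES[tense]:
        for slot in syllable:
            if slot is None:
                continue
            segments.append(root.get(slot, slot))
    return "".join(segments)

print(realise(ROOT, "past"))     # katab
print(realise(ROOT, "present"))  # aktub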

Affixation is handled as simple concatenation, such that (syllable-structured) affixes concatenate with (syllable-structured) stems to make longer strings of structured syllables. For a simple case such as English noun plural suffixation, for example, we need to specify that a noun consists of a stem and a suffix. We then need to state that, by default, the suffix is null, and that in the case of the plural form, a suffix is added.

<mor word form> == "<phn root>" "<mor suffix>"
<mor suffix> == Null
<mor suffix plur> == Suffix_S:<phn form>

As we are dealing with phonological forms, we also need to specify how the suffix is realised, which is defined at the separate “Suffix_S” node5.

5 For more detail of this type of SBM definition for German, English and Dutch, see Cahill and Gazdar (1999a, 1999b).

One of the key aspects of SBM is that all forms are defined in terms of their syllable structure. This does lead to a slight complication with affixes which consist of a single consonant, for example. The SBM approach to this is to say that there is a necessary post-lexical resyllabification process which takes place after all affixes have been added, and so it is not a problem to define affixes as (at least) single syllables, even if they are syllables with no peaks. Although this may seem a little counter-intuitive, the issue of resyllabification is clearly one which must be addressed. If we affix –ed (/Id/) to an English verb stem which ends in a consonant, it is almost always the case that that consonant becomes the onset of the suffix syllable, while it is the coda of the final syllable of the stem if no affix is added. Indeed, in most languages it is even the case that resyllabification takes place across word boundaries in normal speech.

4. Extensions to the framework
Cahill’s (2007) account of Arabic morphology only covered the stem formation, and did not attempt to cover anything other than straightforward triliteral strong verb roots. In fact, the fragment published in the appendix of that paper includes a single example verb entry, an example of a standard strong triliteral verb. In this section we discuss the three ways in which we have, to date, extended the lexicon.

4.1 Adding more lexemes
We have extended the lexicon initially to include a larger number of strong, triliteral verbs. This is an extremely simple process in the lexicon structure provided, as all that needs to be specified are the three consonants in the root. This does result in overgeneration, as all possible stems, for all binyanim, are generated. However, it is a simple process to block possible forms, and there is a genuine linguistic validity to the forms, such that, if a particular verb has a Binyan 9 form, then we know what form it will take.

The issue of how many binyanim to define is an interesting one, and one we will come back to in the discussion of extending coverage to dialects of Arabic. Classical Arabic has a total of fifteen possible binyanim, while MSA makes use of ten of these standardly and two more in a handful of cases.

4.2 Agreement inflections
The next extension to the existing lexicon was to add the agreement inflections. These include prefixes and suffixes and mark the person, number and gender of the form. As noted above, the affixal inflections do not pose any particular difficulties for the syllable-based framework.
The “slots” for the affixes were already defined in the positions. This leads to phonologically conditioned
original account, so it was simply a case of specifying the variation from the standard stem forms. For example, the
realisations. The exact equations required for this will not hollow verb zawar (“visit”) has a glide in second
be covered in detail here, but we note that the affixes consonant position. This leads to stem forms with no
display typical levels of syncretism and default behaviour middle consonant, and a u in place of the two as. In order
so that, for example, we can specify that the default to allow for this variation, we need to check whether the
present tense prefix is t- as this is the most frequent second consonant is a glide and this will determine the
realisation, but the third person present tense prefix is y- realisations. This check must be done for each onset, peak
while the third person feminine prefix is t-. This kind of and coda that is defined as having possible variation, and
situation occurs often in defining default inheritance involves a simple check whether the second consonant is
accounts of inflection and is handled by means of the a /w/ or a /j/. In each case the behaviour is the same for the
following three equations6: consonant itself, i.e. it is omitted, but different for the
<agr prefix pres> == t vowel. With /w/, the vowel is /u/ but with /y/ it is /i/, in the
<agr prefix pres third> == y second vowel position.
<agr prefix pres third femn> == t
4.3 Non-standard verb roots
The final extension which we report in this paper is the adaptation of the framework for stem structure to take account of the different types of verb root, as discussed in section 2.

Dealing with biliteral roots involves specifying, for each consonant (i.e. onset or coda) defined in the stem structure, whether it should take the first or third consonant value if the second consonant is unspecified. Thus, biliteral roots have their second consonants defined thus:

<c2> == Undef

Then, an example of defining the correct consonant involves a simple conditional statement:

<phn syl2 onset> ==
    IF:<EQUAL:<"<c2>" Undef>
    THEN "<c1>"
    ELSE "<c2>"

This simply states that, if the second consonant is unspecified, then the first consonant takes this particular position, but if not, then the second consonant will take its normal place. Positions where the absent second consonant is represented by the third consonant simply require the third line above to give c3 rather than c1 as the value.

In order to handle quadriliteral roots, we need a separate node for these verbs which defines which of the consonants occupies each consonant slot in the syllable trees. In many cases these are inherited from the Verb node; for example, the first consonant behaves the same in these roots. Typically, where a triliteral root uses c1, a quadriliteral root will use c1; where a triliteral root uses c3, the quadriliteral root will use c4 in most cases, but c3 in others; where a triliteral root uses c2, the quadriliteral root will either use c2 or c3, so these equations have to be specified.

The weak roots have a glide in one of the consonant positions. This leads to phonologically conditioned variation from the standard stem forms. For example, the hollow verb zawar (“visit”) has a glide in second consonant position. This leads to stem forms with no middle consonant, and a u in place of the two as. In order to allow for this variation, we need to check whether the second consonant is a glide, and this will determine the realisations. This check must be done for each onset, peak and coda that is defined as having possible variation, and involves a simple check whether the second consonant is a /w/ or a /j/. In each case the behaviour is the same for the consonant itself, i.e. it is omitted, but different for the vowel: with /w/, the vowel is /u/ but with /j/ it is /i/, in the second vowel position.

There are two possible approaches we could take to defining the different behaviour of weak verbs. The first is to specify a finite state transducer to run over the default forms. For example, we could state that if a verb root has the sequence /awa/ then this becomes /u:/. The second approach is to define the elements of the syllable structure according to the phonological context in which they occur. We opt for the second of these approaches for a number of reasons. The first is that we wish to minimise the different technologies used in our lexicon. Although FSTs are very simple to implement, we want to resist using them if possible, in order to make use only of the default inheritance mechanisms available to us. The second is that we are not yet at a stage in the project where we have enough varied data for all of the different verb and noun forms to be certain that any transducer we devise will not overapply, whereas we can be more confident of the specific generation behaviour of the inheritance mechanisms we are employing in the lexicon structure as a whole.

One disadvantage of the approach we have chosen to take is that it does result in somewhat more complex definitions in our lexical hierarchy. For example, if we only define strong triliteral verb roots, then our lexical hierarchy can include statements like:

<phn syl1 onset> == Qpath:<c1>

which are very simple. If we include all of the variation in this hierarchy then we need more statements (to distinguish between, for example, past and present tense behaviour) and those statements are more complex. This is because, even for the standard strong triliteral roots, we need to check for each consonant whether or not it is weak, and for each vowel we need to check whether it is adjacent to a weak consonant. For this reason, we do not include the DATR code which defines the weak verb forms, but rather describe the checks needed.

The approach we take involves two levels of specification. At the first level, each equation defining a consonant or a vowel calls on a simple checking function to determine
whether the realisation is the default one or something different. These calls to checking functions may take different arguments. Thus, the simplest type just needs to be passed the root consonant in question and will determine whether it is realised (if it is strong) or not (if it is weak). In more complex situations, e.g. where a weak root has /u:/ where it would by default have /awa/, we need to pass both the consonant and at least one of the vowels.

The checking nodes are each very simple. The simplest just state that the consonant is realised if it is strong but not if it is weak:

Check_ajwaf_cons:
    <$weak> ==
    <$strong> == $strong.

We add similar checks to the equations for vowels so that, instead of the default stem form of /zawar/, we get the correct (first and second person7) stem of /zur/. The other weak forms involve similar checks for the other consonants.

7 The discussion here has been simplified for the sake of brevity. The first and second person stem forms are the same, and are defined here, but the third person stems are different. This is not a problem for our account, as the framework is specifically designed to allow both morphosyntactic and phonological information to be used in determining the correct form.
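The effect of such a check can be sketched procedurally as follows; this is a deliberately simplified illustration of the hollow-root case only (the function name and the stem shape are assumptions made for the example), not the DATR definitions themselves.

# If the middle consonant is a glide it is omitted and the stem vowel takes
# its quality from the glide (/u/ for w, /i/ for j); strong roots keep the
# default CaCaC shape.
GLIDE_VOWEL = {"w": "u", "j": "i"}

def perfective_stem(c1, c2, c3, default_vowel="a"):
    if c2 in GLIDE_VOWEL:                      # hollow root: no middle consonant
        return c1 + GLIDE_VOWEL[c2] + c3
    return c1 + default_vowel + c2 + default_vowel + c3

print(perfective_stem("z", "w", "r"))   # zur   (first/second person stem of zawar)
print(perfective_stem("k", "t", "b"))   # katab (strong roots are unaffected)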

These checks are very similar to the checks we can see in the syllable-based accounts of, for example, German (Cahill and Gazdar, 1999a). The realisation of the final consonant in any stem in German is dependent on whether or not there is a suffix which begins with a vowel. Therefore, the equation specifying that consonant checks for the beginning of the suffix (if there is one) and for underlying voiced consonants returns the voiced variant only if there is a vowel following, and returns the voiceless variant otherwise.

To clarify the entire process involved in generating a verb form from our lexicon, we shall now describe the derivation of the present tense active third person plural masculine form of the verb “throw” (they(m) throw). This is a weak (defective) verb, with a root of r-m-j. The first thing we do is look for the agreement prefix. Our Verb node tells us that this is /j/. Next we need to determine the stem for this form. The stem is defined as having /a/ as the peak of the first syllable (the default value for all present tense forms) and the first consonant of the root, i.e. /r/, as the coda of the first syllable. We determine this by checking whether it is weak or not. Once we have determined that it is a strong consonant, it takes its place in the syllable structure. The onset of the second syllable is the second consonant, in this case /m/, just as it is in most stems. Once again, we check that this is not a weak consonant before placing it in its position. At this point we start to find different behaviour. If the final consonant was

7 The discussion here has been simplified for the sake of brevity. The first and second person stem forms are the same, and are defined here, but the third person stems are different. This is not a problem for our account, as the framework is specifically designed to allow both morphosyntactic and phonological information to be used in determining the correct form.

5. Future directions

The extensions we report on here are only the start of a program of research which will add nouns and other non-regular morphological forms (e.g. the broken plural). The project is also going to add orthographic forms, derived from the phonological and morphological information, and supplement these with information about the relative frequency of the ambiguous forms in unvocalised script.

5.1 Extension of the lexicon

The DATR implementation of the lexicon is based on the lexicon structure of PolyLex (Cahill and Gazdar, 1999b). This gives the lexicon two big advantages over other lexicon structures. The first is the default inheritance machinery, which allows very easy extension. It is extremely easy to add large numbers of new lexemes automatically, as long as the hierarchy defines all of the variation. The task is simply to add basic root information (the consonants and the meaning and any irregularities peculiar to that lexeme – although there should not be many irregularities in new additions, as the most frequent words will have been added, and it is usually the more frequent words which are irregular) and choose the node in the hierarchy from which it should inherit. The PolyLex project developed tools to allow the generation of large numbers of additional lexical entries from a simple database format which includes the important information.
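The following minimal Python sketch illustrates this style of lexical extension. It is our own illustration only, not the DATR/PolyLex implementation itself, and the node names, attribute names and pattern values are invented for the example: a new lexeme is added by supplying just its root consonants and meaning and the node it inherits from, with everything else coming from the hierarchy by default.

    # Minimal default-inheritance sketch; hypothetical names and values.
    class Node:
        def __init__(self, parent=None, **overrides):
            self.parent = parent
            self.overrides = overrides

        def get(self, attribute):
            # Look for a local (overriding) value first, then fall back
            # to the parent node by default.
            if attribute in self.overrides:
                return self.overrides[attribute]
            if self.parent is not None:
                return self.parent.get(attribute)
            raise KeyError(attribute)

    VERB = Node(past_pattern="CaCaC", pres_pattern="CCuC")   # default verb behaviour
    STRONG_VERB = Node(parent=VERB)                          # nothing to override

    # Adding a new root needs only the consonants, the meaning and the
    # node it inherits from; the morphology comes from the hierarchy.
    ktb = Node(parent=STRONG_VERB, root=("k", "t", "b"), meaning="write")

    print(ktb.get("root"), ktb.get("past_pattern"))          # ('k', 't', 'b') CaCaC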

in its plural form, we can add it as a regular noun, and generate a plural form which adds the –s suffix. This may not be correct, but it is a reasonable guess, and the kind of behaviour we would expect from a human learning a language. This is useful if the data we use to extend our lexicons comes, for example, from corpora – often a necessity for languages which do not have large established resources.

In terms of the Arabic lexicon we describe here, the forms of verbs, even those with weak roots, do not need any further specification, as the lexical hierarchy defines the alternation in terms of the phonological structure of the root. Therefore, if a newly added root has a weak consonant, the correct behaviour will automatically be instigated by the recognition of that weak consonant.

This process has already been tested with a random selection of 50 additional strong verbs, two weak verbs for each of the consonant positions (i.e. two with weak initial consonants, two with weak medial consonants and two with weak final consonants) and one with two weak consonants. The resulting forms for some of these verbs are included in Appendix 2.

5.2 Adding more dialects

Another issue which causes much concern in the representation and processing of Arabic is the question of the different varieties or dialects. Buckwalter (2007) says “... the issue of how to integrate morphological analysis of the dialects into the existing morphological analysis of Modern Standard Arabic is identified as the primary challenge of the next decade.” (p. 23). Until relatively recently, the issue of dialects in Arabic was only relevant for phonological processing, as dialects did not tend to be written. However, the rapid expansion of the Internet, amongst other developments, means that written versions of the various dialects are increasingly used, and processing of these is becoming more important.

The PolyLex architecture was developed as a multilingual representation framework, particularly aimed at representing closely related languages (the PolyLex lexicons themselves include English, German and Dutch). The framework involves making use of extended default inheritance to specify information which is shared, by default, by more than one language, with overrides being used to specify differences between languages as well as variant behaviour within a single language (such as irregular or sub-regular inflectional forms). In the case of English, German and Dutch, for example, it is possible to state that, by default, nouns form their plural by adding an –s suffix. This is true of all regular nouns in English and of one class of nouns in both Dutch and German. Importantly, those classes in Dutch and German are the classes that new nouns tend to belong to, so assuming that class to be the default works well.

One of the great advantages of such a framework is that, being designed to work for closely related languages, it is also appropriate for dialects of a single language. We can map the situation for MSA8 and the dialects onto this directly, with MSA taking the place of the multilingual hierarchy and the dialects taking the place of the separate languages here. The assumption is that, by default, the dialects inherit information (about morphology, phonology, orthography, syntax and semantics) from the MSA hierarchy, but any part of that information can be overridden lower down for individual dialects. There is nothing to prevent a more complex inheritance system, for example, to allow two dialects to share information below the level of the MSA hierarchy, but to also specify some distinct bits of information.

8 It may prove more accurate and useful to have Classical Arabic in the multilingual position, as this probably includes more of the range of forms that the different dialects would need to inherit.
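A rough way to picture this arrangement is as a thin layer of dialect-specific overrides consulted before the MSA defaults. The short Python sketch below is again only our illustration, with invented attribute names and values, and is not part of the PolyLex lexicons:

    # Hypothetical illustration: a dialect inherits from MSA by default
    # and overrides only the pieces of information that differ.
    from collections import ChainMap

    msa = {
        "negation": "laa",           # invented default values for the example
        "q_realisation": "q",
        "plural_suffix": "uwna",
    }

    egyptian_overrides = {
        "negation": "mish",          # dialect-specific values
        "q_realisation": "'",
    }

    egyptian = ChainMap(egyptian_overrides, msa)

    print(egyptian["negation"])       # mish  (overridden by the dialect)
    print(egyptian["plural_suffix"])  # uwna  (inherited from the MSA layer)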
6. Conclusions

The approach to Arabic morphology presented here is still in the early stages of development. It does, nevertheless, demonstrate a number of crucial points. First, it backs Cahill (2007) in showing that the SBM approach appears to be adequate to define those aspects of Arabic morphology that have frequently been cited as problematic. It is important to establish proof of concept in employing a new approach to specifying the morphology of any language, and the (admittedly small) lexicon does demonstrate the possibility of handling bi- and quadriliteral roots as well as weak verb roots within the SBM framework. Although not all of the details for all of the verbal morphology have yet been implemented, nothing has been shown to cause any significant difficulties that cannot be overcome in the framework.

Secondly, having established that the approach appears to be feasible for the complexities of Arabic morphology, it follows that the implementation of the morphology in the form of a PolyLex-style lexicon will permit the definition of dialectal variation, thus allowing the development of a full lexicon structure defining MSA, Classical Arabic as well as regional variants in an efficient and practically maintainable way. Although the details remain to be worked out, the assumed structure would involve a core lexicon which defines, for example, all fifteen of the Classical Arabic binyanim, with each of the lexicons for a “real” language specifying which of those are employed within that language or dialect.

The PolyLex lexicon structure allows the definition of defaults, which can be overridden at any of a number of levels. It is possible to override some pieces of information for an entire language or dialect, for a word-class such as nouns, for a sub-class of nouns or verbs or for an individual lexeme. This makes it very efficient at representing lexical information which tends

to be very largely regular. It also makes it very easy to add new lexemes, even if it has not been wholly established what all of the correct forms of that lexeme are. To use an analogy from child language acquisition, a child hearing an English noun will assume that its plural is –s unless and until they hear an irregular plural form for it. Similarly, a child learning Arabic will assume that a new verb it hears follows the default, regular patterns unless and until they hear non-regular forms. That is the kind of behaviour that our default inheritance lexicon models when adding new lexemes.

7. Acknowledgements

The work reported here was undertaken as part of the ESRC (UK) funded project Orthography, phonology and morphology in the Arabic lexicon, grant number RES-000-22-3868. Their support is gratefully acknowledged. I am also grateful to the anonymous reviewers for their constructive comments.

8. References

Buckwalter, Tim. (2007) Issues in Arabic Morphological Analysis. In Abdelhadi Soudi, Antal van der Bosch and Günther Neumann (eds.) Arabic Computational Morphology. Dordrecht: Springer. pp. 23-41.
Cahill, Lynne. (2007) A Syllable-based Account of Arabic Morphology. In Abdelhadi Soudi, Antal van der Bosch and Günther Neumann (eds.) Arabic Computational Morphology. Dordrecht: Springer. pp. 45-66.
Cahill, Lynne. (1990) Syllable-based Morphology. COLING-90, Vol 3, pp. 48-53, Helsinki.
Cahill, Lynne and Gazdar, Gerald. (1999a) German noun inflection. Journal of Linguistics, 35:1, pp. 211-245.
Cahill, Lynne and Gazdar, Gerald. (1999b) The PolyLex architecture: multilingual lexicons for related languages. Traitement Automatique des Langues, 40:2, pp. 5-23.
Evans, Roger and Gazdar, Gerald. (1996) DATR: a language for lexical knowledge representation. Computational Linguistics, 22:2, pp. 167-216.
Kiraz, George. (2000) A Multi-tiered Non-linear Morphology using Multi-tape Finite State Automata: A Case Study on Syriac and Arabic. Computational Linguistics, 26:1, pp. 77-105.
Koskenniemi, Kimmo. (1983) Two-level Morphology: A General Computational Model for Word-form Recognition and Production. PhD Dissertation, University of Helsinki.
Pike, Kenneth L. and Pike, Eunice V. (1947) Immediate constituents of Mazateco syllables. International Journal of American Linguistics, 13, pp. 78-91.
Wells, John. (1989) Computer-coded phonemic notation of individual languages of the European Community. Journal of the International Phonetic Association, 19:1, pp. 31-54.
Yip, Moira. (1988) Template Morphology and the Direction of Association. Natural Language and Linguistic Theory, 6.4, pp. 551-577.

Appendix: Sample output

The DATR-implemented lexicon can be compiled and queried. In this appendix, we include the full lexical dumps for three lexemes: the fully regular strong triliteral, k-t-b, “write”; the weak (defective) verb r-m-y, “throw”; and the “doubly” weak verb T-w-y, “fold”. The dumps give the present and past active forms for the first binyan.

Write:<bin1 mor word past act first sing> = k a t a b t u.
Write:<bin1 mor word past act first plur> = k a t a b n a:.
Write:<bin1 mor word past act secnd sing masc> = k a t a b t a.
Write:<bin1 mor word past act secnd sing femn> = k a t a b t i.
Write:<bin1 mor word past act secnd plur masc> = k a t a b t u m.
Write:<bin1 mor word past act secnd plur femn> = k a t a b t u n n a.
Write:<bin1 mor word past act third sing masc> = k a t a b a.
Write:<bin1 mor word past act third sing femn> = k a t a b a t.
Write:<bin1 mor word past act third plur masc> = k a t a b u:.
Write:<bin1 mor word past act third plur femn> = k a t a b n a.
Write:<bin1 mor word pres act first sing> = a k t u b u.
Write:<bin1 mor word pres act first plur> = n a k t u b u.
Write:<bin1 mor word pres act secnd sing masc> = t a k t u b u.
Write:<bin1 mor word pres act secnd sing femn> = t a k t u b i: n a.
Write:<bin1 mor word pres act secnd plur masc> = t a k t u b u: n a.
Write:<bin1 mor word pres act secnd plur femn> = t a k t u b n a.
Write:<bin1 mor word pres act third sing masc> = j a k t u b u.
Write:<bin1 mor word pres act third sing femn> = t a k t u b u.
Write:<bin1 mor word pres act third plur masc> = j a k t u b u: n a.
Write:<bin1 mor word pres act third plur femn> = j a k t u b n a.

Throw:<bin1 mor word past act first sing> = r a m a j t u.
Throw:<bin1 mor word past act first plur> = r a m a j n a:.
Throw:<bin1 mor word past act secnd sing masc> = r a m a j t a.
Throw:<bin1 mor word past act secnd sing femn> = r a m a j t i.
Throw:<bin1 mor word past act secnd plur masc> = r a m a j t u m.

Throw:<bin1 mor word past act secnd plur femn> = r a m a j t u n n a.
Throw:<bin1 mor word past act third sing masc> = r a m a a.
Throw:<bin1 mor word past act third sing femn> = r a m a t.
Throw:<bin1 mor word past act third plur masc> = r a m a w.
Throw:<bin1 mor word past act third plur femn> = r a m a j n a.
Throw:<bin1 mor word pres act first sing> = a r m i:.
Throw:<bin1 mor word pres act first plur> = n a r m i:.
Throw:<bin1 mor word pres act secnd sing masc> = t a r m i:.
Throw:<bin1 mor word pres act secnd sing femn> = t a r m i: n a.
Throw:<bin1 mor word pres act secnd plur masc> = t a r m u: n a.
Throw:<bin1 mor word pres act secnd plur femn> = t a r m i: n a.
Throw:<bin1 mor word pres act third sing masc> = j a r m i:.
Throw:<bin1 mor word pres act third sing femn> = t a r m i:.
Throw:<bin1 mor word pres act third plur masc> = j a r m u: n a.
Throw:<bin1 mor word pres act third plur femn> = j a r m i: n a.

Fold:<bin1 mor word past act first sing> = T a w a j t u.
Fold:<bin1 mor word past act first plur> = T a w a j n a:.
Fold:<bin1 mor word past act secnd sing masc> = T a w a j t a.
Fold:<bin1 mor word past act secnd sing femn> = T a w a j t i.
Fold:<bin1 mor word past act secnd plur masc> = T a w a j t u m.
Fold:<bin1 mor word past act secnd plur femn> = T a w a j t u n n a.
Fold:<bin1 mor word past act third sing masc> = T a w a:.
Fold:<bin1 mor word past act third sing femn> = T a w a t.
Fold:<bin1 mor word past act third plur masc> = T a w a w.
Fold:<bin1 mor word past act third plur femn> = T a w a j n a.
Fold:<bin1 mor word pres act first sing> = a T w i:.
Fold:<bin1 mor word pres act first plur> = n a T w i:.
Fold:<bin1 mor word pres act secnd sing masc> = t a T w i:.
Fold:<bin1 mor word pres act secnd sing femn> = t a T w i: n a.
Fold:<bin1 mor word pres act secnd plur masc> = t a T w u: n a.
Fold:<bin1 mor word pres act secnd plur femn> = t a T w i: n a.
Fold:<bin1 mor word pres act third sing masc> = j a T w i:.
Fold:<bin1 mor word pres act third sing femn> = t a T w i:.
Fold:<bin1 mor word pres act third plur masc> = j a T u: n a.
Fold:<bin1 mor word pres act third plur femn> = j a T w i: n a.
Using the Yago ontology as a resource for the enrichment of
Named Entities in Arabic WordNet
Lahsen Abouenour1, Karim Bouzoubaa1, Paolo Rosso2
1 Mohammadia School of Engineers, Med V University
Rabat, Morocco
2 Natural Language Engineering Lab. - ELiRF, Universidad Politécnica
Valencia, Spain
E-mail: abouenour@yahoo.fr, karim.bouzoubaa@emi.ac.ma, prosso@dsic.upv.es

Abstract
The development of sophisticated applications in the field of Arabic Natural Language Processing (ANLP) depends on the availability of resources. In the context of previous work related to the domain of Arabic Question/Answering (Q/A) systems, a semantic Query Expansion approach using Arabic WordNet (AWN) has been evaluated. The obtained results showed that, although AWN (one of the few available resources) has a low coverage of the Arabic language, it helps to improve performance. The evaluation process integrates a Passage Retrieval (PR) system which helps to rank the returned passages according to their structural similarity with the question. In this paper, we investigate the usefulness of enriching AWN by means of the Yago ontology. Preliminary experiments show that this technique helps to extend more of the processed questions and to improve the results.

1. Introduction

Arabic Natural Language Processing (ANLP) has known interesting attempts in the last years, especially in morphology and less advanced Information Retrieval (IR) systems. However, the development of more sophisticated applications such as Question/Answering (Q/A), Search Engines (SEs) and Machine Translation (MT) still has a common problem: the lack of available electronic resources.

The Arabic WordNet (AWN) ontology (Elkateb et al., 2006) is one of these few resources. The AWN1 is a lexical ontology composed of 23,000 Arabic words and about 10,000 synsets (sets of words having a common meaning). The design of AWN presents many advantages for its use in the context of ANLP. Indeed, AWN has the same structure as the Princeton WordNet (PWN) (Fellbaum, 2000) and the WordNets of other languages. The AWN ontology is also a semantic resource since it contains relations between its synsets and links to the concepts of the Suggested Upper Merged Ontology (SUMO) (Niles & Pease, 2003). The advantages described above show that AWN can contribute to the development of sophisticated applications as well as the development of cross-language systems.

1 http://www.globalwordnet.org/AWN/

In a previous work on Arabic Q/A, (Abouenour et al., 2009b) proposed a Query Expansion (QE) approach which relies on the AWN content and its semantic relations, namely synonymy, hypernymy, hyponymy and definition. The proposed approach has improved the performances in terms of accuracy, the MRR2 and the number of answered questions.

2 Mean Reciprocal Rank (MRR) is defined as the average of the reciprocal ranks of the results for a sample of queries (the reciprocal rank of a query response is the multiplicative inverse of the rank of the correct answer).

The reached performances have been considered encouraging for the following reasons:
• 2,264 questions from well-known question sets in the field of Q/A and IR, namely the TREC3 and CLEF4 collections, were used;
• the difference in performances before and after using AWN is significant;
• experiments have been conducted in an open domain (the web), which is a challenging context for Q/A systems.

3 Text REtrieval Conference, http://trec.nist.gov/data/qa.html
4 Cross Language Evaluation Forum, http://www.clef-campaign.org

Even though AWN has a low coverage of the Arabic language compared to other languages such as English, it helped to improve performances.

In order to further enhance performances, the idea is to develop and use a more enriched release of the AWN ontology. The enrichment of AWN could be done along different lines: adding new synsets, enriching the existing synsets, enriching the hyponymy/hypernymy relations, verb categorization, Named Entity (NE) synsets, glosses, forms (for instance broken plurals), etc. The aim was to focus on the AWN gaps related to the question collections used. Therefore, the authors have performed an analysis of

the questions which contain keywords that cannot be found in AWN (not extensible questions) and those for which the system could not reach the expected answer (not answered questions). For the two types of questions, they investigated both the keywords forming the questions and the type of the expected answer.

The analysis showed that for a high percentage of the considered questions, both the question keywords and answers are NEs. Hence, the enrichment of the NE content in the AWN ontology could help us to reach higher performances.

In this paper, we present an attempt to perform an automatic AWN enrichment for the NE synsets. Indeed, the use of a NER system (if such a system is available and accurate in the context of the Arabic language) only allows identifying NEs and information related to them, whereas adding NEs in AWN also helps to identify synsets which are semantically related to them (synonyms, subtypes, supertypes, etc.). Moreover, such enrichment could also be useful in the context of other ANLP and cross-language tasks.

The current work is based on the Yago5 ontology, which contains 2 million entities (such as persons, organizations, cities, etc.). This ontology (Suchanek et al., 2007) contains 20 million facts about these entities. The main reasons behind using this resource are:
• its large coverage of NEs can help to improve performances in the context of Arabic Q/A systems;
• its connection to the PWN and the SUMO ontology (Gerard et al., 2008) can help us to transfer the large coverage of Yago to the AWN ontology.

5 Yet an Other Great Ontology, available at http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html

The rest of the paper is structured as follows: Section 2 describes works using AWN; Section 3 presents the technique proposed for the AWN enrichment; Section 4 is devoted to the presentation of the preliminary experiments that we have conducted on the basis of the Yago content; in Section 5 we draw the main conclusions of this work and we discuss future work.

2. Arabic WordNet in previous works

There are many works that have integrated AWN as a lexical or a semantic resource. To our knowledge, most of these works belong to the Arabic IR and Q/A fields. Indeed, in (El Amine, 2009), AWN has been used as a lexical resource for a QE process in the context of the IR task.

In the context of Q/A systems, the authors in (Brini et al., 2009) have proposed an Arabic Q/A system called QASAL. They have reported that it will be necessary in future works to consider the synonymy relations between AWN synsets at the question analysis stage of the proposed system. In (Benajiba et al., 2009), the authors have reported that the use of AWN would allow exploring the impact of semantic features for the Arabic Named Entity Recognition (NER) task, which is generally included in the first question analysis step of a Q/A process (generally composed of three steps: question analysis, passage retrieval and answer extraction).

In (Abouenour et al., 2008; Abouenour et al., 2009a), the authors have shown how it is possible to build an ontology for QE and semantic reasoning in the context of the Arabic Q/A task. In addition, the usefulness of AWN as a semantic resource for QE has been proved in the recent work of (Abouenour et al., 2009a), where the authors have considered not only the lexical side of AWN, but also its semantic and knowledge parts. Moreover, the QE process based on AWN has been used together with a structure-based technique for Passage Retrieval (PR). Indeed, the first step of our approach is retrieving a large number of passages which could contain the answer to the entered question. Generally, the answer is expected to appear in those passages nearer to the other keywords of the question or to the terms which are semantically related to those keywords. Therefore, new queries from the question were generated by replacing a keyword by its related terms in AWN regarding the four semantic relations mentioned previously.

In the second step of the described approach, the returned passages have to be ranked according to the structure similarity between the passages and the question. Thus, this step allows decreasing the number of passages to be processed at the answer extraction stage.

The conducted experiments showed an improvement of performances thanks to our two-step approach based on the AWN ontology for QE and the Java Information Retrieval System6 (JIRS) (Gomez et al., 2007) for structure-based PR. The analysis of the obtained results showed that:
• a high percentage (46.2%) of the TREC and CLEF questions are NE questions;
• the enrichment of the NE content in AWN will allow extending 69% of the non extensible questions;
• for a high percentage of the considered questions (50%), we can reach a similarity (between the question and passages) equal to or higher than 0.9 and an average of 0.95 (max is 1) by using AWN together with JIRS.

6 http://jirs.dsic.upv.es

Thus, according to this analysis, the priority in terms of AWN enrichment is clear: in order to evaluate the QE and structure-based approach, we have to enlarge and refine the coverage, hierarchy and relations related to the NE synsets in AWN.

In the next section, we describe how resources belonging to other languages could be used for the enrichment of the NE content in AWN.

3. Enrichment of Arabic WordNet using Yago

Given the great number of words of the Modern Standard Arabic (MSA) language, the current release of AWN, which has been manually built, still has to be enlarged. Automatic enrichment is a promising way for AWN to reach a large coverage of MSA. In this context, the authors in (Al Khalifa and Rodriguez, 2009) have proposed a new approach for automatically extending the NE coverage of AWN. This approach relies on Wikipedia7. The evaluation done in that work shows that 93.3% of the NE synsets which were automatically recovered are correct. However, due to the small size of the Arabic Wikipedia, only 3,854 Arabic NEs have been recovered.

7 www.wikipedia.org/

Our approach proposes using a freely available ontology with a large coverage of NEs instead of the Arabic Wikipedia. In addition to Yago, the field of open source ontologies provides interesting resources and attempts which belong either to the specific or the open domain category: OpenCyc (Matuszek et al., 2006), Know-ItAll (Etzioni et al., 2004), HowNet8, SNOMED9, GeneOntology10, etc.

8 www.keenage.com/html/e_index.html
9 www.snomed.org
10 www.geneontology.org

For the purpose of the current work, we have been interested in using Yago for the following reasons (Suchanek et al., 2007):
• it covers a great amount of individuals (2 million NEs);
• it has a near-human accuracy of around 95%;
• it is built from WordNet and Wikipedia;
• it is connected with the SUMO ontology;
• it exists in many formats (XML, SQL, RDF, Notation 3, etc.) and is available with tools11 which facilitate exporting and querying it.

11 http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html

The Yago ontology contains two types of information: entities and facts. The former are NE instances (from Wikipedia) and concepts (from WordNet), whereas the latter are facts which set a relation between these entities. To our knowledge, Yago has been used as a semantic resource in the context of IR systems (Pound et al., 2009).

As we are interested in enriching the NE content of AWN, a translation stage has to be considered in our process. In (Al Khalifa and Rodriguez, 2009), the authors used the Arabic counterpart of the English Wikipedia pages as a translation technique. In the current work, we consider instead the Google Translation API12 (GTA) because its coverage of NEs written in Arabic is higher than that of the Arabic Wikipedia. In addition, translating a word using GTA is faster. Indeed, the result of a translation using the Arabic Wikipedia needs to be disambiguated, as many possible words are returned. This is not the case for the GTA.

12 http://code.google.com/p/google-api-translate-java/

The enrichment concerns both adding new individuals (NEs) and adding their supertypes. These supertypes are very important and useful in our QE process combined with the structure-based PR system (JIRS). In order to show this usefulness, let us consider the example of the TREC question "‫ و ون   ن ؟‬" (When was Lindon Johnson born?). When we query a search engine using this question, the two following passages could be returned:

Passage 1: ‫ و ه ام اي و‬1908  … ‫  ون ن‬ (The year 1908 which is the year of birth of Lindon Johnson ...)
Passage 2: ‫و ا  ا‬ ‫ ون‬ ‫ن  م‬ 27 $‫ أ‬1908 … (The American president Lindon Johnson was born on 27 August 1908 ...)

According to the two passages above, the JIRS system will consider the first passage as being the most relevant. Indeed, since the two passages contain the keywords of the question (‫ ون   ن‬،‫)و‬, the similarity of the structure of each passage to the one of the question is the criterion to be used to compare them. The second passage contains a structure similar to the question with two additional terms (which are not among the question keywords) whereas in the first passage only one additional term appears (fyh -  ). Therefore, the latter is considered more similar to the question than the former one. After enriching AWN by the NE ‫ ون   ن‬and its supertypes such as '() ‫( ر*" أ‬r}ys >mryky : US President), we can consider, in the query processed by JIRS, the extended form of the question where the NE is preceded by its supertype '() ,‫ا)*" ا‬. In this case, the two terms "*)‫ ا‬and '() ,‫ ا‬are considered as being among the question keywords. Hence, the structure of the second passage would then be considered by JIRS as the most similar to the structure of the question. The second passage is the one containing the expected answer in a structure which

can be easy to process by the answer extraction module. In order to enrich the NE content in AWN, we have adopted an approach composed of seven steps. Figure 1 below illustrates these steps.

Figure 1: Steps of our approach for the enrichment of the NE content in AWN

As we can see, for the purpose of the current paper, we are interested in the enrichment of the NE part of AWN for the not extensible questions (547 TREC and CLEF questions). In order to do so, our approach relies on the following steps:
(S1) For each considered question, we extract the NE in Arabic;
(S2) Using the GTA, we translate the NE into English (GTA performs a disambiguation process);
(S3) We extract the Yago entity related to the NE translated into English;
(S4) We extract the Yago facts related to the Yago entity extracted in the previous step;
(S5) In this step, we have a sub-release of Yago related to the considered questions;
(S6) Using the GTA, we translate the content (entities and facts) of the sub-release of Yago built in step five;
(S7) We perform a mapping between the NEs contained in the Arabic Yago (of step S4) and their related entries in AWN according to synonymy, hypernymy, hyponymy and SUMO relations.
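A compact Python sketch of this pipeline is given below. It is only an illustration of the seven steps under our own assumptions: the helper names (translate_to_english, yago_entity_for, yago_facts_for, translate_to_arabic, awn_synsets_for_wordnet_concept) are hypothetical placeholders and do not correspond to a real API.

    # Hypothetical sketch of steps S1-S7; the helper functions are placeholders
    # passed in by the caller, not real library calls.
    def enrich_awn(questions, translate_to_english, yago_entity_for, yago_facts_for,
                   translate_to_arabic, awn_synsets_for_wordnet_concept):
        new_links = []
        for question in questions:
            ne_ar = question["named_entity"]                 # S1: NE extracted from the question
            ne_en = translate_to_english(ne_ar)              # S2: GTA translation
            entity = yago_entity_for(ne_en)                  # S3: matching Yago entity
            if entity is None:
                continue
            facts = yago_facts_for(entity)                   # S4/S5: facts kept in the sub-release
            for (relation, argument) in facts:               # S6/S7: translate and map to AWN
                if relation == "TYPE":                       # TYPE links an entity to a WordNet concept
                    for synset in awn_synsets_for_wordnet_concept(argument):
                        new_links.append((translate_to_arabic(ne_en), synset))
        return new_links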
After performing these steps, we have an enriched release of AWN which we consider in our new experiments. The results obtained in the enrichment and experimental processes are described in the next section.

4. Preliminary Experiments

As we have mentioned in the previous section, our focus is devoted to the NEs which appear in the not extensible questions. The number of these questions is 547. There are some NEs which appear in many questions. The number of distinct NEs is 472.

After performing steps 3 and 4, 374 distinct NEs (79%) have been identified within the Yago ontology. A number of 59,747 facts concern the identified Yago entities, with an average of 160 facts per entity. The average confidence related to these facts is around 0.97 (the max is 1). The Yago ontology contains 96 relations. We have identified 43 relations in the facts corresponding to the NEs extracted from the considered questions. The TYPE relation is the first one to be considered in our approach for the enrichment of NEs in AWN. For the purpose of the current work, we have considered only the facts containing a TYPE relation between a Yago entity and a WordNet concept. From the 374 NEs identified in Yago, 204 of them (around 55%) have a TYPE relation with a WordNet concept.

Relying on these relations on the one hand and on the relation between the AWN synsets and the WordNet synsets on the other hand, we were able to connect 189 Yago entities (roughly 51% of the NEs of the considered questions) with the corresponding AWN synsets.

In order to connect the rest of the NEs (185) with the AWN synsets (102 distinct synsets), we have set, in the context of step S7 mentioned previously, different mappings between the relations used in the Yago facts and the corresponding AWN synsets. For instance, the second arguments of the relations “citizenOf”, “livesIn”, “bornIn”, “hasCapital” or “locatedIn” are candidate hyponyms of the AWN synset “mdynp” (city).

The enriched release of AWN that we have built using Yago helped us extend more questions and conduct preliminary experiments in the same way as (Abouenour et al., 2009a). Table 1 shows the obtained results.

Measures                        before Yago    Using Yago
Accuracy                        17,49%         23,53%
MRR                             7,98           9,59
Number of answered questions    23,15%         31,37%

Table 1: Results of preliminary experiments related to the non extensible questions.

As we can see, performances in terms of accuracy, MRR and the number of answered questions have been improved after using our semantic QE, which relies on the AWN release enriched with Yago.

5. Conclusion and Future Works

In this paper, we have proposed an approach to enrich AWN from the available content of the Yago ontology. The enrichment process was possible thanks to the connection existing between Yago entities and
WordNet on one hand and between WordNet and AWN on the other hand. In the preliminary experiments that we have conducted, we have considered the previous semantic QE approach, which relies now on the new content of AWN. These experiments show an improvement in terms of accuracy, MRR and the number of answered questions.

In the current work, we have considered only the relations of Yago which allow a direct mapping between its entities and the AWN synsets. Therefore, considering the other relations and the whole content of Yago is among the intended future works.

Acknowledgement

This research work is the result of the collaboration in the framework of the bilateral Spain-Morocco AECID-PCI C/026728/09 research project. The third author thanks also the TIN2009-13391-C04-03 research project.

References

Al Khalifa M. and Rodríguez H. 2009. “Automatically Extending NE coverage of Arabic WordNet using Wikipedia”. In Proc. of the 3rd International Conference on Arabic Language Processing CITALA2009, Rabat, Morocco, May 2009.
Abouenour L., Bouzoubaa K., Rosso P. 2009. “Three-level approach for Passage Retrieval in Arabic Question/Answering Systems”. In Proc. of the 3rd International Conference on Arabic Language Processing CITALA2009, Rabat, Morocco, May 2009.
Abouenour L., Bouzoubaa K., Rosso P. 2009. “Structure-based evaluation of an Arabic semantic Query Expansion using the JIRS Passage Retrieval system”. In: Proc. Workshop on Computational Approaches to Semitic Languages, EACL-2009, Athens, Greece.
Abouenour L., Bouzoubaa K., Rosso P. 2008. Improving Q/A Using Arabic Wordnet. In: Proc. The 2008 International Arab Conference on Information Technology (ACIT'2008), Tunisia, December.
Benajiba Y., Mona D., Rosso P. Using Language Independent and Language Specific Features to Enhance Arabic Named Entity Recognition. In: IEEE Transactions on Audio, Speech and Language Processing. Special Issue on Processing Morphologically Rich Languages, Vol. 17, No. 5, July 2009.
Brini W., Ellouze M., Hadrich Belguith L. 2009. QASAL: “Un système de question-réponse dédié pour les questions factuelles en langue Arabe”. In: 9ème Journées Scientifiques des Jeunes Chercheurs en Génie Electrique et Informatique, Tunisia. (in French).
El Amine M. A. 2009. Vers une interface pour l'enrichissement des requêtes en arabe dans un système de recherche d'information. In Proceedings of the 2nd Conférence Internationale sur l'informatique et ses Applications (CIIA'09), Saida, Algeria, May 3-4, 2009.
Elkateb S., Black W., Vossen P., Farwell D., Rodríguez H., Pease A., Alkhalifa M. 2006. “Arabic WordNet and the Challenges of Arabic”. In Proceedings of the Arabic NLP/MT Conference, London, U.K.
Etzioni O., Cafarella M. J., Downey D., Kok S., Popescu A.-M., Shaked T., Soderland S., Weld D. S., and Yates A. 2004. Web-scale information extraction in KnowItAll. In WWW, 2004.
Fellbaum C. 2000. “WordNet: An Electronic Lexical Database”. MIT Press, cogsci.princeton.edu/~wn, September 7.
Gerard D. M., Suchanek F. M., Pease A. Integrating YAGO into the Suggested Upper Merged Ontology. 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2008). Dayton, Ohio, USA (2008).
Gómez J. M., Rosso P., Sanchis E. 2007. Re-ranking of Yahoo snippets with the JIRS Passage Retrieval system. In: Proc. Workshop on Cross Lingual Information Access, CLIA-2007, 20th Int. Joint Conf. on Artificial Intelligence, IJCAI-07, Hyderabad, India, January 6-12.
Matuszek C., Cabral J., Witbrock M., and De Oliveira J. An introduction to the syntax and content of Cyc. In AAAI Spring Symposium, 2006.
Niles I., Pease A. 2003. “Linking Lexicons and Ontologies: Mapping WordNet to the Suggested Upper Merged Ontology.” In Proceedings of the 2003 International Conference on Information and Knowledge Engineering, Las Vegas, Nevada.
Pound J., Ihab F. I., and Weddell G. 2009. QUICK: Queries Using Inferred Concepts from Keywords. Technical Report CS-2009-18. Waterloo, Canada.
Rodríguez H., Farwell D., Farreres J., Bertran M., Alkhalifa M., Antonia Martí M., Black W., Elkateb S., Kirk J., Pease A., Vossen P., and Fellbaum C. 2008. Arabic WordNet: Current State and Future Extensions. In: Proceedings of the Fourth International GlobalWordNet Conference - GWC 2008, Szeged, Hungary, January 22-25, 2008.
Suchanek F. M., Kasneci G., Weikum G. YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In Proc. of the 16th WWW, pp. 697-706 (2007).

Light Morphology Processing for Amazighe Language
Fadoua Ataa Allah, Siham Boulaknadel
CEISIC, IRCAM
Avenue Allal El Fassi, Madinat Al Irfane, Rabat, Morocco
E-mail: {ataaallah, boulaknadel}@ircam.ma

Abstract
In the aim to allow the Amazighe language an automatic processing, and integration in the field of Information and Communication
Technology, we have opted in the Royal Institute of Amazighe Culture “IRCAM” for an innovative approach of progressive
realizations. Thus since 2003, researchers in the Computer Sciences Studies, Information Systems and Communications Center
“CEISIC” have paved the way for elaborating linguistic resources, basic natural language processing tools, and other advanced
scientific researches by encoding Tifinaghe script and developing typefaces.
In this context, this paper aims to develop a computational stemming process that reduces words to their stems. This process consists in splitting Amazighe words into a constituent stem part and affix parts without performing a complete morphological analysis. This light stemming approach conflates word variants into a common stem so that it can be used in natural language applications such as indexing, information retrieval systems, and classification.

1. Introduction

Stemming has been widely used in several fields of natural language processing such as data mining, information retrieval, machine translation, document summarisation, and text classification, in which the identification of lexical occurrences of words referring to some central idea or ‘meaning’ is involved. Indeed, the lexical analysis is mainly based on word occurrences, which require some form of morphological conflation that could range from removing affixes to using morphological word structures.

In the literature, many stemming algorithms have been published for different languages, such as English (Lovins, 1968; Porter, 1980), French (Savoy, 1993; Paternostre et al., 2002), and Arabic (Larkey et al., 2002; Taghva et al., 2005; Al-Shammari and Lin, 2008). In general, the stemmer structures vary considerably depending on the morphology of languages. For Indo-European languages, most basic techniques consist in removing suffixes, while for the Afro-Asiatic ones, these techniques are extended to stripping prefixes.

In practice, affixes may alter the meaning of words, so removing them could discard vital information. In the Indo-European languages, prefixes modify the word meaning, which makes their deletion unhelpful, while in the Afro-Asiatic languages the prefixes are also used to fit the word for its syntactic role. Thus, in this paper, we propose an Amazighe stemming algorithm that consists in removing the common inflectional morphemes placed at the beginning and/or the end of words.

The remainder of the paper is organized as follows: in Section 2, we give a brief description of the Moroccan standard Amazighe language. Then, in Section 3, we give an overview of the Amazighe language characteristics. In Section 4, we present our light stemming algorithm. Finally, Section 5 gives general conclusions and draws some perspectives.

2. Moroccan Standard Amazighe Language

The Amazighe language, known as Berber or Tamazight, is a branch of the Afro-Asiatic (Hamito-Semitic) language family. It covers the northern part of Africa, which extends from the Red Sea to the Canary Isles, and from the Niger in the Sahara to the Mediterranean Sea. In Morocco, this language is divided into three main regional varieties: Tarifite in the North, Tamazight in Central Morocco and the South-East, and Tachelhite in the South-West and the High Atlas. Even though 50% of the Moroccan population are Amazighe speakers, the Amazighe language was exclusively reserved for familial and informal domains (Boukous, 1995). However, in the last decade, this language has become institutional.

Since ancient times, the Amazighe language has had its own writing system, which was adapted by the Royal Institute of the Amazighe Culture (IRCAM) in 2003 to provide an adequate and usable standard alphabetic system called Tifinaghe-IRCAM. This system contains:
- 27 consonants including: the labials (ⴼ, ⴱ, ⵎ), the dentals (ⵜ, ⴷ, ⵟ, ⴹ, ⵏ, ⵔ, ⵕ, ⵍ), the alveolars (ⵙ, ⵣ, ⵚ, ⵥ), the palatals (ⵛ, ⵊ), the velars (ⴽ, ⴳ), the labiovelars (ⴽⵯ, ⴳⵯ), the uvulars (ⵇ, ⵅ, ⵖ), the pharyngeals (ⵃ, ⵄ) and the laryngeal (ⵀ);
- 2 semi-consonants: ⵢ and ⵡ;
- 4 vowels: three full vowels ⴰ, ⵉ, ⵓ and the neutral vowel (or schwa) ⴻ, which has a rather special status in Amazighe phonology.

Furthermore, the IRCAM has recommended the use of the international symbols for punctuation markers: “ ” (space), “.”, “,”, “;”, “:”, “?”, “!”, “…”; the standard numerals used in Morocco (0, 1, 2, 3, 4, 5, 6, 7, 8, 9); and the horizontal direction from left to right for Tifinaghe writing (Ameur et al., 2004).

3. Amazighe Language Characteristics

The purpose of this section is to give an overview of the morphological properties of the main syntactic categories, which are the noun, the verb, and the particles (Boukhris et al., 2008; Ameur et al., 2004).

3.1 Noun

In the Amazighe language, the noun is a lexical unit formed from a root and a pattern. It could occur in a simple form (argaz “argaz” the man), a compound form (buhyyuf “buhyyuf” the famine), or a derived one (amsawaä “amsawad” the communication). This unit varies in gender, number and case.
- Gender: Nouns are categorised by grammatical gender: masculine or feminine. Generally, the masculine begins with an initial vowel a “a”, i “i”, or u “u”, while the feminine, used also to form diminutives and singulatives, is marked with the circumfix t…t “t…t” (ampäaë “amh d ar” masc., tampäaët “tamh d ar t” fem. the student).
- Number: There are two types, singular and plural, and the plural has three forms. The external plural consists in changing the initial vowel and adding the suffix n or one of its variants in “in”, an “an”, yn “yn”, wn “wn”, awn “awn”, iwn “iwn”, tn “tn” (impäaën “imh d ar n” masc., timpäaëin “timh d ar in” fem. students). The broken plural involves a change in the vowels of the noun (adrar “adrar” mountain → idurar “idurar” mountains, tivmst “tiγmst” tooth → tivmas “tiγmas” teeth). The mixed plural is formed by the combination of a change of vowels and, sometimes, the use of the suffixation n (izi “izi” fly → izan “izan” flies, amggaru “amgguru” last → imggura “imggura” lasts).
- Case: Two cases are distinguished. The free case is unmarked, while the construct one involves a variation of the initial vowel (argaz “argaz” man → urgaz “urgaz”, tamvart “tamγart” woman → tmvart “tmγart”).

3.2 Verb

The verb, in Amazighe, has two forms: basic and derived forms. The basic form is composed of a root and a radical (ffv “ffγ” leave), while the derived one is based on the combination of a basic form and one of the following prefix morphemes: s/ss “s/ss”, tt “tt” and m/mm “m/mm” (ssufv “ssufγ” bring out). Whether basic or derived, the verb is conjugated in four aspects: aorist, imperfective, perfect, and negative perfect. Moreover, it is constructed using the same personal markers for each mood, as represented in Table 1.

3.3 Particles

In the Amazighe language, a particle is a function word that is assignable neither to the noun nor to the verb. It contains pronouns; conjunctions; prepositions; aspectual, orientation and negative particles; adverbs; and subordinates. Generally, particles are uninflected words. However, in Amazighe, some of these particles are flectional, such as the possessive and demonstrative pronouns (ta “ta” this (fem.) → tina “tina” these (fem.)).

4. Light Stemming Algorithm

Light stemming refers to a process of stripping off a small set of prefixes and/or suffixes, without trying to deal with infixes, or recognizing patterns and finding roots (Larkey, 2002). As a first edition of such work in the IRCAM, and given the lack of a large digital corpus, our method is based only on the composition of words, which in the Moroccan standard Amazighe language is usually formed as a sequence of prefix, core, and suffix. We are assuming that we are not making use of any stem dictionary or exception list. Our algorithm is merely based on an explicit list of prefixes and suffixes that need to be stripped in a certain order. This list is derived from the common inflectional morphemes of gender, number and case for nouns; personal markers, aspect and mood for verbs; and affix pronouns for kinship nouns and prepositions. The derivational morphemes are not included, in order to keep the semantic meaning of words. It is very reasonable to conflate the noun tarbat “tarbat” girl with its masculine form arba “arba” boy, while it seems unreasonable, for some applications like information retrieval, to conflate the derived verb ssufv “ssufγ” bring out with the simple form ffv “ffγ” leave.

The set of prefixes and suffixes that we have identified is classified into five groups, ranging from one character to five characters.

4.1 Prefix Set
- One-character: a, i, n, u, t.
- Two-character: na, ni, nu, ta, ti, tu, tt, wa, wu, ya, yi, yu.
- Three-character: itt, ntt, tta, tti.
- Four-character: itta, itti, ntta, ntti, tett.
- Five-character: tetta, tetti.

4.2 Suffix Set
- One-character: a, d, i, k, m, n, v, s, t.
- Two-character: an, at, id, im, in, iv, mt, nv, nt, un, sn, tn, wm, wn, yn.
- Three-character: amt, ant, awn, imt, int, iwn, nin, unt, tin, tnv, tun, tsn, snt, wmt.
- Four-character: tunt, tsnt.
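Below is a rough Python sketch of this light stemming procedure based on the affix lists above. It is only an illustration under our own assumptions (longest-match stripping of a single prefix and a single suffix, and a minimum remaining stem length of three characters); the exact ordering constraints applied by the authors are not fully spelled out here, and the affixes are written in the Latin transliteration used in this paper.

    # Illustrative light stemmer; assumptions: longest-match, one prefix
    # and one suffix stripped, minimum remaining stem length of 3.
    PREFIXES = ["tetta", "tetti",
                "itta", "itti", "ntta", "ntti", "tett",
                "itt", "ntt", "tta", "tti",
                "na", "ni", "nu", "ta", "ti", "tu", "tt", "wa", "wu", "ya", "yi", "yu",
                "a", "i", "n", "u", "t"]
    SUFFIXES = ["tunt", "tsnt",
                "amt", "ant", "awn", "imt", "int", "iwn", "nin", "unt", "tin",
                "tnv", "tun", "tsn", "snt", "wmt",
                "an", "at", "id", "im", "in", "iv", "mt", "nv", "nt", "un", "sn",
                "tn", "wm", "wn", "yn",
                "a", "d", "i", "k", "m", "n", "v", "s", "t"]

    def light_stem(word, min_stem=3):
        """Strip at most one prefix and one suffix, trying longer affixes first."""
        stem = word
        for p in PREFIXES:                    # lists are ordered longest-first
            if stem.startswith(p) and len(stem) - len(p) >= min_stem:
                stem = stem[len(p):]
                break
        for s in SUFFIXES:
            if stem.endswith(s) and len(stem) - len(s) >= min_stem:
                stem = stem[:-len(s)]
                break
        return stem

    print(light_stem("tarbat"))   # 'rba' under these illustrative settings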
Singular  Indicative: 1st pers. ... ⵖ (masc./fem.); 2nd pers. ⵜ ... ⴷ (masc./fem.); 3rd pers. masc. ⵉ ..., fem. ⵜ ...
          Imperative: 2nd pers. ... Ø (masc./fem.)
          Participial: i….n (masc./fem.)
Plural    Indicative: 1st pers. ⵏ ... (masc./fem.); 2nd pers. masc. ⵜ ... ⵎ, fem. ⵜ ... ⵎⵜ; 3rd pers. masc. ... ⵏ, fem. ... ⵏⵜ
          Imperative: 2nd pers. masc. ... ⴰⵜ/ⵜ/ⵎ, fem. ... ⴰⵎⵜ/ⵎⵜ
          Participial: ….nin (masc./fem.)

Table 1: Personal markers for the indicative, imperative and participial moods

Based on this list of affixes and on theoretical analysis, we notice that the proposed Amazighe light stemmer could make two kinds of errors:
- Understemming errors, in which words referring to the same concept are not reduced to the same stem, as in the case of the verb ffv “ffγ” leave, which ends with the character v “γ”, coinciding with the 1st singular personal marker. So the stem ffv “ffγ” of the verb conjugated in the perfect aspect for the 1st singular person, ffvv “ffγγ” I left, will not be conflated with the stem ff “ff” of the 3rd singular masculine person, iffv “iffγ” he left.
- Overstemming errors, in which words are converted to the same stem even though they refer to distinct concepts, as in the example of the verb g “g” do and the noun aga “aga” bucket. The stem g “g” of the verb conjugated in the perfect aspect for the 3rd singular masculine person, iga “iga” he did, will be conflated with the stem g “g” of the noun aga “aga”.
In general, light stemmers avoid overstemming errors, especially for the Indo-European languages; however, this is not the case for the Amazighe language. This proves that the Amazighe language constitutes a significant challenge for natural language processing.

5. Conclusion

Stemming is an important technique for a highly inflected language such as Amazighe. In this work, we have investigated the Amazighe language characteristics and have presented a light stemming approach for Amazighe. We should note that the proposed stemming algorithm is primarily for handling inflections – it does not handle derivational suffixes, for which one would need a proper morphological analyzer. In an attempt to improve the Amazighe light stemmer, we plan to build a stem dictionary, to elaborate a set of linguistic rules, and to set a list of exceptions to further extend the stemmer.

6. Appendix

Tifinaghe   Latin Correspondence   Tifinaghe   Latin Correspondence
ⴰ           a                      ⵍ           l
ⴱ           b                      ⵎ           m
ⴳ           g                      ⵏ           n
ⴳⵯ          gw                     ⵓ           u
ⴷ           d                      ⵔ           r
ⴹ           d                      ⵕ           r
ⴻ           e                      ⵖ           γ
ⴼ           f                      ⵙ           s
ⴽ           k                      ⵚ           s
ⴽⵯ          kw                     ⵛ           c
ⵀ           h                      ⵜ           t
ⵃ           h                      ⵟ           t
ⵄ           ε                      ⵡ           w
ⵅ           x                      ⵢ           y
ⵇ           q                      ⵣ           z
ⵉ           i                      ⵥ           z
ⵊ           j

Table 2: Tifinaghe-Ircam Alphabet

7. References

Al-Shammari, E. T., Lin, J. (2008). Towards an error-free Arabic stemming. In Proceedings of the 2nd ACM workshop on improving non English web searching. pp. 9--16.
Ameur, M., Bouhjar, A., Boukhris, F., Boukouss, A., Boumalk, A., Elmedlaoui, M., Iazzi, E. M., Souifi, H. (2004). Initiation à la langue amazighe. Rabat: IRCAM.
Boukhris, F., Boumalk, A., Elmoujahid, E., Souifi, H. (2008). La nouvelle grammaire de l'amazighe. Rabat: IRCAM.
Larkey, L. S., Ballesteros, L., Connell, M. (2002). Improving Stemming for Arabic Information Retrieval: Light Stemming and Cooccurrence Analysis. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval. Tampere, Finland, pp. 275--282.

Lovins, J. B. (1968). Development of a stemming
algorithm. Mechanical Translation and Computational
Linguistics, 11(1), pp. 22--31.
Paternostre, M., Francq, P., Lamoral, J., Wartel, D.,
Saerens, M. (2002). Carry, un algorithme de
désuffixation pour le français. Rapport technique du
projet Galilei.
Porter, M.F. (1980). An algorithm for suffix stripping.
Program, 14(3), pp.130--137.
Savoy, J. (1993). Stemming of French words based on
grammatical categories. Journal of the American
Society for Information Science, 44(1), pp.1--9.
Taghva, K., Elkhoury, R., Coombs, J. (2005). Arabic
stemming without a root dictionary. In Proceeding of
Information Technology: Coding and Computing. Las
Vegas, pp.152--157.

Using Mechanical Turk to Create a Corpus of Arabic Summaries
Mahmoud El-Haj, Udo Kruschwitz, Chris Fox
School of Computer Science and Electronic Engineering
University of Essex
Colchester, CO4 3SQ
United Kingdom
{melhaj,udo,foxcj}@essex.ac.uk
Abstract
This paper describes the creation of a human-generated corpus of extractive Arabic summaries of a selection of Wikipedia and Arabic
newspaper articles using Mechanical Turk—an online workforce. The purpose of this exercise was two-fold. First, it addresses a shortage
of relevant data for Arabic natural language processing. Second, it demonstrates the application of Mechanical Turk to the problem of
creating natural language resources. The paper also reports on a number of evaluations we have performed to compare the collected
summaries against results obtained from a variety of automatic summarisation systems.

1. Motivation

The volume of information available on the Web is increasing rapidly. The need for systems that can automatically summarise documents is becoming ever more desirable. For this reason, text summarisation has quickly grown into a major research area as illustrated by the Text Analysis Conference (TAC) and the Document Understanding Conference (DUC) series.

We are interested in the automatic summarisation of Arabic documents. Research in Arabic is receiving growing attention but it has widely been acknowledged that apart from a few notable exceptions—such as the Arabic Penn Treebank1 and the Prague Arabic Dependency Treebank2—there are few publicly available tools and resources for Arabic NLP, such as Arabic corpora, lexicons and machine-readable dictionaries, resources that are common in other languages (Diab et al., 2007) although this has started to change in recent years (Maegaard et al., 2008; Alghamdi et al., 2009). Some reasons for this lack of resources may be due to the complex morphology, the absence of diacritics (vowels) in written text and the fact that Arabic does not use capitalisation. Tools and resources however are essential to advance research in Arabic NLP. In the case of summarisation tasks, most of the activity is concerned with the English language—as with TAC and DUC. This focus is reflected in the availability of resources: in particular, there is no readily available “gold standard” for evaluating Arabic summarisers.

1 http://www.ircs.upenn.edu/arabic/
2 http://ufal.mff.cuni.cz/padt/PADT 1.0/

Tools and resources are essential to advance research in Arabic NLP, but generating them with traditional techniques is both costly and time-consuming. It is for this reason that we considered using Amazon's Mechanical Turk3—an online marketplace for work that requires human intelligence—to generate our own reference standard for extractive summaries.

3 http://www.mturk.com

2. Related Work

There are various approaches to text summarisation, some of which have been around for more than 50 years (Luhn, 1958). These approaches include single-document and multi-document summarisation. One of the techniques of single-document summarisation is summarisation through extraction. This relies on the idea of extracting what appear to be the most important or significant units of information from a document and then combining these units to generate a summary. The extracted units differ from one system to another. Most of the systems use sentences as units while others work with larger units such as paragraphs.

Evaluating the quality and consistency of a generated summary has proven to be a difficult problem (Fiszman et al., 2009). This is mainly because there is no obvious ideal summary. The use of various models for system evaluation may help in solving this problem. Automatic evaluation metrics such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2001) have been shown to correlate well with human evaluations for content match in text summarisation and machine translation (Liu and Liu, 2008; Hobson et al., 2007, for example). Other commonly used evaluations include measuring information by testing readers' understanding of automatically generated summaries.

This very brief review of related work should serve as a motivation for the corpus of Arabic summaries that we have produced for the Arabic NLP community. Our decision to use the Mechanical Turk platform is justified by the fact that it has already been shown to be effective for a variety of NLP tasks achieving expert quality (Snow et al., 2008; Callison-Burch, 2009, for example).

3. The Document Collection

The document collection used in the development of the resource was extracted from the Arabic language version of Wikipedia4 and two Arabic newspapers: Alrai5 from Jordan and Alwatan6 from Saudi Arabia. These sources were chosen for the following reasons.

4 http://www.wikipedia.com
5 http://www.alrai.com
6 http://www.alwatan.com.sa
1. They contain real text as would be written and used by native speakers of Arabic.

2. They are written by many authors from different backgrounds.

3. They cover a range of topics from different subject areas (such as politics, economics, and sports), each with a credible amount of data.

The Wikipedia documents were selected by asking a group of students to search the Wikipedia website for arbitrary topics of their choice within given subject areas. The subject areas were: art and music; the environment; politics; sports; health; finance and insurance; science and technology; tourism; religion; and education. To obtain a more uniform distribution of articles across topics, the collection was then supplemented with newspaper articles that were retrieved from a bespoke information retrieval system using the same queries as were used for selecting the Wikipedia articles. Each document contains on average 380 words.

4. The Human-Generated Summaries

The corpus of extractive document summaries was generated using Mechanical Turk. The documents were published as “Human Intelligence Tasks” (HITs). The assessors (workers) were asked to read and summarise a given article (one article per task) by selecting what they considered to be the most significant sentences that should make up the extractive summary. They were required to select no more than half of the sentences in the article. Using this method, five summaries were created for each article in the collection. Each of the summaries for a given article was generated by a different worker.

In order to verify that the workers were properly engaged with the articles, and provide a measure of quality assurance, each worker was asked to provide up to three keywords as an indicator that they read the article and did not select random sentences. In some cases where a worker appeared to select random sentences, the summary is still considered as part of the corpus to avoid the risk of subjective bias.

The primary output of this project is this corpus of 765 human-generated summaries that we obtained, which is now available to the community.7 To set the results in context, and illustrate its use, we also conducted a number of evaluations.

7 http://privatewww.essex.ac.uk/~melhaj/easc.htm

5. Evaluations

To illustrate the use of the human-generated summaries from Mechanical Turk in the evaluation of automatic summarisation, we created extractive summaries of the same set of documents using a number of systems, namely:

Sakhr: an online Arabic summariser.8

8 http://www.sakhr.com

AQBTSS: a query-based document summariser based on the vector space model that takes an Arabic document and a query (in this case the document's title) and returns an extractive summary (El-Haj and Hammo, 2008; El-Haj et al., 2009).

Gen-Summ: similar to AQBTSS except that the query is replaced by the document's first sentence.

LSA-Summ: similar to Gen-Summ, but where the vector space is transformed and reduced by applying Latent Semantic Analysis (LSA) to both document and query (Dumais et al., 1988).

Baseline-1: the first sentence of a document.

The justification for selecting the first sentence in Baseline-1 is the belief that in Wikipedia and news articles the first sentence tends to contain information about the content of the entire article, and is often included in extractive summaries generated by more sophisticated approaches (Baxendale, 1958; Yeh et al., 2008; Fattah and Ren, 2008; Katragadda et al., 2009).

When using Mechanical Turk on other NLP tasks, it has been shown that aggregation of multiple independent annotations from non-experts can approximate expert judgement (Snow et al., 2008; Callison-Burch, 2009; Albakour et al., 2010, for example). For this reason, we evaluated the results of the systems not with the raw results of Mechanical Turk, but with derived gold standard summaries, generated by further processing and analysis of the human-generated summaries.

The aggregation of the summaries can be done in a number of ways. To obtain a better understanding of the impact of the aggregation method on the results of the evaluation, we constructed three different gold standard summaries for each document. First of all we selected all those sentences identified by at least three of the five annotators (we call this the Level 3 summary). We also created a similar summary which includes all sentences that have been identified by at least two annotators (called Level 2). Finally, each document has a third summary that contains all sentences identified by any of the annotators for this document (called All). This last kind of summary will typically contain outlier sentences. For this reason, only the first two kinds of aggregated summaries (Level 2 and Level 3) should really be viewed as providing genuine gold standards. The third one (All) is considered here just for the purposes of providing a comparison.
evaluations.
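To make the aggregation step concrete, the sketch below derives the three gold standards from the five annotators' sentence selections. This is an illustration only, not the authors' actual scripts; the function and variable names are hypothetical.

```python
from collections import Counter

def aggregate_gold_standards(annotator_selections):
    """Derive the All, Level 2 and Level 3 gold-standard summaries for one
    article from the sentence selections of the individual annotators.

    annotator_selections: a list of sets of sentence indices,
    one set per Mechanical Turk worker (five per article in this corpus).
    """
    votes = Counter(idx for selection in annotator_selections for idx in selection)
    return {
        "All":     sorted(idx for idx, n in votes.items() if n >= 1),
        "Level 2": sorted(idx for idx, n in votes.items() if n >= 2),
        "Level 3": sorted(idx for idx, n in votes.items() if n >= 3),
    }

# Example: five workers marked these sentence indices for one article.
workers = [{0, 2, 5}, {0, 3}, {0, 2}, {1, 2}, {0, 5}]
print(aggregate_gold_standards(workers))
# {'All': [0, 1, 2, 3, 5], 'Level 2': [0, 2, 5], 'Level 3': [0, 2]}
```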
A variety of evaluation methods have been developed for summarisation systems. As we are concerned with extractive summaries, we will concentrate on results obtained from applying Dice's coefficient (Manning and Schütze, 1999), although we will discuss briefly results from N-gram and substring-based methods ROUGE (Lin, 2004) and AutoSummENG (Giannakopoulos et al., 2008).

5.1. Dice's Coefficient

We used Dice's coefficient to judge the similarity of the sentence selections in the gold-standard extractive summaries — derived from the human-generated, Mechanical Turk summaries — with those generated by Sakhr, AQBTSS, Gen-Summ, LSA-Summ and Baseline-1 (Table 1).
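A minimal sketch of this sentence-level comparison, assuming each summary is represented simply as the set of selected sentence indices (the names are illustrative, not part of the released resource):

```python
def dice_coefficient(summary_a, summary_b):
    """Dice's coefficient between two extractive summaries of the same
    document, each given as a set of selected sentence indices."""
    if not summary_a and not summary_b:
        return 1.0
    return 2.0 * len(summary_a & summary_b) / (len(summary_a) + len(summary_b))

# Example: a system summary compared against a Level 2 gold standard.
print(round(dice_coefficient({0, 2, 4}, {0, 2, 5}), 2))  # 0.67
```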
7 http://privatewww.essex.ac.uk/~melhaj/easc.htm
8 http://www.sakhr.com
Sakhr AQBTSS Gen-Summ LSA-Summ Baseline-1
All 39.07% 32.80% 39.51% 39.23% 25.34%
Level 2 48.49% 39.90% 48.95% 50.09% 26.84%
Level 3 43.40% 38.86% 43.39% 42.67% 40.86%

Table 1: Dice results: systems versus MTurk-derived gold standards.

Sakhr AQBTSS LSA-Summ Gen-Summ Baseline-1


Sakhr — 51.09% 58.77% 58.82% 38.11%
AQBTSS 51.09% — 54.61% 58.48% 47.86%
LSA-Summ 58.77% 54.61% — 84.70% 34.66%
Gen-Summ 58.82% 58.48% 84.70% — 34.99%

Table 2: Dice results: comparing systems.

Statistically significant differences can be observed in a number of cases, but we will concentrate on some more general observations.
We observe that the commercial system Sakhr, as well as the systems that build a summary around the first sentence, most closely approximate the gold standards, i.e. Level 2 and Level 3. This is perhaps not surprising, as overlap with the document's first sentence has been shown to be a significant feature in many summarisers (Yeh et al., 2008; Fattah and Ren, 2008).
It is interesting to note that summaries consisting of a single sentence only (i.e. Baseline-1) do not score particularly well. That suggests that the first sentence is important but not sufficient for a good summary. When comparing Baseline-1 with the Level 2 and Level 3 summaries, respectively, we also note how the "wisdom of the crowd" seems to converge on the first sentence as a core part of the summary.
Finally, the system that most closely approximates our Level 2 gold standard uses LSA, a method shown to work effectively in various NLP and IR tasks including summarisation, e.g. (Steinberger and Ježek, 2004; Gong and Liu, 2001).
We also compared the baseline systems with each other (Table 2), to get an idea of how closely the summaries each of these systems produce correlate with each other. The results suggest that the system that extracts the first sentence only does not correlate well with any of the other systems. At the same time, we observe that Gen-Summ and LSA-Summ generate summaries that are highly correlated. This explains the close similarity when comparing each of these systems against the gold standards (see Table 1). It also demonstrates (not surprisingly) that the difference between a standard vector space approach and LSA is not great for the relatively short documents in a collection of limited size.

5.2. Other Evaluation Methods

In addition to using Dice's coefficient, we also applied the ROUGE (Lin, 2004) and AutoSummENG (Giannakopoulos et al., 2008) evaluation methods.
In our experiments with AutoSummENG we obtained values for "CharGraphValue" in the range 0.516–0.586. This indicates how much the graph representation of a model summary overlaps with a given peer summary, taking into account how many times two N-grams are found to be neighbours. Gen-Summ and LSA-Summ gave the highest values, indicating that they produce results more similar to our gold standard summaries than what Sakhr and AQBTSS produced.
When applying ROUGE we considered the results of ROUGE-2, ROUGE-L, ROUGE-W, and ROUGE-S, which have been shown to work well in single document summarisation tasks (Lin, 2004). In line with the results discussed above, LSA-Summ and Gen-Summ performed better on average than the other systems in terms of recall, precision and F-measure (when using the Level 2 and Level 3 summaries as our gold standards). The remaining systems all performed better than Baseline-1.
These results should only be taken to be indicative. Dice's coefficient appears to be a better method for extractive summaries as we are comparing summaries at the sentence level. It is however worth noting that the main results obtained from Dice's coefficient are in line with the results from ROUGE and AutoSummENG.

6. Conclusions and Future Work

We have demonstrated how gold-standard summaries can be extracted using the "wisdom of the crowd".
Using Mechanical Turk has allowed us to produce a resource for evaluating Arabic extractive summarisation techniques at relatively low cost. This resource is now available to the community. It will provide a useful benchmark for those developing Arabic summarisation tools. The aim of the work described here was to create a relatively small but usable resource. We provided some comparison with alternative summarisation systems for Arabic. We have deliberately made no attempt at judging the individual quality of each system. How this resource will be used and how effectively it can be applied remains a matter for the users of this corpus.

7. References

M-D. Albakour, U. Kruschwitz, and S. Lucas. 2010. Sentence-level attachment prediction. In Proceedings of the 1st Information Retrieval Facility Conference, Lecture Notes in Computer Science 6107, Vienna. Springer.
M. Alghamdi, M. Chafic, and M. Mohamed. 2009. Arabic language resources and tools for speech and natural language: KACST and Balamand. In 2nd International Conference on Arabic Language Resources & Tools, Cairo, Egypt.
P. B. Baxendale. 1958. Machine-made index for technical literature—an experiment. IBM Journal of Research and Development, 2.
C. Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 286–295. Association for Computational Linguistics.
M. Diab, K. Hacioglu, and D. Jurafsky. 2007. Automatic Processing of Modern Standard Arabic Text. In A. Soudi, A. van den Bosch, and G. Neumann, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, Text, Speech and Language Technology, pages 159–179. Springer Netherlands.
S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. 1988. Using latent semantic analysis to improve access to textual information. In CHI '88: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 281–285. ACM.
M. El-Haj and B. Hammo. 2008. Evaluation of query-based Arabic text summarization system. In Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE'08, pages 1–7, Beijing, China. IEEE Computer Society.
M. El-Haj, U. Kruschwitz, and C. Fox. 2009. Experimenting with Automatic Text Summarization for Arabic. In Proceedings of the 4th Language and Technology Conference (LTC'09), pages 365–369, Poznań, Poland.
M. A. Fattah and Fuji Ren. 2008. Automatic text summarization. In Proceedings of World Academy of Science, volume 27, pages 192–195. World Academy of Science.
M. Fiszman, D. Demner-Fushman, H. Kilicoglu, and T. C. Rindflesch. 2009. Automatic summarization of MEDLINE citations for evidence-based medical treatment: A topic-oriented evaluation. Journal of Biomedical Informatics, 42(5):801–813.
G. Giannakopoulos, V. Karkaletsis, G. Vouros, and P. Stamatopoulos. 2008. Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP), 5(3):1–39.
Y. Gong and X. Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–25. ACM.
S. P. Hobson, B. J. Dorr, C. Monz, and R. Schwartz. 2007. Task-based evaluation of text summarization using relevance prediction. Information Processing & Management, 43(6):1482–1499.
R. Katragadda, P. Pingali, and V. Varma. 2009. Sentence position revisited: a robust light-weight update summarization 'baseline' algorithm. In CLIAWS3 '09: Proceedings of the Third International Workshop on Cross Lingual Information Access, pages 46–52, Morristown, NJ, USA. ACL.
C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pages 25–26.
F. Liu and Y. Liu. 2008. Correlation between ROUGE and human evaluation of extractive meeting summaries. In HLT '08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 201–204. ACL.
H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165.
B. Maegaard, M. Atiyya, K. Choukri, S. Krauwer, C. Mokbel, and M. Yaseen. 2008. MEDAR: Collaboration between European and Mediterranean Arabic partners to support the development of language technology for Arabic. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco.
C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02). Association for Computational Linguistics.
R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. 2008. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.
J. Steinberger and K. Ježek. 2004. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings of the 5th International Conference on Information Systems Implementation and Modelling (ISIM), pages 93–100.
J.-Y. Yeh, H.-R. Ke, and W.-P. Yang. 2008. iSpreadRank: Ranking sentences for extraction-based summarization using feature weight propagation in the sentence similarity network. Expert Systems with Applications, 35(3):1451–1462.
DefArabicQA: Arabic Definition Question Answering System
Omar Trigui (1), Lamia Hadrich Belguith (1), Paolo Rosso (2)
(1) ANLP Research Group - MIRACL Laboratory, University of Sfax, Tunisia
(2) Natural Language Engineering Lab. - ELiRF, Universidad Politécnica de Valencia, Spain
{omar.trigui,l.belguith}@fsegs.rnu.tn, prosso@dsic.upv.es

Abstract
Today the Web is the largest resource of knowledge, which can make it difficult to find precise information. Current search engines can only return ranked snippets containing the effective answers to a user query, but they cannot return the exact answers. Question Answering systems address this problem by providing effective and exact answers to user questions asked in natural language instead of keyword queries. Unfortunately, the Question Answering task for the Arabic language has not been investigated enough in the last decade, compared to other languages. In this paper, we tackle the definition Question Answering task for the Arabic language. We propose an Arabic definitional Question Answering system based on a pattern approach to identify exact and accurate definitions about organizations using Web resources. We experimented with this system using 2000 snippets returned by the Google search engine and the Arabic version of Wikipedia, and a set of 50 organization definition questions. The obtained results are very encouraging: 90% of the questions used have complete (vital) definitions in the top-five answers and 64% of them have complete definitions in the top-one answer. The MRR was 0.81.

NTCIR3). An analysis of the TREC QA task


1 Introduction experiments shows that two kinds of questions are
Definition questions of the type ‘What is X?’ is mainly involved: factual and definition questions. A
frequently asked on the Web. This type of question factual question is a simple fact retrieval where the
is generally asked for information about answer is often a named entity (e.g. ‘Who is the
organization or thing. Generally, dictionaries and president of the League of Arab States?’). Whereas
encyclopaedias are the best resources for this type a definition question is a question asking for any
of answers. However, these resources often do not important information about someone or something
contain the last information about a specific (e.g., ‘What is the League of Arab States?’).
organization or do not yet contain a definition of a Unfortunately, the evaluation platforms of QA task
new organization due to non instantaneous update. in the mainly evaluation conferences do not include
Thus, the user has the habit to look for a definition the Arabic language. To our knowledge, no research
from searching the Web. Our research takes place in has been done on Arabic definitional QA systems.
this context to make easy the obtaining of the However, there are some attempts to build factual
organization definition from Web resources. In this QA systems (e.g. Hammo et al.,2002; Benajiba et
paper, we present a definitional Question al.,2007a; Brini et al.,2009). We cited below an
Answering (QA) system for the Arabic language overview of these factual Question Answering
called DefArabicQA. This system outperforms the systems. (Hammo et al., 2002; 2004) developed
use of Web searching by two criteria: (i) permits to QARAB a factual QA system. They employed
ask by an ordinary question (e.g., ‘What is X?’) information retrieval techniques to identify
instead of asking by keywords query; (ii) returns an candidate passages, and sophisticated natural
accurate answer instead of mining the Web language processing techniques to parse the
searching results in order to find the expected question and the top 10 ranked passages. They
information. adopted a keyword matching strategy to identify
The paper is organized as follows: Section 2 answers. The answer identified is the whole
provides an overview of the Arabic QA systems. sentence matching the question keywords. The
Section 3 presents our definitional QA system evaluation process of this system was based on 113
DefArabicQA. Section 4 presents the realized questions and a set of documents collected from the
experiments and Section 5 discusses the obtained newspaper Al-Raya. They obtained a precision
results. A conclusion and some future directions for equal to 97.3%, recall equal to 97.3% and MRR
our work are exposed in Section 6. equal to 0.86 (Hammo et al.,2004). The average
length of the answers obtained was 31 words.
(Kanaan et al.,2004) developed a QA system using
2 Related works
approximately the same method of (Hammo et
QA systems are designed to retrieve the exact
answers from a set of knowledge resources to the 2
user question. Many researches are interested in this Cross-Language Evaluation Forum http://clef-
task in many competitions (e.g., TREC1, CLEF2 and campaign.org/
3
NII Test Collection for IR Systems
1
Text Retrieval Conference http://trec.nist.gov/ http://research.nii.ac.jp/ntcir/

al.,2002) system’s. Their evaluation was based on a The architecture of the DefArabicQA system is
set of 25 documents from the Web and 12 illustrated in Figure 1. From a general viewpoint,
questions. (Benajiba et al.,2007a) developed the system is composed of the following
‘ArabiQA’ a factual QA system. They employed components: i) question analysis, ii) passage
Arabic-JIRS4 (Benajiba et al.,2007b), a passage retrieval, iii) definition extraction and iv) ranking
retrieval system to search the relevant passages. candidate definitions.
They used also the named entity system ANERsys This system does not use any sophisticated
(Benajiba et al.,2007c) to identify and classify syntactic or semantic techniques, as those used for
named entities within the passages retrieved. The factual QA systems (Hammo et al.,2002; Benajiba
test-set consists of 200 questions and 11,000 et al.,2007).
documents from Wikipedia Arabic version. They
reached a precision of 83.3% (Benajiba et 3. 1 Question analysis
al.,2007a). (Brini et al.,2009) developed a prototype This module is a vital component of DefArabicQA.
to build an Arabic factual Question Answering The result of this module is the identification of the
system using Nooj platform5 to identify answers topic question (i.e., named entity) and the
from a set of education books. Most of these dedication of the answer type expected. The
researches cited above, have not made test-bed question topic is identified by using two lexical
publicly available, which makes it impossible to question patterns (Table. 1) and the answer type
compare their evaluation results. expected is deduced from the interrogative pronoun
As we have already said, there is not a research of the question.
focused on definitional QA systems for the Arabic
language. Therefore, we have considered that an
Expected answer
effort needs to be done in this direction. We built an Question patterns
types
Arabic QA system, which we named DefArabicQA
Who+be+<topic> ? ‫من ھو| من ھي >الموضوع<؟‬ Person
that identifies and extracts the answers (i.e., exact
definitions) from Web resources. Our approach is What+be+<topic> ? ‫ما ھو| ما ھي> الموضوع<؟‬ Organization
inspired from researches that have obtained good
results in TREC experiments. Among these Table 1. Question patterns and their expected
researches we cite the work of (Grunfeld & Kwok, answer types used by DefArabicQA system
2006) which is based on techniques from IR,
pattern matching and metakeyword detection with 3.2 Passage retrieval
little linguistic analysis and no natural language The passage retrieval module collects the top-n
understanding. snippets retrieved by the Web search engine. This
specific query is constituted of the question topic
3 The DefArabicQA system which is identified by the question analysis module.
After collecting the top-n snippets, only those
snippets containing the integrate question topic are
kept on the basis of some heuristic (e.g. length of a
snippet must be more than 13 characters).

3.3 Definition extraction


This module is the core module of a definitional
QA system and it is composed of two sub-modules
that are in charge of: i) identifying candidate
definitions, and ii) filtering candidate definitions.

3.3.1. Identifying candidate definitions


In this step, we identify and extract candidate
definitions from the collection of snippets collected
in the passage retrieval module. We use lexical
patterns to identify these candidate definitions.
Generally, a lexical pattern is a sequence of strings
(e.g., words, letters and punctuation symbols)
which provide a context to identify the exact
Figure. 1. Architecture of DefArabicQA answers. It reflects a common use of written styles
used to introduce an organization.
4
In our context, patterns are created manually and no
http://sourceforge.net/projects/jirs/ natural language processing is employed in their
5
http://www.nooj4nlp.net/pages/nooj.html
construction. A candidate definition is identified by
a specific pattern if the surrounding of the question

topic in a snippet is recognized by a specific pattern.

3.3.2. Filtering candidate definitions

We use heuristic rules to filter the identified candidate definitions. These heuristic rules are deduced from the observation of a set of annotated candidate definitions (i.e., a collection of candidate definitions divided into incorrect candidate definitions and correct candidate definitions).

3.4 Definition ranking

The "definition ranking" component is based on a statistical approach. We use a global score to rank the candidate definitions retained by the "Definition extraction" module. This global score is a combination of three scores related to three criteria of a candidate definition: i) pattern weight criterion, ii) snippet position criterion, and iii) word frequency criterion. We present to the user the top-5 candidate definitions ranked according to their global scores.

3.4.1. Pattern weight criterion (C1)

The score of this criterion is the weight of the pattern that identified the candidate definition CD_i. This score is represented by:

C_1(CD_i) = w_i    (1)

where w_i is the weight of pattern i. We associate a weight with each pattern according to its relevance.

3.4.2. Snippet position criterion (C2)

The score of this criterion represents the position of the snippet that contains the candidate definition (in the snippet collection). This score is given by:

C_2(CD_i) = p_i    (2)

where p_i is the position of the snippet containing the candidate definition CD_i.

3.4.3. Word frequency criterion (C3)

The score of this criterion represents the sum of the frequencies of the words occurring in a candidate definition. According to this criterion, the candidate definition CD_i score is calculated as follows. First, we construct a centroid vector containing the common words across candidate definitions with their frequencies, excluding stopwords. Second, we calculate the frequency sum of the words occurring in both CD_i and the centroid vector, as indicated by the following formula:

C_3(CD_i) = \sum_{k=1}^{n} f_{ik}    (3)

where n is the number of words which occur in both the centroid vector and the candidate definition CD_i, 1 <= k <= n, and f_{ik} is the frequency of word_k.

3.4.4. Criteria aggregation

In order to aggregate the three criteria described above, we first normalize the score of each criterion by dividing it by its maximum score, as follows:

C'_{i,j} = C_{i,j} / \max C_j    (4)

where i is a candidate definition and j a criterion. Then, we combine the three normalized scores in order to obtain the global score GS of the candidate definition CD_i. This global score is obtained by:

GS(CD_i) = \sum_{j=1}^{3} C'_{i,j}    (5)
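The following sketch shows how equations (4) and (5) can be combined to rank candidates. It is illustrative code only; the data structures and function name are assumptions, not the authors' implementation.

```python
def rank_candidate_definitions(candidates, top_n=5):
    """candidates: list of (definition_text, [c1, c2, c3]) pairs, where the
    three raw scores follow the pattern weight, snippet position and word
    frequency criteria described above.

    Applies the per-criterion normalization of eq. (4) and the summation of
    eq. (5), and returns the top_n definitions ranked by global score GS."""
    n_criteria = 3
    # Maximum observed score for each criterion (the denominator of eq. 4);
    # fall back to 1.0 to avoid dividing by zero.
    maxima = [max(scores[j] for _, scores in candidates) or 1.0
              for j in range(n_criteria)]

    def global_score(scores):
        return sum(scores[j] / maxima[j] for j in range(n_criteria))

    ranked = sorted(candidates, key=lambda item: global_score(item[1]), reverse=True)
    return [definition for definition, _ in ranked[:top_n]]
```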
4. Experiments and results

This section describes two experiments carried out using the DefArabicQA system. The first experiment was carried out using the Google search engine6, while the second experiment was carried out using the Google search engine and the free encyclopedia Wikipedia Arabic version7. In both experiments, we used 50 organization definition questions8 similar to those used in TREC. The system output was assessed by an Arabic native speaker.
As evaluation metric, we use MRR. It is a measure used in the TREC QA track and it is calculated as follows: each question is assigned a score equal to the inverse rank of the first string that is judged to contain a correct answer. If none of the five answer strings contains an answer, the question is assigned a score of zero. The MRR value for the experiment is calculated by taking the average of the scores for all the questions (Voorhees, 2001).
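For example, the MRR over the question set can be computed as follows. This is a sketch; first_correct_ranks is a hypothetical list holding, for each question, the rank of the first correct answer or None when none of the five answers is correct.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR as described above: the average over questions of 1/rank of the
    first correct answer, counting 0 for questions with no correct answer."""
    scores = [1.0 / rank if rank is not None else 0.0
              for rank in first_correct_ranks]
    return sum(scores) / len(scores)

# Example: two questions answered at ranks 1 and 2, one left unanswered.
print(round(mean_reciprocal_rank([1, 2, None]), 2))  # 0.5
```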
6 http://www.google.com/intl/ar/
7 http://ar.wikipedia.org/w/index.php?title=بحث:خاص&search=&go=اذھب
8 Resources available for research purpose at: http://sites.google.com/site/omartrigui/downloads
4.1 Results of the first experiment

Out of the 50 questions in the test collection, 41 questions (82%) were answered correctly by complete definitions in the top-five candidate definitions. 54% of the questions were answered by the first candidate definition returned, 14% by the second candidate definition, 6% by the third candidate definition, 6% by the fourth candidate definition, and 2% by the fifth candidate definition, as shown in Table 2. The system missed 18% of the questions, as shown in Table 3. MRR was equal to 0.70, as shown in Table 4.

4.2 Results of the second experiment

The main goal of the second experiment is to measure the value added by the Web resource Wikipedia to the results obtained in the first experiment with the Google search engine.
In this experiment, we used the same set of questions as in the first experiment, with the Google search engine and Wikipedia as Web resources. Out of the 50 questions in the test collection, 45 questions (90%) were answered correctly by complete definitions in the top-five candidate definitions. 64% of the questions were answered by the first returned candidate definition, 16% by the second candidate definition, 4% by the third candidate definition, 2% by the fourth candidate definition, and 4% by the fifth candidate definition, as shown in Table 2. The system missed 10% of the questions, as shown in Table 3. The obtained value of MRR is 0.81 (see Table 4).

           Experiment I   Experiment II
Rank 1st   27 (54%)       32 (64%)
Rank 2nd   7 (14%)        8 (16%)
Rank 3rd   3 (6%)         2 (4%)
Rank 4th   3 (6%)         1 (2%)
Rank 5th   1 (2%)         2 (4%)
Top-five   41 (82%)       45 (90%)

Table 2. Rate of the answered questions for each rank (the Top-5 positions)

           Experiment I   Experiment II
Top-5      9 (18%)        5 (10%)

Table 3. Rate of non-answered questions (in the Top-5 positions)

           Experiment I   Experiment II
MRR        0.70           0.81

Table 4. MRR values for both experiments

5. Discussion

The two experiments cited above showed that the approach applied in the DefArabicQA system returned reasonably good results.
The Web resource Wikipedia improved the results of DefArabicQA when it was coupled with Google in the second experiment. The MRR increased from 0.70 (in the first experiment) to 0.81 (in the second experiment), and the rate of non-answered questions in the Top-5 positions decreased from 18% (in the first experiment) to 10% (in the second experiment). Also, the rate of questions answered by the first returned candidate definition increased from 54% (in the first experiment) to 64% (in the second experiment).

6. Conclusion and future work

In this paper we proposed a definitional Question Answering system called DefArabicQA. This system provides effective and exact answers to definition questions expressed in the Arabic language from Web resources. It is based on an approach which employs little linguistic analysis and no language understanding capability. DefArabicQA identifies candidate definitions by using a set of lexical patterns, filters these candidate definitions by using heuristic rules, and ranks them by using a statistical approach.
Two evaluation experiments have been carried out on DefArabicQA. The first experiment was based on Google as a Web resource and obtained an MRR equal to 0.70 and a rate of questions answered by the first answer equal to 54%, while the second experiment was based on Google coupled with Wikipedia as Web resources. In this experiment, we obtained an MRR equal to 0.81 and a rate of questions answered by the first answer equal to 64%. 50 definition questions were used for both experiments.
As future work, we plan to improve the quality of the definitions when they are truncated. Indeed, in some cases, a few words are missing at the end of the definition answer. This is due to the fact that the snippet itself is truncated. As a solution, we will download the original Web page and segment the useful snippet correctly using a tokenizer. We also plan to conduct an empirical study to determine different weights for the three criteria used for ranking the candidate definitions. These weights will reflect the importance of each criterion.

Acknowledgments

This research work started thanks to the bilateral Spain-Tunisia research project on "Answer Extraction for Definition Questions in Arabic" (AECID-PCI B/017961/08).
The work of the third author was carried out in the framework of the AECID-PCI C/026728/09 and TIN2009-13391-C04-03 research projects.
Benajiba, Y., Rosso, P., and J.M. Gomez. (2007.b).
Adapting the JIRS Passage Retrieval System to
the Arabic Language. In Proceeding of CICLing
conference, Springer-Verlag, 2007. pages 530--
541.

Benajiba, Y., Rosso., P. and Benedi Ruiz. J.M.


(2007.c). ANERsys: An Arabic Named Entity
Recognition System Based on Maximum
Entropy. In Proceeding of CICLing conference,
volume 4394 of Lecture Notes in Computer
Science, Springer-Verlag, 2007. pages 143--153.

Brini, W., Ellouze, M., Mesfar, S., Hadrich


Belguith, L. (2009). An Arabic Question-
Answering system for factoid questions. In
Proceeding of IEEE International Conference on
Natural Language Processing and Knowledge
Engineering. pages 797--805.

Hammou, B., Abu-salem, H., Lytinen, S., and


Evens. M. (2002). QARAB: A question
answering system to support the Arabic language.
In Proceedings of the workshop on
computational approaches to Semitic languages,
ACL, pages 55--65.

Hammo, B., Ableil, S., Lytinen S., and Evens, M.


(2004). Experimenting with a question answering
system for the Arabic language. In Computers
and the humanities, vol. 38, N°4, pages 397--
415.

Kanaan, G., Hammouri, A., Al-Shalabi R., and


Swalha. M. (2009). A New Question Answering
System for the Arabic Language. In American
Journal of Applied Sciences 6 (4), pages 797--
805.

Grunfeld, L., and Kwok, K.L. (2006). Sentence


ranking using keywords and meta-keywords. In:
Advances in Open Domain Question Answering,
T. Strzalkowski and S. Harabagiu, (eds.).
Springer, pages 229--258.

Voorhees, E., (2001). Overview of the TREC 2001


Question Answering Track. In Proceedings of the
10th Text REtrieval Conference, pages 42–51.

Techniques for Arabic Morphological Detokenization
and Orthographic Denormalization
Ahmed El Kholy and Nizar Habash
Center for Computational Learning Systems, Columbia University
475 Riverside Drive New York, NY 10115
{akholy,habash}@ccls.columbia.edu

Abstract
The common wisdom in the field of Natural Language Processing (NLP) is that orthographic normalization and morphological tok-
enization help in many NLP applications for morphologically rich languages like Arabic. However, when Arabic is the target output,
it should be properly detokenized and orthographically correct. We examine a set of six detokenization techniques over various tok-
enization schemes. We also compare two techniques for orthographic denormalization. We discuss the effect of detokenization and
denormalization on statistical machine translation as a case study. We report on results which surpass previously published efforts.

1. Introduction though the implementation details are different, their so-


Arabic is a morphologically rich language. The common lution is comparable to one of our new (but not top per-
wisdom in the field of natural language processing (NLP) forming) decomposition models (T+LM). We do not com-
is that tokenization of Arabic words through decliticiza- pare directly to their implementation approach in this pa-
tion and reductive orthographic normalization is helpful for per. Regarding English-to-Arabic MT, Sarikaya and Deng
many applications such as language modeling and statisti- (2007) use joint morphological-lexical language models to
cal machine translation (SMT). Tokenization and normal- re-rank the output English-dialectal Arabic MT; and Badr
ization reduce sparsity and decrease the number of out-of- et al. (2008) report results on the value of morphological
vocabulary (OOV) words. However, in order to produce tokenization of Arabic during training and describe differ-
proper Arabic that is orthographically correct, tokenized ent techniques for detokenization of Arabic in the output.
and orthographically normalized words should be detok- The research presented here is most closely related to that
enized and orthographically corrected (enriched). As an ex- of Badr et al. (2008). We extend on their contribution and
ample, the output of English-to-Arabic machine translation present a comparison of a larger number of tokenization
(MT) systems is reasonably expected to be proper Arabic schemes and detokenization techniques that yield improved
regardless of the preprocessing used to optimize the MT results over theirs.
performance. Anything less is comparable to producing
3. Arabic Linguistic Issues
all lower-cased English or uncliticized and undiacritized
French. Detokenization is not a simple task because there In this section, we present relevant aspects of Arabic word
are several morphological adjustments that apply in the pro- orthography and morphology.
cess. In this paper we examine different detokenization 3.1. Arabic Orthography
techniques for various tokenization schemes and their ef-
fect on SMT output as a case study. Certain letters in Arabic script are often spelled inconsis-
tently which leads to an increase in both sparsity (multi-
This paper is divided as follows. Section 2 presents the
ple forms of the same word) and ambiguity (same form
previous related work. In Section 3, we discuss the Arabic
corresponding to multiple words). In particular, variants
linguistic issues and complexities that motivate the deto-
kenization techniques explained in Section 4. Section 5 of Hamzated Alif, @ Â1 or @ Ǎ are often written without
describes the various experiments we had followed by an their Hamza ( Z ’): @ A; and the Alif-Maqsura (or dotless
analysis of the results. Ya) ø ý and the regular dotted Ya ø y are often used inter-
changeably in word final position. This inconsistent vari-
2. Related Work ation in raw Arabic text is typically addressed in Arabic
Much work has been done on Arabic-to-English MT NLP through what is called orthographic normalization, a
(Habash and Sadat, 2006; Lee, 2004; Zollmann et al., 2006) reductive process that converts all Hamzated Alif forms to
mostly focusing on reducing the sparsity caused by Ara- bare Alif and dotless Ya/Alif Maqsura form to dotted Ya.
bic’s rich morphology. There is also a growing number We will refer to this kind of normalization as a Reduced
of publications with Arabic as target language. In previ- normalization (R ED). We introduce a different type of nor-
ous work on Arabic language modeling, OOV reduction malization that selects the appropriate form of the Alif. We
was accomplished using morpheme-based models (Heintz, call this Enriched normalization (E NR). E NR Arabic is op-
2008). Diehl et al. (2009) also used morphological decom- timally the desired correct form of Arabic to generate.
position for Arabic language modeling for speech recog-
nition. They described an SMT approach to detokeniza- 1
All Arabic transliterations are provided in the Habash-Soudi-
tion (or what they call morpheme-to-word conversion). Al- Buckwalter transliteration scheme (Habash et al., 2007).
Comparing a manually enriched (E NR) version of the Penn occurring Arabic text and English-Arabic SMT outputs. To
Arabic Treebank (PATB) (Maamouri et al., 2004) to its re- that end, we consider the following variants:
duced (R ED) version, we find that 16.2% of the words are
different. However, the raw version of the PATB is only dif- 4.1. Tokenization
ferent in 7.4% of the words. This suggests a major problem
We consider five tokenization schemes discussed in the lit-
in the recall of the correct E NR form in raw text.
erature, in addition to a baseline no-tokenization scheme
Another orthographic issue is the optionality of diacritics
(D0). The D1, D2, TB and D3 schemes were first pre-
in Arabic script. In particular, the absence of the Shadda
sented by Habash and Sadat (2006) and the S2 scheme was
diacritic (  ∼) which indicates a doubling of the consonant
presented by Badr et al. (2008). The S1 scheme used by
it follows leads to a different number of letters in the to-
Badr et al. (2008) is the same as Habash and Sadat (2006)’s
kenized and untokenized word forms (when the tokeniza-
D3 scheme. TB is the PATB tokenization scheme. We use
tion happens to split the two doubled consonants). See the
the Morphological Analysis and Disambiguation for Ara-
example in Table 1 under (Y-Shadda). Consequently, the
bic (MADA) toolkit (Habash and Rambow, 2005) to pro-
detokenization task for such cases is not a simple string
duce the various tokenization schemes. The schemes are
concatenation.
presented in Table 2 with various relevant statistics. The
3.2. Arabic Morphology schemes differ widely in terms of the increase of number
of tokens and the corresponding type count reduction. The
Arabic is a morphologically complex language with a large
more verbose schemes, i.e., schemes with more splitting,
set of morphological features producing a large number of
have lower out-of-vocabulary (OOV) rates and lower per-
rich word forms. While the number of (morphologically
plexity but are also harder to predict correctly.
untokenized) Arabic words in a parallel corpus is 20% less
than the number of corresponding English words, the num-
ber of unique Arabic word types is over twice the number 4.2. Detokenization
of unique English word types over the same corpus size. We compare the following techniques for detokenization:
One aspect of Arabic that contributes to this complexity is
its various attachable clitics. We define three degrees of • Simple (S): concatenate clitics to word without apply-
cliticization that are applicable in a strict order to a word ing any orthographic or morphological adjustments.
base:

[cnj+ [prt+ [art+ BASE +pro]]] • Rule-based (R): use deterministic rules to handle all
of the cases described in Table 1. We pick the most
At the deepest level, the BASE can have either the defi- frequent decision for ambiguous cases.
nite article (+ È@ Al+ ‘the’) or a member of the class of
pronominal enclitics, +pro, (e.g., Ñë+ +hm ‘their/them’). • Table-based (T): use a lookup table mapping tokenized
Next comes the class of particle proclitics (prt+), e.g., + È l+ forms to detokenized forms. The table is based on
‘to/for’. At the shallowest level of attachment we find the pairs of tokenized and detokenized words from our
conjunction proclitic (cnj+), e.g., + ð w+ ‘and’. The attach- language model data which had been processed by
ment of clitics to word forms is not a simple concatenation MADA. We pick the most frequent decision for am-
process. There are several orthographic and morphological biguous cases. Words not in the table are handled
adjustment rules that are applied to the word. An almost with the (S) technique. This technique essentially se-
complete list of these rules relevant to this paper are pre- lects the detokenized form with the highest conditional
sented and exemplified in Table 1. probability P (detokenized|tokenized).
It is important to make the distinction here between simple
word segmentation, which splits off word substrings with • Table+Rule(T+R): same as (T) except that we back off
no orthographic/morphological adjustments, and tokeniza- to (R) not (S).
tion, which does. Although segmentation by itself can have
important advantages, it leads to the creation of inconsistent The above four techniques are the same as those used by

or ambiguous word forms: consider the words éJ.JºÓ mktb~ Badr et al. (2008). We introduce two new techniques that
‘library’ and ÑîDJ. JºÓ mktbthm ‘their library’. A simple seg- use a 5-gram untokenized-form language model and the
mentation of the second word creates the non-word string disambig utility in the SRILM toolkit (Stolcke, 2002) to
 . JºÓ mktbt; however, applying adjustment rules as part of
I decide among different alternatives:
the tokenization generates the same form of the basic word
in the two cases. For more details, see (Habash, 2007). In • T+LM: we use all the forms in the (T) approach. Al-
this paper, we do not explore morphological tokenization ternatives are given different conditional probabilities,
beyond decliticization. P (detokenized|tokenized), derived from the tables.
Backoff is the (S) technique. This technique essen-
4. Approach tially selects the detokenized form with the highest
We would like to study the value of a variety of detokeniza- P (detokenized|tokenized) × PLM (detokenized).
tion techniques over different tokenization schemes and or-
thographic normalization. We report results on naturally • T+R+LM: same as (T+LM) but with (R) as backoff.
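The following is a rough sketch of the table-based detokenization with rule backoff (T+R) listed above. It is a simplification, not the authors' implementation; the token format, lookup table and rule set shown here are assumptions.

```python
import re

def detokenize_t_r(tokenized_word, lookup_table, adjustment_rules):
    """Table-based detokenization with rule-based backoff (the T+R technique).

    tokenized_word:   a clitic-segmented unit, e.g. 'w+ Al+ mktb' in a
                      Buckwalter-style transliteration (assumed format).
    lookup_table:     dict mapping tokenized forms to their most frequent
                      detokenized form, harvested from MADA-processed text.
    adjustment_rules: list of (regex, replacement) pairs approximating the
                      orthographic/morphological adjustments of Table 1.
    """
    # T: use the most frequent observed detokenization if the form was seen.
    if tokenized_word in lookup_table:
        return lookup_table[tokenized_word]
    # R (backoff): concatenate the clitics, then apply deterministic adjustments.
    word = re.sub(r"\s*\+\s*", "", tokenized_word)
    for pattern, replacement in adjustment_rules:
        word = re.sub(pattern, replacement, word)
    return word
```

The T+LM and T+R+LM variants described above would instead keep all detokenization alternatives from the table with their conditional probabilities and let the 5-gram untokenized-form language model (via SRILM's disambig utility) choose among them.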
Rule Name Condition Result Example
Definite Article ? È+ È@+ È l+Al+l? + ÉË ll+ I.JºÓ+ È@+ È l+Al+mktb I.JºÒÊË llmktb ‘for the office’
éJm.Ì+ È@+ È l+Al+ljn~ éJj.ÊË lljn~ ‘for the committee’
Ta-Marbuta
è- -~ +pron H - -t +pron Ñë+ éJ . JºÓ mktb~+hm ÑîDJ. JºÓ mktbthm ‘their library’
Alif-Maqsura ø- -ý +pron @- -A +pron è+ øðP rwY+h è@ð P rwAh ‘he watered it’
exceptionally ø- -y +pron è+ úΫ ςlY+h éJÊ« ςlyh ‘on him’
Waw-of-Plurality @ð- -wA +pron ð- -w +pron è+ @ñJ.J» ktbwA+h èñJ.J» ktbwh ‘they wrote it’
Õç'- -tm +pron ñÖß- -tmw +pronè+ ÕæJ . J» ktbtmw+h èñÒJ.J» ktbtmwh ‘you [pl.] wrote it’
Hamza Z- -’ +pron ø- -ŷ +pron è+ ZAîE. bhA’+h éKAîE. bhAŷh ‘his glory [gen.]’
less frequently ð- -ŵ +pron è+ ZAîE. bhA’+h èðAîE. bhAŵh ‘his glory [nom.]’
less frequently Z- -’ +pron è+ ZAîE. bhA’+h èZAîE. bhA’h ‘his glory [acc.]’
Y-Shadda ø+ ø- -y +y øy ø+ úæ•A¯ qADy+y úæ•A¯ qADy ‘my judge’
N-Assimilation áÓ mn +m/n Ð m +m/n AÓ+ áÓ mn+mA AÜØ mmA ‘from which’
á« ςn +m/n ¨ ς +m/n áÓ+ á« ςn+mn áÔ« ςmn ‘about whom’
 
B+ à @ Ân +lA B @ ÂlA B+ à @ Ân+lA B @ ÂlA ‘that ... not’

Table 1: Orthographic and Morphological Adjustment Rules

Change Relative to D0 Prediction Error Rate OOV Perplexity


Definition E NR R ED
Token# E NR R ED S EG E NR R ED E NR R ED
Type# Type#
D0 word 0.62 0.09 0.00 2.22 2.17 412.3 410.6
D1 cnj+ word +7.2 -17.6 -17.8 0.76 0.23 0.14 1.91 1.89 259.3 258.2
D2 cnj+ prt+ word +13.3 -32.3 -32.6 0.89 0.37 0.25 1.50 1.50 185.5 184.7
TB cnj+ prt+ word +pro +17.9 -43.9 -44.2 1.07 0.57 0.42 1.22 1.22 142.2 141.5
S2 cnj+prt+art word +pro +40.6 -53.0 -53.3 1.20 0.73 0.60 0.91 0.91 69.3 69.0
D3 cnj+ prt+ art+ word +pro +44.2 -53.0 -53.3 1.20 0.73 0.60 0.90 0.90 61.9 61.7

Table 2: A comparison of the different tokenization schemes studied in this paper in terms of their definition, the relative
change from no-tokenization (D0) in tokens (Token#) and enriched and reduced word types (E NR Type# and R ED Type#),
MADA’s error rate in producing the enriched tokens, the reduced tokens and just segmentation (S EG); the out-of-vocabulary
(OOV) rate; and finally the perplexity value associated with different tokenization. OOV rates and perplexity values are
measured against the NIST MT04 test set while prediction error rates are measured against a Penn Arabic Treebank devset.

4.3. Normalization 5. Experimental Results


We consider two kinds of orthographic normalization 5.1. Detokenization
schemes, enriched Arabic (E NR) and reduced Arabic We compare the performance of the different detokeniza-
(R ED). For tokenized enriched forms, the detokenization tion techniques discussed in Section 4. for the E NR and the
produces the desired output. In case of reduced Arabic, we R ED normalization conditions. The performance of the dif-
consider two alternatives to automatic orthographic enrich- ferent techniques is measured against the Arabic side of the
ment. First, we use MADA to enrich Arabic text after deto- NIST MT evaluation set for 2004 and 2005 (henceforth,
kenization (MADA-E NR). MADA can predict the correct MT04+MT05) which together have 2,409 sentences com-
enriched form of Arabic words at 99.4%.2 Alternatively, prising 64,554 words. We report the results in Table 3 in
we jointly detokenize and enrich using detokenization ta- terms of sentence-level detokenization error rate defined as
bles that map reduced tokenized words to their enriched the percentage of sentences with at least one detokeniza-
detokenized form (Joint-D E T OK-E NR). tion error. The best performer across all conditions is the
In terms of evaluation, we report our results in both reduced T+R+LM technique. The previously reported best per-
and enriched Arabic forms. We only compare in the match- former was T+R (Badr et al., 2008), which was only com-
ing form, i.e., reduced hypothesis to reduced reference and pared with D3 and S2 tokenizations only.
enriched hypothesis to enriched reference. As illustrated in the results, the more complex the tokeniza-
tion scheme, the more prone it is to detokenization errors.
Moreover, R ED has equal or worse results than E NR un-
2 der all conditions except for the S detokenization technique
Statistics are measured on a devset from the Penn Arabic
Treebank (Maamouri et al., 2004). with the TB, S2 and D3 schemes. This is a result of the S
detokenization technique not performing any adjustments, two and three. Columns four and five present the two
which leads to the never-word-internal Alif-Maqsura char- approaches to enriching the tokenized reduced text. Al-
acter appearing incorrectly in word-internal positions in though the Joint-D E T OK-E NR technique does not outper-
E NR. While for R ED, the Alif-Maqsura is reductively nor- form MADA-E NR for T and T+R, it significantly benefits
malized to Ya, which is the correct form in some of the from the use of the LM extension to these two techniques.
cases. In fact, Joint-D E T OK-E NR produces the best results overall
The results for S2 and D3 are identical because these two under T+R+LM, with an error rate that is 20% lower than
schemes only superficially differ in whether proclitics are the best performance by MADA-E NR. Overall, however,
space-separated or not. Similarly, TB results are identical enriching and detokenizing R ED text yields output that has
to D3 for the S and R techniques. This can be explained almost 10 times the error rate compared to detokenizing
by the fact that the only difference between the D3 and TB E NR. This is expected since E NR is far less ambiguous than
schemes is that the definite article is attached to the word (in R ED. The best performer across all conditions for detok-
TB and not D3), a difference that does not produce different enization and enrichment is the T+R+LM approach.
results under the deterministic S and R techniques. All experiments reported so far in this paper start with a per-
We analyze the errors (14 cases) for the T+R+LM tech- fect pairing between the original and tokenized words. The
nique on D3 scheme and classify them into two categories. real challenge is applying the detokenization techniques on
The first category comprises 11 cases (≈ 80% of the er- automatically produced (noisy) text. The next section dis-
rors) and is caused by ambiguity resulting from the lack cusses the effect of detokenization on SMT output as a case
of diacritical marks. Seven (50% overall) of these errors study.
involve the selection of the correct Hamza form before
a pronominal enclitic. For example, the tokenized word 5.3. Tokenization and Detokenization for SMT
In this section we present English-to-Arabic SMT as a case
Aë+ ZA® ƒ @+ð w+šqA’+hA ‘and+siblings+her’ can be deto- study for the effect of tokenization in improving the qual-
kenized to AëZA® ƒ @ð wšqA’hA or AîEA® ƒ @ð wšqAŷhA or ity of translation. Then, we show the performance of the
AëðA® ƒ @ð wšqAŵhA depending on the grammatical case of different detokenization techniques on the output and their
 reflections over the overall performance of the SMT sys-
the noun ZA®ƒ @ šqA’, which is only expressible as a dia- tems.
critical mark. The other four cases involve two closed class
words, à@ Ǎn and áºË lkn, each of which corresponding 5.3.1. Experimental Data
to two diacritized forms that require different adjustments. All of the training data we use is available from the Linguis-
For example, the tokenized word úG+ à@ Ǎn+ny can be deto- tic Data Consortium (LDC).3 We use an English-Arabic
 parallel corpus of about 142K sentences and 4.4 mil-
kenized to úG@ Ǎny ( úG+ à@ Ǎin+niy → úG@ Ǎin∼iy) or úæK@ lion words for translation model training data. The par-
  allel text includes Arabic News (LDC2004T17), eTIRR
Ǎnny ( úG+ à@ Ǎin∼a+niy → úæK@ Ǎin∼aniy). In many cases,
(LDC2004E72), English translation of Arabic Treebank
the n-gram language model is able to select for the correct (LDC2005E46), and Ummah (LDC2004T18). Lemma
form, but it is not always successful. The second category based word alignment is done using GIZA++ (Och and
of errors compromises 3 cases (≈ 20% of the errors) which Ney, 2003). For language modeling, we use 200M words
involve automatic tokenization failures producing tokens from the Arabic Gigaword Corpus (LDC2007T40) together
that are impossible to map back to the correct detokenized with the Arabic side of our training data. Twelve language
form. models were built for all combinations of normalization and
tokenization schemes. We used 5-grams for all LMs unlike
5.2. Orthographic Enrichment and Detokenization (Badr et al., 2008) who used different n-grams sizes for tok-
As previously mentioned, it’s desirable for Arabic- enized and untokenized variants. All LMs are implemented
generating automatic applications to produce orthograph- using the SRILM toolkit (Stolcke, 2002).
ically correct Arabic. As such, reduced tokenized out- MADA is used to preprocess the Arabic text for translation
put should be enriched and detokenized to produce proper modeling and language modeling. MADA produced all en-
Arabic. We compare next the two different enrichment riched forms and tokenizations. Due to the fact that the
techniques discussed in Section 4.: using MADA to en- number of tokens per sentence changes from one tokeniza-
rich detokenized reduced text (MADA-E NR) versus deto- tion scheme to another, we filter the training data so that
kenizing and enriching in one joint step (Joint-D E T OK- all experiments are done on the same number of sentences.
E NR). We consider the effect of applying these two tech- We use the D3 tokenization scheme as a reference and set
niques together with the various detokenization techniques the cutoff at 100 D3 tokens. English preprocessing simply
when possible. The comparison is presented for D3 in included down-casing, separating punctuation from words
Table 4. D3 has the highest number of tokens per word and splitting off “’s”.
and it’s the hardest to detokenize as shown in Table 3. All experiments are conducted using the Moses phrase-
The MADA-E NR enrichment technique can be applied to based SMT system (Koehn et al., 2007). The decoding
the output of all detokenization techniques; however, the weight optimization was done using a set of 300 sentences
Joint-D E T OK-E NR enrichment technique can only be used from the 2004 NIST MT evaluation test set (MT04). The
as part of table-based detokenization techniques. The re-
3
sults for basic E NR and R ED detokenization are in columns http://www.ldc.upenn.edu
S R T T+R T+LM T+R+LM
E NR R ED E NR R ED E NR R ED E NR R ED E NR R ED E NR R ED
D1 0.17 0.17 0.17 0.17 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08
D2 22.50 22.50 0.58 0.79 0.37 0.37 0.21 0.21 0.37 0.37 0.21 0.21
TB 38.36 35.53 1.41 3.03 1.33 1.49 0.75 0.91 1.16 1.25 0.58 0.66
S2 38.36 35.53 1.41 3.03 1.37 1.54 0.79 0.95 1.20 1.29 0.62 0.71
D3 38.36 35.53 1.41 3.03 1.37 1.54 0.79 0.95 1.20 1.29 0.62 0.71

Table 3: Detokenization results in terms of sentence-level detokenization error rate.

E NR R ED
Detokenization
E NR R ED MADA-E NR Joint-D E T OK-E NR
S 38.36 35.53 39.73
R 1.41 3.03 10.59
T 1.37 1.54 8.92 9.46
T+R 0.79 0.95 8.68 9.22
T+LM 1.20 1.29 9.34 6.23
T+R+LM 0.62 0.71 7.39 5.89

Table 4: Detokenization and enrichment results for D3 tokenization scheme in terms of sentence-level detokenization error
rate.

tuning is based on the tokenized Arabic without detokenization. We use a maximum phrase length of size 8 for all experiments. We report results on the 2005 NIST MT evaluation set (MT05). These test sets were created for Arabic-English MT and have 4 English references. We use only one Arabic reference in the reverse direction for both tuning and testing. We evaluate using BLEU-4 (Papineni et al., 2002) although we are aware of its caveats (Callison-Burch et al., 2006).

5.3.2. Tokenization Experiments

System       ENR             RED
Evaluation   ENR     RED     ENR     RED
D0           24.63   24.67   24.66   24.71
D1           25.92   25.99   26.06   26.12
D2           26.41   26.49   26.06   26.15
TB           26.46   26.51   26.73   26.80
S2           25.71   25.76   26.11   26.19
D3           25.68   25.75   25.03   25.10

Table 5: Comparing different tokenization schemes for statistical MT in BLEU scores over detokenized Arabic (using the T+R+LM technique).

We compare the performance of the different tokenization schemes and normalization conditions. The results are presented in Table 5 using the T+R+LM detokenization technique. The best performer across all conditions is the TB scheme. The previously reported best performer was S2 (Badr et al., 2008), which was only compared against the D0 and D3 tokenizations. Our results are consistent with Badr et al. (2008)'s results regarding D0 and D3. However, our TB result outperforms S2. The differences between TB and all other conditions are statistically significant above the 95% level. Statistical significance is computed using paired bootstrap resampling (Koehn, 2004). Training over RED Arabic then enriching its output sometimes yields better results than training on ENR directly, which is the case with the TB tokenization scheme. However, sometimes the opposite is true, as demonstrated in the D3 results. This is due to the tradeoff between the quality of translation and the quality of detokenization, which is discussed in the next section.
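The paired bootstrap resampling test cited above (Koehn, 2004) is straightforward to reproduce. The sketch below is not the authors' implementation; it assumes a corpus_bleu(hypotheses, references) function supplied by the caller (any standard BLEU implementation will do), and the 1,000-sample setting is illustrative.

import random

def paired_bootstrap(hyps_a, hyps_b, refs, corpus_bleu, n_samples=1000, seed=0):
    """Estimate how often system A beats system B on resampled test sets.

    hyps_a, hyps_b: hypothesis sentences from the two systems (same order as refs).
    refs: one list of reference translations per sentence.
    corpus_bleu: callable computing corpus-level BLEU from (hypotheses, references).
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; both systems share the same
        # draw, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if corpus_bleu(sample_a, sample_r) > corpus_bleu(sample_b, sample_r):
            wins_a += 1
    # Fraction of resampled test sets on which A outperforms B; a value of
    # 0.95 or more is conventionally read as significance at the 95% level.
    return wins_a / n_samples

Running such a test on, for example, the TB and S2 outputs and obtaining a fraction above 0.95 corresponds to the "significant above the 95% level" statement in the text.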
5.3.3. Detokenization Experiments
We measure the performance of the different detokenization techniques discussed in Section 4 against the SMT output for the TB tokenization scheme. We report results in terms of BLEU scores in Table 6. The results for basic ENR and RED detokenization are in columns two and three. Column four presents the results for the Joint-DETOK-ENR approach to joint enriching and detokenization of tokenized reduced output discussed in Section 4.

          Detokenization
          ENR      RED
          ENR      RED      Joint-DETOK-ENR
S         25.57    26.04    N/A
R         26.45    26.78    N/A
T         26.40    26.78    22.44
T+R       26.40    26.78    22.44
T+LM      26.46    26.80    26.73
T+R+LM    26.46    26.80    26.73

Table 6: BLEU scores for SMT outputs with different detokenization techniques over the TB tokenization scheme.

When comparing Table 6 (in BLEU scores) with the corresponding cells in Table 4 (in sentence-level detokenization error rate), we observe that the wide range of performance in Table 4 is not reflected in BLEU scores in Table 6. This is expected given the different natures of the tasks and metrics used. Although the various detokenization techniques do not preserve their relative order completely, the S technique remains the worst performer and T+R+LM remains the best in both tables. However, the R and T+LM techniques perform relatively much better with MT output than they do with naturally occurring text. The most interesting observation is perhaps that under the best performing T+R+LM technique, joint detokenization and enrichment (Joint-DETOK-ENR) outperforms ENR detokenization despite the fact that Joint-DETOK-ENR has over nine times the error rate in Table 4. This shows that improved MT quality using RED training data outweighs the lower quality of automatic enrichment.

5.3.4. SMT Detokenization Error Analysis
Since we do not have a gold detokenization reference for our MT output, we automatically identify detokenization errors resulting in non-words (i.e., invalid words). We analyze the SMT output for the D3 tokenization scheme and T+R+LM detokenization technique using the morphological analyzer component in the MADA toolkit,4 which provides all possible morphological analyses for a given word and identifies words with no analysis. We find 94 cases of words with no analysis out of 27,151 words (0.34%), appearing in 84 sentences out of 1,056 (7.9%). Most of the errors come from producing incompatible sequences of clitics, such as having a definite article with a pronominal clitic. For instance, the tokenized word Al+ςlAq~+nA 'the+relation+our' is detokenized to AlςlAqtnA, which is grammatically incorrect. This is not a detokenization problem per se but rather an MT error. Such errors could still be addressed with specific detokenization extensions such as removing either the definite article or the pronominal clitic.

4 This component uses the databases of the Buckwalter Arabic Morphological Analyzer (Buckwalter, 2004).
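The non-word check just described does not depend on any particular analyzer interface; any component that can report whether a word receives at least one morphological analysis can be used. Below is a minimal sketch of the counting procedure, with a hypothetical has_analysis(word) predicate standing in for the MADA/Buckwalter lookup used here.

def nonword_report(sentences, has_analysis):
    """Count detokenized output words that receive no morphological analysis.

    sentences: detokenized MT output sentences (strings).
    has_analysis: callable returning True if the analyzer produces at least
        one analysis for the word (hypothetical stand-in for the MADA-based
        component described in the text).
    """
    total_words = 0
    bad_words = 0
    bad_sentences = 0
    for sent in sentences:
        words = sent.split()
        total_words += len(words)
        misses = [w for w in words if not has_analysis(w)]
        bad_words += len(misses)
        if misses:
            bad_sentences += 1
    return {
        "nonword_rate": bad_words / max(total_words, 1),
        "affected_sentences": bad_sentences,
        "total_sentences": len(sentences),
    }

Applied to the D3/T+R+LM output, this kind of count is what yields the 0.34% word-level and 7.9% sentence-level figures reported above.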
6. Conclusions and Future Work
We presented experiments studying six detokenization techniques to produce orthographically correct and enriched Arabic text. We presented results on naturally occurring Arabic text and MT output against different tokenization schemes. The best technique under all conditions is T+R+LM for both naturally occurring Arabic text and MT output. Regarding enrichment, joint enrichment with detokenization gives better results than performing the two tasks in two separate steps. Moreover, the best setup for MT is training on RED text and then enriching and detokenizing the output using the joint technique.
In the future, we plan to investigate the creation of mappers trained on seen examples in our tables to produce ranked detokenized alternatives for unseen tokenized word forms. In addition, we plan to examine language modeling approaches that target Arabic's complex morphology such as factored LMs (Bilmes and Kirchhoff, 2003). We also plan to explore ways to make detokenization robust to MT errors.

7. Acknowledgement
The work presented here was funded by a Google research award. We would like to thank Ioannis Tsochantaridis, Marine Carpuat, Alon Lavie, Hassan Al-Haj and Ibrahim Badr for helpful discussions.

8. References
Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for english-to-arabic statistical machine translation. In Proceedings of ACL-08: HLT, Short Papers, pages 153–156, Columbus, Ohio, June. Association for Computational Linguistics.
Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the Human Language Technology Conference/North American Chapter of Association for Computational Linguistics (HLT/NAACL-03), pages 4–6, Edmonton, Canada.
T. Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, University of Pennsylvania, 2002. LDC Catalog No.: LDC2004L02, ISBN 1-58563-324-0.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of the 11th conference of the European Chapter of the Association for Computational Linguistics (EACL'06), pages 249–256, Trento, Italy.
F. Diehl, M.J.F. Gales, M. Tomalin, and P.C. Woodland. 2009. Morphological Analysis and Decomposition for Arabic Speech-to-Text Systems. In Proceedings of InterSpeech.
Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan, June. Association for Computational Linguistics.
Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation. In Proceedings of the 7th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL06), pages 49–52, New York, NY.
Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.
Nizar Habash. 2007. Arabic Morphological Representations for Machine Translation. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.
Ilana Heintz. 2008. Arabic language modeling with finite state transducers. In Proceedings of the ACL-08: HLT Student Research Workshop, pages 37–42, Columbus, Ohio, June. Association for Computational Linguistics.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Fed-
erico, N. Bertoldi, B. Cowan, W. Shen, C. Moran,
R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst.
2007. Moses: open source toolkit for statistical machine
translation. In Proceedings of the 45th Annual Meeting
of the Association for Computational Linguistics Com-
panion Volume Proceedings of the Demo and Poster Ses-
sions, pages 177–180, Prague, Czech Republic, June.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Em-
pirical Methods in Natural Language Processing Con-
ference (EMNLP’04), Barcelona, Spain.
Young-Suk Lee. 2004. Morphological analysis for statisti-
cal machine translation. In Proceedings of the 5th Meet-
ing of the North American Chapter of the Association for
Computational Linguistics/Human Language Technolo-
gies Conference (HLT-NAACL04), pages 57–60, Boston,
MA.
Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wig-
dan Mekki. 2004. The Penn Arabic Treebank : Building
a Large-Scale Annotated Arabic Corpus. In NEMLAR
Conference on Arabic Language Resources and Tools,
pages 102–109, Cairo, Egypt.
Franz Josef Och and Hermann Ney. 2003. A Systematic
Comparison of Various Statistical Alignment Models.
Computational Linguistics, 29(1):19–52.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. 2002. BLEU: a Method for Automatic Evaluation
of Machine Translation. In Proceedings of the 40th An-
nual Meeting of the Association for Computational Lin-
guistics, pages 311–318, Philadelphia, PA.
Ruhi Sarikaya and Yonggang Deng. 2007. Joint
morphological-lexical language modeling for machine
translation. In Human Language Technologies 2007:
The Conference of the North American Chapter of the
Association for Computational Linguistics; Companion
Volume, Short Papers, pages 145–148, Rochester, New
York, April. Association for Computational Linguistics.
Andreas Stolcke. 2002. SRILM - an Extensible Language
Modeling Toolkit. In Proceedings of the International
Conference on Spoken Language Processing (ICSLP),
volume 2, pages 901–904, Denver, CO.
Andreas Zollmann, Ashish Venugopal, and Stephan Vogel.
2006. Bridging the inflection morphology gap for ara-
bic statistical machine translation. In Proceedings of the
Human Language Technology Conference of the NAACL,
Companion Volume: Short Papers, pages 201–204, New
York City, USA. Association for Computational Linguis-
tics.

Tagging Amazigh with AnCoraPipe

Mohamed Outahajala, Lahbib Zekouar, Paolo Rosso, M. Antònia Martí


Institut Royal de la Culture Amazighe, Ecole Mohammadia des Ingénieurs, Natural Language Engineering Lab –
EliRF(DSIC), CLiC - Centre de Llenguatge i Computació
Avenue Allal El Fassi, Madinat Al Irfane - Rabat - Instituts Adresse postale : BP 2055 Hay Riad Rabat Morocco, Avenue
Ibnsina B.P. 765 Agdal Rabat Morocco, Universidad Politécnica de Valencia, Spain, Universitat de Barcelona 08007
Barcelona, Spain
E-mail: outahajala@ircam.ma, zenkouar@emi.ac.ma, prosso@dsic.upv.es, amarti@ub.edu

Abstract
Over the last few years, Moroccan society has seen a great deal of debate about the Amazigh language and culture. The creation of a new
governmental institution, namely IRCAM, has made it possible for the Amazigh language and culture to reclaim their rightful place in
many domains. Given that the Amazigh language still needs more tools and scientific work before its automatic processing can be
achieved, the aim of this paper is to present the features of the Amazigh language for the purpose of morphological annotation. Put
another way, the paper addresses the tagging of Amazigh with the multilevel annotation tool AnCoraPipe. This tool is
adapted to use a specific tagset to annotate Amazigh corpora in a newly defined writing system. This may well be viewed as the
first step towards automatic processing of the Amazigh language, the initial aim being to build a part-of-speech tagger.

1. Introduction
Amazigh (Berber) is spoken in Morocco, Algeria, Tunisia, Libya, and Siwa (an Egyptian Oasis); it is also spoken by many other communities in parts of Niger and Mali. It is a composite of dialects of which none have been considered the national standard, used by tens of millions of people in North Africa mainly for oral communication.
With the emergence of an increasing sense of identity, Amazigh speakers would very much like to see their language and culture rich and developed. To achieve such a goal, some Maghreb states have created specialized institutions, such as the Royal Institute for Amazigh Culture (IRCAM, henceforth) in Morocco and the High Commission for Amazigh (HCA) in Algeria. In Morocco, Amazigh has been introduced in mass media and in the educational system in collaboration with relevant ministries. Accordingly, a new Amazigh television channel was launched on March 1st, 2010, and it has become common practice to find Amazigh taught in various Moroccan schools as a subject.
Over the last 7 years, IRCAM has published more than 140 books related to the Amazigh language and culture, a number which exceeds the whole amount of Amazigh publications in the 20th century, showing the importance of an institution such as IRCAM. However, in Natural Language Processing (NLP) terms, Amazigh, like most non-European languages, still suffers from the scarcity of language processing tools and resources.
In this sense, since morphosyntactic tagging is an important and basic step in the processing of any given language, the main objective of this paper is to explain how we propose to supply the Amazigh language with this important tool.
For clarity reasons, this paper is organized as follows: in the first part we present an overview of the Amazigh language features. Then, we provide a brief retrospective on Amazigh morphology as conceived by IRCAM linguists. Next we give an overview on Amazigh corpora. The fourth section describes how to tag with AnCoraPipe and the fifth section deals with the Amazigh tagset.

2. The Amazigh language
Amazigh belongs to the Hamito-Semitic/"Afro-Asiatic" languages (Cohen 2007, Chaker 1989) with rich templatic morphology. In linguistic terms, the language is characterized by the proliferation of dialects due to historical, geographical and sociolinguistic factors. In Morocco, one may distinguish three major dialects: Tarifit in the North, Tamazight in the center and Tashlhiyt in the southern parts of the country. 50% of the Moroccan population speak Amazigh (Boukouss, 1995), but according to the last governmental demolinguistic data of 2004, the Amazigh language was spoken by only some 28% of the Moroccan population (around 10 million inhabitants), showing an important decrease in its use.
Amazigh standardization cannot be achieved without adopting a realistic strategy that takes into consideration its linguistic diversity (Ameur et al., 2006a; Ameur et al. 2006b). As far as the alphabet is concerned, and because of historical and cultural reasons, Tifinaghe has become the official graphic system for writing Amazigh. IRCAM kept only pertinent phonemes for Tamazight, so the number of the alphabetical phonetic entities is 33, but Unicode codes only 31 letters plus a modifier letter to form the two phonetic units: ⴳⵯ (gʷ) and ⴽⵯ (kʷ). The whole range of Tifinagh letters is subdivided into four subsets: the letters used by IRCAM, an extended set used also by IRCAM, other neo-tifinaghe letters in use and some attested modern Touareg letters. The number reaches 55 characters (Zenkouar 2004, Andries 2004). In order to rank strings and to create keyboard layouts for Amazigh in accordance with international standards, two other standards have been adapted (Outahajala and Zenkouar, 2004):
- ISO/IEC14651 standard related to international string

ordering and comparison method for comparing character which means “if”;
strings and description of the common template tailorable - Prepositions are always an independent set of characters
ordering; with respect to the noun they precede; however, if the
- Part 1: general principles governing keyboard layouts of preposition is followed by a pronoun, both the preposition
the standard ISO/IEC 9995 related to keyboard layouts for and the noun make a single whitespace-delimited string.
text and office systems. For example: ⵖⵔ (ɣr) “to, at” + ⵉ (i) “me” possessive
Most Amazigh words may be conceived of as having pronoun gives ⵖⴰⵔⵉ/ⵖⵓⵔⵉ (ɣari/ɣuri) “to me, at me, with
consonantal roots. They can have one, two, three or four me”;
consonants, and may sometimes extend to five. Words are - Particles are always isolated. There are aspect particles
made out of these roots by following a pattern (Chafiq such as ⴰⵇⵇⴰ (aqqa), ⴰⵔ (ar), ⴰⴷ (ad), particles of negation
1991). For example the word ‘aslmad’ is built up from the such as ⵓⵔ (ur), orientation particles like ⵏⵏ in ⴰⵡⵉ ⵏⵏ!(awi
root lmd “study” by following the pattern as12a3, where nn) “take it there” and a predicative particle ⴷ (d);
the number 1 is replaced by the first consonant of the root, - Determinants take always the form of single between two
number 2 is replaced by the second consonant of the root blank spaces. Determiners are divided into articles,
and number 3 is replaced by the 3rd consonant of the root. demonstratives, exclamatives, indefinite articles,
Concerning spelling, the system put by IRCAM is based on interrogatives, numerals, ordinals, possessives,
a set of rules and principles applied to “words” along which presentatives, quantifiers. ⴽⵓⵍⵍⵓ (kullu) “all” is a quantifier
the parsing of pronounced speech into written separated for instance;
words is effected. A grapheme, a written word, according - Amazigh punctuation marks are similar to the punctuation
to the spelling system is a succession of letters which can marks adopted in international languages and have the
sometimes be one letter delimited by whitespace or same functions. Capital letters, nonetheless, do not occur
punctuation. neither at the beginning of sentences nor at the initial of
The graphic rules for Amazigh words are set out as follows proper names.
(Ameur et al 2006a, 2006b, Boukhris et al 2008): The English terminology used above was extracted form
- Nouns consist of a single word occurring between two (Boumalk and Naït-Zerrad, 2009).
blank spaces. To the noun are attached the morphological
affixes of gender (masculine/ feminine), number 3. Amazigh corpora
(singular/plural) and state (free/construct) as it is shown in
the following examples: ⴰⵎⵣⴷⴰⵖ/ⵜⵜⴰⵎⵣⴷⴰⵖⵜⵜ (amzdaɣ 3.1 Amazigh corpora features
(masc.)/tamzdaɣt(fem.)) “a dweller”, ⴰⵎⵣⴷⴰⵖ/ⵉⵉⵎⵣⴷⴰⵖⵏⵏ
Amazigh corpora have the following characteristics:
(amzdaɣ (sing.)/imzdaɣn(plr.)) “dweller/dwellers”, and
- They are extracted from geographically circumscribed
ⴰⵎⵣⴷⴰⵖ/ⵓⵎⵣⴷⴰⵖ (amzdaɣ (free state)/umzdaɣ (construct
dialects;
state)). Kinship names constitute a special class since they
- Some varieties are less represented than others, or not
are necessarily determined by possessive markers which
studied at all;
form with them one word, for example: ⴱⴰⴱⴰⴽ ⴽ (babak)
- There is special need for a more general type of work
which means “your father”;
whose goal is to collect the data of all dialects;
- Quality names/adjectives constitute a single word along
- Existing publications are scattered and inaccessible in
with the morphological indicators of gender (masculine/
most cases. Some of them go back to the XIXth century
feminine), number (singular/plural), and state
and the beginning of the XXth century. The few existing
(free/construct);
copies of those references are only available in specialized
- Verbs are single graphic words along with its inflectional
libraries, mainly in France;
(person, number, aspect) or derivational morphemes. For
- General documents containing the data of all Amazigh
example: ⵜⵜⴰⵣⵣⵍ /ttazzl/which means “you run
dialects do not exist (phonetics, semantics, morphology,
(imperfective)”. The verb is separated by a blank space
phraseology…etc.).
from its predecessor and successor pronouns, i.e.: ⵢⴰⵙⵉ ⵜⵏ
- Some existing texts need revision because of
/ ⴰⴷ ⵜⵏ ⵢⴰⵙⵉ (“yasi tn / ad tn yasi” which means “he took
segmentation problems.
them / he will take them”);
To constitute an annotated corpus, we have chosen a list of
- Pronouns are isolated from the words they refer to.
corpora extracted from the Amazigh version of IRCAM’s
Pronouns in Amazigh are demonstrative, exclamative,
web site 1 , the periodical Inghmisn n usinag 2 (IRCAM
indefinite, interrogative, personal, possessive, or relative.
newsletter) and school textbooks. We were able to reach a
For instance, ⴰⴷ (ad) in the phrase ⴰⴱⵔⵉⴷ ⴰⴷ (abrid ad),
total number of words superior to 20k words. A
which means “this way”, is an example of a demonstrative
comparative quantity of corpora was used in tagging other
pronoun;
languages, for example (Allauzen and Bonneau-Maynard,
- An adverb consists of one word which occurs between
2008).
two blank spaces. Adverbs are divided into adverbs of
place, time, quantity, manner, and interrogative adverbs.
- Focus mechanisms, interjections and conjunctions are 1
written in the form of single words occurring between two www.ircam.ma
2
blank spaces. An example of a conjunction is: ⵎⵔ (mr) Freely downloadable from
http://www.ircam.ma/amz/index.php?soc=bulle

3.2 Writing systems U+2D56 ⵖ ɣ ‫غ‬ V, v 86, 118 G
Amazigh corpora produced up to now are written on the U+2D59 ⵙ s ‫س‬ S, s 83, 115 s
basis of different writing systems, most of them use U+2D5A ⵚ ṣ ‫ص‬ Ã, ã 195, 227 S
Tifinaghe-IRCAM (Tifinaghe-IRCAM makes use of U+2D5B ⵛ c ‫ش‬ C, c 67, 99 c
Tifinaghe glyphs but Latin characters) and Tifinaghe
U+2D5C ⵜ t ‫ت‬ T, t 84, 116 t
Unicode. It is important to say that the texts written in
U+2D5F ⵟ ṭ ‫ط‬ Ï, ï 207, 239 T
Tifinaghe Unicode are increasingly used.
Even though, we have decided to use a specific writing U+2D61 ⵡ w ‫ۉ‬ W, w 87, 119 w
system based on ASCII characters for the following U+2D62 ⵢ  +‫ي‬ ‫ي‬ Y, y 89, 121 y
reasons: U+2D63 ⵣ z ‫ز‬ Z, z 90, 122 z
- To have a common set of characters for annotated U+2D65 ⵥ ẓ ‫ژ‬ Ç, ç 199, 231 Z
corpora; No correspondant
- To facilitate texts treatment for annotators since ASCII U+2D6F ⵯ ɀ  in Tifinaghe-IRCAM °
characters are known by all systems;
- To handle its use due to the fact that people are still more Table1: The mapping from existing writing systems and the
familiar with Arabic and Latin writing systems. chosen writing system.
In Table 1 of correspondences between the different A transliteration tool was build in order to handle
writing systems and transliteration correspondences is transliteration to and from the chosen writing system and to
shown correct some elements such as the character “^” which
exists in some texts due to input error in entring some
Tifinaghe Used characters in Tifinaghe letters. So the sentence portion “ⴰⵙⵙ ⵏ ⵜⵎⵖⵔⴰ”
for tagging
Chosen characters

Unicode Transliteration Tifinaghe IRCAM


using Tifinaghe Unicode or “ass n tmvra” using
Tifinaghe-IRCAM will be transliterated as “ass n tmGra”
characters
Character

Arabic

codes
Code

(“When the day of the wedding arrives”).


Latin

4. AnCoraPipe tool
U+2D30 ⴰ a ‫ا‬ A, a 65, 97 a AnCoraPipe (Bertran et al. 2008) is a corpus annotation
U+2D31 ⴱ b ‫ب‬ B, b 66, 98 b
tool which allows different linguistic levels to be annotated
efficiently, since it uses the same format for all stages. The
U+2D33 ⴳ g ‫گ‬ G, g 71, 103 g
tool reduces the annotation time and makes easy the
U+2D33& ⴳ integration of the different annotators and the different
U+2D6F ⵯ gɀ +‫گ‬ Å, å 197, 229 g° annotation levels.
U+2D37 ⴷ d ‫د‬ D, d 68, 100 d The input documents may have a standard XML format,
U+2D39 ⴹ ḍ ‫ض‬ Ä, ä 196, 228 D allowing to represent tree structures (specially usefull at
U+2D3B ⴻ e3 ‫ﻳــــﻲ‬ E, e 69, 101 e syntactic anotation stages). As XML is a wide spread
standard, there are many tools available for its analysis,
U+2D3C ⴼ f ‫ف‬ F, f 70, 102 f
transformation and management.
U+2D3D ⴽ k  K, k 75, 107 k
AnCoraPipe includes an integrated search engine based on
U+2D3D& ⴽ XPath language (http://www.w3.org/TR/xpath/), which
U+2D6F ⵯ kɀ +‫گ‬ Æ, æ 198, 230 k allows to find structures of all kinds among the documents.
U+2D40 ⵀ h ‫ه‬ H, h 72,104 h For corpus analysis, an export tool can summarize the
U+2D40 ⵃ ḥ ‫ح‬ P, p 80,112 H attributes of all nodes in the corpus in a grid that can easily
U+2D44 ⵄ ε ‫ع‬ O, o 79, 111 E
be imported to basic analysis tools (such as Excel or
OpenOffice calc), statistical software (SPSS) or Machine
U+2D45 ⵅ x ‫خ‬ X, x 88, 120 x
Learning tools (Weka).
U+2D47 ⵇ q ‫ق‬ Q, q 81, 113 q A default tagset is provided in the standard installation. It
U+2D49 ⵉ i ‫ي‬ I, i 73, 105 i has been designed as generic as possible in order to match
U+2D4A ⵊ j ‫ج‬ J, j 74, 106 j the requisites of a wide amount of languages. In spite of
U+2D4D ⵍ l ‫ل‬ L, l 76, 108 l that, if the generic tagset is not useful, the interface is fully
U+2D4E ⵎ m ‫م‬ M, m 77, 109 m
customizable to allow different tagsets defined by the user.
In order to allow AnCoraPipe usable in a full variety of
U+2D4F ⵏ n ‫ن‬ N, n 78, 110 n
languages, the user can change the visualization font. This
U+2D53 ⵓ u ‫و‬ W, w 87, 119 u may help viewing non-latin scripts such as Chinese, Arabic
U+2D54 ⵔ r ‫ر‬ R, r 82, 114 r or Amazigh.
U+2D55 ⵕ ṛ  Ë, ë 203, 235 R AnCoraPipe is currently an Eclipse Plugin. Eclipse is an
extendable integrated development environment. With this
plugin, all features included in Eclipse are made available
note : different use in the IPA which uses the letter ə
3

for corpus annotation and developing. In particular, the In Table 2 the node Residual stands for attributes like
Eclipse’s collaboration and team plugins can be used to currency, number, date, math marks and other unknown
organize the work of a group of annotators. residual words.

Manual annotation is being carried out by a team of


5. AnCoraPipe for Amazigh linguists. Technically, manual annotation proceeds along
the requirements of the tool presented above.
AnCoraPipe allows the definition of different tagsets. We
have decided to work with a set of ASCII characters for the
A sample of annotated Corpora as presented in Section 3:
following reasons:
- Amazigh text corpora are written in different writing
Here follows the annotation of a sentence extracted from a
systems;
text about a wedding ceremony:
- Amazigh linguists are still familiar with Latin alphabets;
- the default tagset is a multilevel tagset;
“ass n tmGra, iwsn asn ayt tqbilt. illa ma issnwan, illa ma
- to simplify the interface for linguists;
yakkan i inbgiwn ad ssirdn”
- to avoid adding some tags which are not currently needed
as co-reference tags, syntactic tags...etc.
[English translation: “When the day of the wedding arrives,
the people of the tribe help them. Some of them cook; some
Based on the Amazigh language features presented above,
other help the guests get their hands washed ”]
Amazigh tagset may be viewed to contain 13 nodes with
two common attributes to each node: “wd” for “word” and
<sentence>
“lem” for “lemma”, whose values depend on the lexical
<n gen="m" lem="ass" num="s" state="free" wd="ass"/>
item they accompany.
<prep wd="n"/>
<n gen="f" lem="tamGra" num="s" state="construct"
Amazigh nodes and their attributes are set out in what
wd="tmGra"/>
follows:
<pu punct="comma" wd=","/>
<v aspect="perfective" gen="m" lem="aws" num="p"
attributes and subattributes with
person="3" wd="iwsn"/>
PoS number of values
<p gen="m" num="p" person="3" postype="personal"
gender(3), number(3), state(2),
wd="asn"/>
derivative(2),PoSsubclassification(4),
<d gen="m" num="p" postype="indefinite" wd="ayt"/>
person(3), possessornum(2),
<n gen="f" lem="taqbilt" num="s" postype="common"
Noun possessorgen(2)
state="construct" wd="tqbilt"/>
Adjective/ gender(3), number(3), state(2), <pu punct="period" wd="."/>
name of quality derivative(2), PoS subclassification(3) <v aspect="perfective" gen="m" lem="ili" num="s"
Verb gender(3), number(3), form(5), person="3" wd="illa"/>
aspect(3), negative(2), form(2) <p postype="relative" wd="ma"/>
gender(3), number(3), PoS <v aspect="imperfective" gen="m" lem="ssnw" num="s"
subclassification(7), deictic(3), person="3" form="participle" wd="issnwan"/>
autonome(2), person(3), <pu punct="comma" wd=","/>
Pronoun possessornum(2), possessorgen(2) <v aspect=" perfective" gen="m" lem="ili" num="s"
gender(3), number(3), PoS person="3" wd="illa"/>
Determiner subclassification(11) <p postype="relative" wd="ma"/>
Adverb PoS subclassification(5) <v aspect="imperfective" form="participle" gen="m"
gender(3), number(3), PoS lem="fk" num="s" person="3" wd="yakkan"/>
subclassification(6), person(3), <prep wd="i"/>
Preposition possessornum(2), possessorgen(2) <n gen="m" lem="anbgi" num="p" state="construct"
Conjunction PoS subclassification(2) wd="inbgiwn"/>
Interjection <pr postype="aspect" wd="ad"/>
Particle PoS subclassification(5) <v aspect="aorist" gen="m" lem="ssird" num="p"
Focus person="3" wd="ssirdn"/>
PoS subclassification(5), gender(3),
The main aim of this corpus is to achieve a part of speech
Residual number(3)
tagger based on Support Vector Machines (SVM) and
Punctuation punctuation mark type(16)
Conditional Random Fields (CRF) because they have been
proved to give good results for sequence classification
Table2: A synopsis of the features of the Amazigh PoS (Kudo and Matsumoto, 2000, Lafferty et al. 2001). We are
tagset with their attributes and values planning to use freely available tools like Yamcha and

CRF++ toolkits4. introduction au domaine berbère, éditions du CNRS, 1984.
P 232-242.
6. Conclusion and future works
Cohen, D. (2007). Chamito-sémitiques (langues). In
In this paper, after a brief description about social and
linguistic characteristics of the Amazigh language, we have Encyclopædia Universalis.
addressed the basic principles we followed for tagging Iazzi, E., Outahajala,M. (2008), Amazigh Data Base. In
Amazigh written corpora with AnCoraPipe: the tagset used, proceedings of LREC 08.
the transliteration and the annotation tool. Kudo, T., Yuji Matsumoto, Y. (2000). Use of Support
In the future, it is our goal to tag more corpora to constitute Vector Learning for Chunk Identification.
a reference corpus for works on Amazigh NLP and we plan Lafferty, J. McCallum, A. Pereira, F. (2001). Conditional
also to work on Amazigh Base Phrase Chunking. Random Fields: Probabilistic Models for Segmenting and
Labeling Sequence Data. In proceedings of
Acknowledgments ICML-01 282-289
We would like to thank Manuel Bertran for improving the Outahajala, M., Zenkouar, L. (2005). La norme du tri, du
AnCora Pipe tool to support Amazigh features, all IRCAM clavier et Unicode. In proceedings of the workshop : la
researchers and Professor Iazzi El Mehdi from Ibn Zohr typographie entre les domaines de l'art et l'informatique,
University, Agadir for their explanations and precious help. pp. 223—238.
The work of the last two authors was carried out thanks to Saa, F. (2006). Les thèmes verbaux de l'Amazighe. In
AECID-PCI C/026728/09 and TIN2009-13391-C04-03/04 proceedings of the workshop: Structures morphologiques
research projects. de l’Amazighe, pp.102--111.
Zenkouar, L. (2004). L’écriture Amazighe Tifinaghe et
References Unicode, in Etudes et documents berbères. Paris (France).
n° 22, pp. 175—192.
Allauzen, A. Bonneau-Maynard, H. (2008).Training and
Zenkouar, L. (2008). Normes des technologies de
evaluation of POS taggers on the French MULTITAG
l’information pour l’ancrage de l’écriture Amazighe, in
corpus. In proceedings of LREC 08.
Etudes et documents berbères. Paris (France), n° 27, pp.
Ameur, M., Boujajar, A., Boukhris, F. Boukouss, A.,
159—172.
Boumaled, A., Elmedlaoui, M., Iazzi, E., Souifi, H.
(2006a), Initiation à la langue Amazighe. Publications de
l’IRCAM. pp. 45—77.
Ameur, M., Boujajar, A., Boukhris, F. Boukouss, A.,
Boumaled, A., Elmedlaoui, M., Iazzi, E. (2006b) Graphie
et orthographe de l’Amazighe. Publications de l’IRCAM.
Andries, P. (2004). La police open type Hapax berbère. In
proceedings of the workshop : la typographie entre les
domaines de l'art et l'informatique, pp. 183—196.
Bertran, M., Borrega, O., Recasens, M., Soriano, B. (2008).
AnCoraPipe: A tool for multilevel annotation.
Procesamiento del lenguaje Natural, nº 41. Madrid (Spain).
Boukhris, F. Boumalk, A. El moujahid, E., Souifi, H.
(2008). La nouvelle grammaire de l’Amazighe.
Publications de l’IRCAM.
Boukhris, F. (2006). Structure morphologique de la
préposition en Amazighe. In proceedings of the workshop:
Structures morphologiques de l’Amazighe. Publications de
l’IRCAM. pp. 46-56.
Boukouss, A. (1995). Société, langues et cultures au
Maroc: Enjeux symboliques, publications de la Faculté
des Lettres de Rabat.
Boumalk, A., Naït-Zerrad, K. (2009). Amawal n tjrrumt
-Vocabulaire grammatical. Publications de l’IRCAM.
Chafiq, M. (1991) . !"‫ز‬#$%‫ '& ا‬#(‫*)ن در‬+‫* وأر‬+‫أر‬. éd.
Arabo-africaines.
Chaker, S. (1989). Textes en linguistique berbère -

4
Freely downloadable from
http://chasen.org/~taku/software/YamCha/ and
http://crfpp.sourceforge.net/

Verb Morphology of Hebrew and Maltese — Towards an Open Source Type
Theoretical Resource Grammar in GF
Dana Dannélls∗ and John J. Camilleri†

Department of Swedish Language, University of Gothenburg
SE-405 30 Gothenburg, Sweden

Department of Intelligent Computer Systems, University of Malta
Msida MSD2080, Malta
dana.dannells@svenska.gu.se, jcam0003@um.edu.mt

Abstract
One of the first issues that a programmer must tackle when writing a complete computer program that processes natural language is
how to design the morphological component. A typical morphological component should cover three main aspects in a given language:
(1) the lexicon, i.e. how morphemes are encoded, (2) orthographic changes, and (3) morphotactic variations. This is in particular
challenging when dealing with Semitic languages because of their non-concatenative morphology called root and pattern morphology.
In this paper we describe the design of two morphological components for Hebrew and Maltese verbs in the context of the Grammatical
Framework (GF). The components are implemented as a part of larger grammars and are currently under development. We found that
although Hebrew and Maltese share some common characteristics in their morphology, it seems difficult to generalize morphosyntactic
rules across Semitic verbs when the focus is on lexicons motivated by computational linguistics. We describe and compare the verb
morphology of Hebrew and Maltese and motivate our implementation efforts towards complete open-source type-theoretical resource
grammars for Semitic languages. Future work will focus on semantic aspects of morphological processing.

1. Introduction 2. Verb morphology


One of the first issues that a programmer must tackle when Each of the Semitic languages has a set of verbal patterns,
writing a complete computer program that processes natu- which is a sequence of vowels (and possibly consonants)
ral language is how to design the morphological compo- into which root consonants are inserted. The root itself has
nent. A typical morphological component should cover no definite pronunciation until combined with a vocalic pat-
three main aspects in a given language: (1) the lexicon, i.e. tern, i.e. a template. The combination of morphological
how morphemes are encoded, (2) orthographic changes, units is non-linear, i.e. it relies on intertwining between
and (3) morphotactic variations. This is in particular chal- two independent morphemes (root and pattern).1
lenging when dealing with Semitic languages because of There are different ways in how templates modify the
their non-concatenative morphology called root and pattern root consonants: doubling the middle consonants, inserting
morphology (Goldberg, 1994). vowels between consonants, adding consonantal affixes,
etc. Inflectional morphology systems are constructed by at-
The Grammatical Framework (GF) is a grammar formalism
taching prefixes and suffixes to lexemes. Verb lexemes are
for multilingual grammars and their applications (Ranta,
inflected for person, number, gender and tense. Common
2004). It has a Resource Grammar Library (Ranta, 2009)
tenses of Semitic languages are: present, perfect, imper-
that is a set of parallel natural language grammars that
fect, and imperative.2
can be used as a resource for various language processing
tasks. Currently, the only Semitic morphological compo- 2.1. Modern Hebrew
nent included in the library is for Arabic (Dada and Ranta,
Hebrew has seven verb pattern groups (binyanim) that are
2007). To increase the coverage of Semitic languages we
associated with a fixed morphological form, e.g. pa’al:
decided to develop two additional resource grammars for
C1aC2aC3, nif’al: niC1C2aC3, pi’el:C1iC2eC3. There are
Hebrew and Maltese. The availability of several languages
two major root classifications: regular (strong) and irreg-
belonging to the same language family in one framework
ular (weak). In the same manner that each verb belongs
fosters the development of common language modules
to a particular binyan, it also belongs to a particular group
where grammatical rules across languages are generalised.
of verbs (Hebrew gzarot) that classify them by their root
Thus, increasing the potential of yielding interesting in-
composition (for an extensive information about the He-
sights highlighting similarities and differences across lan-
brew root and pattern system see Arad (2005)). For reg-
guages. These kind of modules already exist in GF for Ro-
ular verbs, all root consonants are present in all the verb
mance and Scandinavian languages.
forms, there are fixed rules that distinguish how verbs are
In this paper we describe our implementations of Hebrew
and Maltese verb morphologies in the context of GF. We 1
Linguists consider the root to be a morpheme despite the fact
present how two of the three morphological aspects men- that it is not a continuous element in the word, and it is not pro-
tioned above are accounted departing from the similarities nounceable (McCarthy, 1979; McCarthy, 1981).
and differences of verb formation in each of the two lan- 2
In Semitic languages, the past tense is referred by the term
guages. perfect and the future tense by imperfect.
conjugated depending on their guttural root letters, i.e. A, cals instead of 3). A root bears a semantic meaning that
h, H, O.3 An example of two verbs that are conjugated dif- is converted into passive, active, reflexive forms depending
ferently in pa’al future tense because in one of the verbs on the pattern it belongs, e.g. h̄-r-ġ ‘out’, h̄riġna ‘we went
the root’s second guttural letter is h are: Agmwr (g.m.r, out’.
‘will finish’) with w stem vowel, and Anhag (n.h.g, ‘will Conjugations have predictable patterns and it is possible to
drive’) with a stem vowel. Irregular verbs are verbs where predict the patterns and the entire conjugation tables from
one or more of the root consonants are either missing or a given verb form (Aquilina, 1960; Aquilina, 1962). This
altered, which causes some deviation from a fully regu- may motivate the choice of representing lexemes in the lex-
lar conjugation. These verbs can be classified into three icon (Ussishkin and Twist, 2007).
main groups, each of which contains three to five sub-
groups (Coffin and Bolozky, 2005). Roots are conjugated 3. The Grammatical Framework (GF)
differently depending on the root classification group they
belong to, e.g. ywred (y.r.d, pa’al, group2_py, ‘he goes The Grammatical Framework is a functional grammar for-
down’), yasen (y.s.n, pa’al, group3_py, ‘he sleeps’). The malism based on Martin-Löf’s type-theory (Martin-Löf,
sub-groups of irregular verbs contain large root composi- 1975) implemented in Haskell.
tion variations which depend on the different occurrences GF has three main module types: abstract, concrete, and re-
of the root’s consonants; phonological changes contribute source. Abstract and concrete modules are top-level in the
to these irregularities in verbs forms and inflections. sense that they appear in grammars that are used at runtime
The Hebrew binyanim are associated with a semantic trait. for parsing and generation. One abstract grammar can have
This leads to a certain complexity when designing a mor- several corresponding concrete grammars; a concrete gram-
phological component. What leads to this complexity is mar specifies how the abstract grammar rules should be lin-
the fact that some morphemes (roots) are combined with earized in a compositional manner. A resource grammar is
more than one pattern, resulting in ambiguity problem. On intended to define common parts of the concrete syntax in
the other hand not all roots are realised in all patterns. To application grammars. It contains linguistic operations and
avoid inefficient parsing that generates too many results, parameters that are used to produce different forms and can
that in turn introduces new difficulties in identifying the be used as inherent features.
root’s consonants and in resolving ambiguities, it is nec- GF has a Resource Grammar Library, i.e. a set of parallel
essary to employ semantic markings in the lexicon. grammars that are built upon one abstract syntax. The GF’s
library, containing grammar rules for seventeen languages,4
2.2. Maltese plays the role of a standard software library (Ranta, 2009).
Maltese verb pattern groups (themes) are a subset Classi- It is designed to gather and encapsulate morphological and
cal Arabic pattern groups. These patterns involve affixa- syntactic rules of languages, which normally require expert
tion and prefixation, for example, niżżel (theme II, ‘bring knowledge, and make them available for non-expert appli-
down’), tniżżel (theme V, ‘be brought down’). Verbs in cation programmers by defining a complete set of morpho-
theme I must be specified as undergoing a vowel change logical paradigms and a syntax for each language.
which is always a → o or e → o. Theme II is defined
by double middle radical, the vowel possibilities are fixed. 4. The grammar design
Most of the Semitic Maltese verbal themes exhibit the same
In this section we present how the verb morphologies of
properties that can be seen in theme I and II.
Modern Hebrew and Maltese are implemented in GF. The
The vowels of the Semitic Maltese verb templates, unlike
presented code fragments do not cover all aspect of the
those of Classical Arabic, do not have a fixed vowel pattern,
verb, such as passive/active mood, Hebrew infinitive form,
rather a vast range of vowel patterns. Each template allows
Hebrew verbs with obligatory prepositions, English Mal-
several different vowel patterns determined by the tense and
tese, etc. However, the code provides a glimpse of the
person of conjugation. For example, the root’s template
two computational resources that are being developed. The
h̄-d-m, under perfect 3rd person singular, takes the pattern
presented code contains parameters, operations and lexicon
a-a (h̄adem), whilst the past participle takes the pattern i-a
linearizations which are defined according to GF’s concrete
(h̄idma).
and resource syntaxes. Parameters are defined to deal with
Each verb stem has two vowels and there are seven different
agreement, operations are functions that form inflection ta-
verb types. Since a very large number of Maltese verbs are
bles, linearizations are string realisations of functions that
borrowed from Romance (Sicilian and Italian) and English,
are defined in the abstract syntax.
the productive verbal morphology is mainly affixal with a
concatenative nature (Hoberman and Aronoff, 2003). The 4.1. Common parameters
synchronic, productive processes of verb derivation, has re-
sulted in three distinctive verb morphology features that are Both languages share the same parameter types and at-
often referred in terms of: Semitic Maltese, Romance Mal- tributes for verbs, including: number (Singular, Plural),
tese, and English Maltese (Mifsud, 1995).
4
Roots can be classified into one of five groups: strong, The Resource Grammar Library currently (2010) contains
weak, defective, hollow, double and quadriliteral (4 radi- the 17 languages: Arabic (complete morphology), Bulgarian,
Catalan, Danish, Dutch, English, Finnish, French, German, Ital-
3
Throughout the paper we regulate the encoding of Hebrew ian, Norwegian (bokmål), Polish, Romanian, Russian, Spanish,
characters using ISO-8859-8. Swedish and Urdu.
gender (Masculine, Feminine), case (Nominative, Ac- Patterns
cusative, Genitive), person (first, second, third), voice (Ac- Root patterns are defined in a separate resource. Patterns
tive, Passive) and tense (Perfect, Participle, Imperfect). specify consonant slots and morphological forms, some ex-
These types have the following definitions in GF syntax: amples are:
Number = Sg | Pl ; C1aC2aC3ty = {C1=""; C1C2=""; C2C3=""; C3="ty"};
Gender = Masc | Fem ; C1aC2aC3nw = {C1=""; C1C2=""; C2C3=""; C3="nw"};
Case = Nom | Acc | Gen ; C1aC2aC3th = {C1=""; C1C2=""; C2C3=""; C3="th"};
Person = P1 | P2 | P3 ;
Voice = Active | Passive ;
Lexicon
Tense = Perf | Part | Imperf ;
Lexicon entries are functions that are defined in the ab-
stract syntax. Below is an example of how the three verb
4.2. Modern Hebrew entries: write_V2, pray_V and sleep_V, are linearized in
the Hebrew lexicon. The lexicon generates verb paradigms
An additional parameter VPersonNumGen provides a de-
through their binyanim, using the Hebrew operations.
tailed description about how verbs are inflected. The pa-
rameter’s attributes indicate: first person singular/plural, write_V2 = mkVPaal "ktb" ;
pray_V = mkVHitpael "pll" ;
second and third person singular/plural and gender.
sleep_V = mkVPaalGroup3_py "ysn";
VPerNumGen = Vp1Sg | Vp1Pl | Vp2Sg Gender
| Vp2Pl Gender | Vp3Sg Gender
4.3. Maltese
| Vp3Pl Gender ;
There are additional parameters defined for Maltese, these
include: VerbType (Strong, Defective, Weak, Hollow, Dou-
Operations ble), VOrigin (Semitic, Romance), VForm (for possible
The Hebrew operations include: Pattern, i.e. a string con- tenses, persons and numbers).
sisting of a four position pattern slot, Root, i.e. a string VType = Strong | Defective | Weak | Hollow |
consisting of either three or four (Root4) consonants. The Double ;
Hebrew Verb is defined as a string that is inflected for tense VOrigin = Semitic | Romance ;
person, number and gender. The mkVPaal operation defines VForm = VPerf PerGenNum | VImpf PerGenNum |
regular verb paradigms for each tense and agreement fea- VImp Number ;
tures. The operation getRoot associates every consonant in
the input string v with a variable. This is accomplished by Operations
the operation C@? which binds each consonant in the string The operations for Maltese include: Pattern, i.e. a string
s to a variable, e.g. C1 and C2. These variables are than consisting of two vowels, Root, i.e. a string consisting of
coded into patterns using the operation appPattern which four consonants of which one can be eliminated. The Mal-
specifies how the root’s consonants should be inserted into tese Verb is defined as a string inflected for tense, person,
a pattern, given a root and a pattern. gender and number, that has the parameter values: Verb-
Type and VerbOrigin. The mkVerb operation utilizes ad-
Pattern : Type={C1, C1C2, C2C3, C3 : Str};
Root : Type={C1,C2,C3 : Str};
ditional operations such as classifyVerb, mkDefective, mk-
Root4 : Type=Root ** {C4 : Str}; Strong etc. to identify the correct verb. The operation clas-
Verb : Type={s : Tense ⇒ VPerNumGen ⇒ Str }; sifyVerb takes a verb string and returns its root, pattern, and
verb type, i.e. Strong, Defective, Quad etc. The operation
mkVPaal : Str → Verb = \v → v1@#Vowel matches the pattern Vowel and binds the vari-
let root = getRoot v
in {s = table {
able v1 to it. It is based on pattern matching of vowels.
Perf ⇒ table { Pattern : Type = {v1, v2 : Str} ;
Vp1Sg ⇒ appPattern root C1aC2aC3ty ; Root : Type = {K, T, B, L : Str} ;
Vp1Pl ⇒ appPattern root C1aC2aC3nw ; Verb : Type = {s : VForm ⇒ Str ; t : VType ; o :
Vp2Sg Masc ⇒ appPattern root C1aC2aC3th ; VOrigin} ;
Vp2Sg Fem ⇒ appPattern root
C1aC2aC3t ; mkVerb : Str → Verb = \mamma →
Vp2Pl Masc ⇒ appPattern root C1aC2aC3tM ; let
... } class = classifyVerb mamma
Imperf ⇒ table { . . . } in
} case class.t of {
}; Strong ⇒ mkStrong class.r class.p ;
Defective ⇒ mkDefective class.r class.p ;
getRoot : Str → Root = \s → case s of { Quad ⇒ mkQuad class.r class.p ;
C1@? + C2@? + C3 => ...
{C1 = C1 ; C2 = C2 ; C3 = C3} } ;
};
classifyVerb : Str → { t:VType ; r:Root ;
appPattern : Root → Pattern → Str = \r,p → p:Pattern } = \mamma → case mamma of {
p.C1 + r.C1 + p.C1C2 + r.C2 + p.C2C3 + r.C3 + K@#Consonant + v1@#Vowel
p.C3 ; + T@#Consonant + B@#Consonant
+ v2@#Vowel + L@#Consonant ⇒

{ t=Quad ; r={ K=K ; T=T ; B=B ; L=L } ; 6. Discussion and related work
p={ v1=v1 ; v2=v2 } } ;
} Although there are already some morphological analyzers
available for Hebrew (Itai and Wintner, 2008; Yona and
Wintner., 2008) and data resources available for Maltese
Lexicon
(Rosner et al., 1999), they are not directly usable within
In this example, functions are linearized by using two dif-
the Grammatical Framework. To exploit the advantages of-
ferent operations defined for: regular inflection of verbs
fered by GF, the language’s grammar must be implemented
(used in write_V2), where the verb is given in perfect tense,
in this formalism. One of the advantages of implementing
third person, singular, masculine and irregular inflection of
Semitic non-concatenative morphology in a typed language
verbs (used in pray_V), where two additional strings are
such as GF compared with other finite state languages is
given, namely the imperative singular and the imperative
that strings are formed by records, and not through con-
plural forms of the verb.
catenation. Moreover, once the core grammar is defined
write_V2 = mkVerb "kiteb" ; and the structure and the form of the lexicon is determined,
pray_V = mkVerb "talab" "itlob" "itolbu";
it is possible to automatically acquire lexical entries from
exiting lexical resources. In the context of GF, three wide-
4.4. Inflection paradigm coverage lexicons have been acquired automatically: Bul-
An example of the output produced by GF for the verb garian (Angelov, 2008b), Finnish (Tutkimuskeskus, 2006)
‘write’ is illustrated in Table 1. and Swedish (Angelov, 2008a).
In this work, the design decisions taken by the program-
Hebrew Maltese mers are based on different points of arguments concerning
mkVPaal “ktb” mkVerb “kiteb” the division of labour between a linguistically trained gram-
Perfect marian and a lexicographer. The Maltese implementation
Vp1Sg ⇒ “ktbty” (Per1 Sg) ⇒ “ktibt” consider stems in the lexicon rather than patterns and roots,
Vp1Pl ⇒ “ktbnw” (Per1 Pl) ⇒“ktibna” cf. Rosner et al. (1998); in the framework of GF, classes of
Vp2SgMasc ⇒ “ktbt” inflectional phenomena are given an abstract representation
(Per2 Sg) ⇒“ktibt”
Vp2SgFem ⇒ “ktbt” that interact with the root and pattern system. In Hebrew,
Vp2PlMasc ⇒ “ktbtM” recognizing prefixes and suffixes is not always sufficient for
(Per2 Pl) ⇒“ktibtu”
Vp2PlFem ⇒ “ktbtN” recognizing the root of the verb. Although root recognition
Vp3SgMasc ⇒ “ktb” (Per3Sg Masc) ⇒ “kiteb” is mandatory for generating the verb’s complete conjuga-
Vp3SgFem ⇒ “ktbh” (Per3Sg Fem) ⇒ “kitbet” tion table, changes in patterns and the absence of root let-
Vp3PlMasc ⇒ “ktbw” ters in different lexemes make it increasingly hard to infer
Per3Pl ⇒ “kitbu”
Vp3PlFem ⇒ “ktbw” the root (Deutsch and Frost, 2002) which requires a large
Imperfect amount of tri-consonantal constraints. This is in particular
Vp1Sg ⇒ “Aktwb” (Per1 Sg) ⇒ “nikteb” true for lexemes derived from weak roots where one of the
Vp1Pl ⇒ “nktwb” (Per1 Pl) ⇒ “niktbu” root consonants is often missing (Frost et al., 2000). To
Vp2SgMasc ⇒ “tktwb” avoid a large amount of morphosyntactic rules, we choose
(Per2 Sg) ⇒ “tikteb”
Vp2SgFem ⇒ “tktby” to employ semantic markings in the lexicon by specifying
Vp2PlMasc ⇒ “tktbw” roots and patterns instead of lexemes; this computationally
(Per2 Pl) ⇒ “tiktbu”
Vp2PlFem ⇒ “tktbw” motivated approach becomes plausible since the meaning
Vp3SgMasc ⇒ “yktwb” (Per3Sg Masc) ⇒ “jikteb” of the lexeme is already known.
Vp3SgFem ⇒ “tktwb” (Per3Sg Fem) ⇒ “tikteb”
Vp3PlMasc ⇒ “yktbw” 7. Conclusions and Future Work
Per3Pl ⇒ “jiktbu”
Vp3PlFem ⇒ “yktbw”
In this paper we have presented implementations of He-
Table 1: Example of Hebrew and Maltese verb inflection brew and Maltese components that tend to convey the non-
tables of the verb ‘write’. concatenative morphology of their verbs. Although we
could identify common characteristics among these two
Semitic languages, we found it difficult to generalize mor-
phosyntactic rules across Semitic verbs when the focus is
5. State of the work towards a computational motivated lexicon.
The core syntax implemented for the two languages has When designing a computer system that can process several
around 13 categories and 22 construction functions. It languages automatically it is useful to generalize as many
covers simple syntactic constructions including predication morphosyntactic rules across languages that belong to the
rules which are built from noun and verb phrases. same language group. One fundamental question that rises
The lexicons were manually populated with a small number from our implementations is to what extent we can general-
of lexical units, covering around 20 verbs and 10 nouns in ize the concrete syntaxes of Semitic languages. One way to
each language. The Maltese verb morphology covers the approach this question is by employing semantic markings
root groups: strong, defective and quadriliteral. In Hebrew, in the lexicons of the Semitic languages and focus on se-
the strong verb paradigms and five weak verb paradigms in mantic aspects of morphological processing. This remains
binyan pa’al are covered. for future work.
8. References ceedings of the Workshop on Computational Approaches
Krasimir Angelov. 2008a. Importing SALDO in GF. to Semitic Languages, pages 97–101, Morristown, NJ,
http://spraakbanken.gu.se/personal/ USA. Association for Computational Linguistics.
lars/kurs/lgres08/tpaper/angelov.pdf. M. Rosner, J. Caruana, and R. Fabri. 1999. Linguistic and
Krasimir Angelov. 2008b. Type-theoretical Bulgarian computational aspects of maltilex. In Arabic Translation
grammar. In In B. Nordström and eds. A. Ranta, edi- and Localisation Symposium: Proceedings of the Work-
tors, Advances in Natural Language Processing (GoTAL shop, pages 2—-10, Tunis.
2008), volume 5221 of LNCS/LNAI, pages 52—-64. Kotimaisten Kielten Tutkimuskeskus. 2006. KOTUS
Joseph Aquilina. 1960. The structure of Maltese. Valletta: wordlist. http://kaino.kotus.fi/sanat/
Royal University of Malta. nykysuomi.
Joseph Aquilina. 1962. Papers in Maltese linguistics. Adam Ussishkin and Alina Twist. 2007. Lexical access in
Royal University, Malta. Maltese using visual and auditory lexical decision. In
Conference of Maltese Linguistics, L-Ghaqda Internaz-
Maya Arad. 2005. Roots and patterns: Hebrew morpho-
zjonali tal-Lingwistika Maltija, University of Bremen,
syntax. Springer, The Netherlands.
Germany, October.
Edna Amir Coffin and Shmuel Bolozky. 2005. A Refer-
Shlomo Yona and Shuly Wintner. 2008. A finite-state mor-
ence Grammar of Modern Hebrew. Cambridge Univer-
phological grammar of Hebrew. Natural Language En-
sity Press.
gineering, 14(2):173–190, April.
Ali Dada and Aarne Ranta. 2007. Implementing an open
source arabic resource grammar in GF. In M. Mughazy,
editor, Perspectives on Arabic Linguistics., Papers from
the Twentieth Annual Symposium on Arabic Linguistics.
John Benjamins Publishing Company, March 26.
Avital Deutsch and Ram Frost, 2002. Lexical organization
and lexical access in a non-concatenated morphology,
chapter 9. John Benjamins.
R. Frost, A. Deutsch, and K. I. Forster. 2000. Decompos-
ing morphologically complex words in a nonlinear mor-
phology. Journal of Experimental Psychology: Learn-
ing, Memory and Cognition, 26(3):751–765.
Gideon Goldberg. 1994. Principles of Semitic word-
structure. In G. Goldberg and S. Raz, editors, In
Semantic and ushitic studies, pages 29–64. Wies-
baden:Harrassowitz.
Robert D. Hoberman and M. Aronoff, 2003. The verbal
morphology of Maltese: From Semitic to Romance, chap-
ter 3, pages 61–78. Amsterdam: John Benjamins.
Alon Itai and Shuly Wintner. 2008. Language resources
for Hebrew. Language Resources and Evaluation,
42(1):75–98, March.
P. Martin-Löf. 1975. An intuitionistic theory of types:
Predicative part. In H. E. Rose and J. C. Shepherdson,
editors, Proc. of Logic Colloquium ’73, Bristol, UK, vol-
ume 80, pages 73–118. North-Holland.
John J. McCarthy. 1979. On stress and syllabification. Lin-
guistic Inquiry, 10:443–465.
John J. McCarthy. 1981. The representation of consonant
length in Hebrew. Linguistic Inquiry, 12:322–327.
Manwel Mifsud. 1995. Loan verbs in Maltese: a descrip-
tive and comparative study. Leiden, New York, USA.
Aarne Ranta. 2004. Grammatical framework, a type-
theoretical grammar formalism. Journal of Functional
Programming, 14(2):145–189.
Aarne Ranta. 2009. The GF resource gram-
mar library. The on-line journal Linguistics
in Language Technology (LiLT), 2(2). http:
//elanguage.net/journals/index.php/
lilt/article/viewFile/214/158.
M. Rosner, J. Caruana R., and Fabri. 1998. Maltilex: a
computational lexicon for Maltese. In Semitic ’98: Pro-
Syllable Based Transcription of English Words into
Perso-Arabic Writing System

Jalal Maleki
Dept. of Computer and Information Science
Linkping University
SE-581 83 Linkping
Sweden
Email: jma@ida.liu.se

Abstract

This paper presents a rule-based method for transcription of English words into the Perso-Arabic orthography. The method relies on the phonetic representation of English words such as the CMU pronunciation dictionary. Some of the challenging problems are the context-based vowel representation in the Perso-Arabic writing system and the mismatch between the syllabic structures of English and Persian. With some minor extensions, the method can be applied to English to Arabic transliteration as well.

1 Introduction

During the translation process from English to Persian, certain words (usually names and trademarks) are transcribed rather than translated. This is a general issue in machine translation between language pairs. Unfortunately, there are no guidelines as to how these words should be written in the Perso-Arabic Script (PA-Script), and some words are written in more than 10 different ways ([9]). This paper introduces a rule-based method for English to PA-Script transcription which is based on the syllable structure of words. Syllables are important since transcription of vowels is mainly determined by the structure of the syllable in which the vowel appears. Given an English word, we use a syllabified version of the CMU pronunciation dictionary (CMUPD) to look up its pronunciation and use it for generating a phonemic romanized Persian transcription of the word, which is finally resyllabified and transcribed into the Perso-Arabic Script (PA-Script) according to the syllabification-based method described in [11]. The romanization scheme we use is the Dabire romanization described in [10]. Since Arabic and Persian essentially use the same script and have the same syllabic structure, our method can easily be extended to the Arabic script.

2 Phonological Issues

The essence of our method is phonological mapping between English and Persian and is defined as phonemic mapping of consonants and vowels and resyllabification of the source word using Persian syllable constraints. Just like transliteration between Arabic and English ([2]), transcription between English and Persian is a difficult task. However, although the mapping between the sounds of Persian and English consonants and vowels is non-trivial, the most complicated step is conversion of Persian vowels to PA-Script [11].

2.1 Consonants

Mapping English consonants into Persian phonology is imperfect but straightforward, and it can be summarized as a lookup operation. The mapping is however not perfect, and in many cases a consonant is mapped into a Persian consonant that only approximately reflects its original pronunciation. For example, /th/ in ’thanks’ (/TH, AE1, NG, K, S/) is transcribed to /t/, whereas the /th/ of ’that’ (/DH, AE1, T/) is transcribed to Persian /d/.
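Because the consonant step is essentially a table lookup, it can be sketched with a small dictionary. The mapping values below are illustrative approximations that I am adding for clarity; apart from the /TH/ → t and /DH/ → d pairs mentioned above, they are not taken from the paper's own tables.

    # Partial, illustrative ARPAbet-consonant -> romanized Persian (Dabire) mapping.
    # Values are approximations for illustration, not the paper's actual table.
    CONSONANT_MAP = {
        "B": "b", "P": "p", "T": "t", "D": "d", "K": "k", "G": "g",
        "F": "f", "V": "v", "S": "s", "Z": "z", "SH": "š", "ZH": "j",
        "M": "m", "N": "n", "L": "l", "R": "r", "HH": "h", "Y": "y",
        "TH": "t",   # 'thanks' -> t, as in the example above
        "DH": "d",   # 'that'   -> d, as in the example above
    }

    def map_consonant(arpabet_symbol: str) -> str:
        """Return an approximate Persian consonant for an ARPAbet consonant."""
        return CONSONANT_MAP[arpabet_symbol]

    assert map_consonant("TH") == "t"
    assert map_consonant("DH") == "d"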
2.2 Vowels

From a transcription point of view, vowel correspondence between Persian and English phonology is also imperfect and relatively simple. Some examples are shown in Table-1. Some English diphthongs are treated as two separate vowels whereas some others are interpreted as a single vowel.
Phonological mapping is followed by conversion of phonemic romanized Persian to PA-Script. The type of syllable containing a vowel and the characteristics of the neighboring graphemes determine the choice of grapheme (or allographs) for the vowel. As an example, Table-2 shows the various letters and digraphs used for writing the vowel /i/ in different contexts [11].

2.3 Syllable Constraints and Consonant Clusters

Syllable structure in Persian is restricted to (C)V(C)(C), whereas English allows the more complex structure (C)(C)(C)V(C)(C)(C)(C).
One of the main problems in writing English words in PA-Script is the transformation of syllables. For example, the word ’question’, represented as /K, W, EH1, S, CH, AH0, N/ in CMUPD with the syllables /K, W, EH1, S/ and /CH, AH0, N/, is transcribed to kuesšen one syllable at a time and finally resyllabified as ku-es-šen and transliterated to PA-Script. Resyllabification is necessary since consonant clusters are broken by vowel epenthesis.
In general, the Persian transcription of English words involves short vowel insertion into consonant clusters and resyllabification (see Table-3 for examples).

3 The Implementation

Transcription of an English word w into PA-Script involves a number of steps which are briefly discussed below.

1. w is looked up in the syllabified CMUPD dictionary [4] and its syllabified pronunciation p(w) is retrieved. For example, given the word ’surgical’, we get: ((S ER1) (JH IH0) (K AH0 L)).

2. Syllables of p(w) are transcribed to Dabire, which is a phonemic orthography for Persian. For ’surgical’, we get ((s e r) (g i) (kâl)).

3. The syllables are individually modified to fulfill the constraints of Persian syllable structures. For example, spring (CCCVCC) is transformed to espering (VCCVCVCC) using e epenthesis, and prompt (CCVCCC) is transformed to perompet (CVCVCCVC). See Table-3 for more examples.

4. The resulting Dabire word is resyllabified. For example, espering is syllabified as es.pe.ring.

5. Context-dependent replace rules [3] are applied to enforce the orthographical conventions of Persian [5, 13, 1].

6. Finally, the Dabire word is transliterated to Perso-Arabic Unicode.

Steps 1-3 are currently implemented in Lisp and steps 4-6 are implemented as transducers in XFST [3].
The syllabification step (4), which is one of the main modules of the system, is explained further. The syllabification transducer works from left to right on the input string and ensures that the number of consonants in the onset is maximized. Given the syllabic structure of Persian, this essentially means that if a vowel, V, is preceded by a consonant, C, then CV initiates a syllable. For example, for a word such as jârue, the syllabification jâ.ru.e (CV.CV.V) is selected and jâr.u.e (CVC.V.V) is rejected. The correct syllabification naturally leads to correct writing since, as mentioned earlier, vowels are written differently depending on their position in the syllable.
The following XFST definitions form the core of the syllabification [11]:

define Sy V|VC|VCC|CV|CVC|CVCC;

define Sfy C* V C* @-> ... "." || _ Sy;

The first statement defines a language (Sy) containing all syllables of Dabire. V, VC, etc. are defined as regular languages that represent well-formed syllables in Dabire. For example, CVCC is defined as

define CVCC [C V C C] .o. ~$NotAllowed;

which defines the language containing all possible CVCC syllables and excluding the untolerated consonant clusters in NotAllowed such as bp, kq, and cc.
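For readers less familiar with XFST, the same segmentation behaviour can be mimicked procedurally. The Python sketch below is my own illustration, not the paper's transducer: it inserts a boundary so that every vowel claims at most one onset consonant, which is what onset maximization amounts to under the Persian (C)V(C)(C) syllable shape. The vowel inventory used is a simplification.

    # Illustrative syllabifier for romanized (Dabire-like) strings.
    # Simplified vowel inventory; the real system works over a richer symbol set.
    VOWELS = set("aeiouâ")

    def syllabify(word: str) -> str:
        """Insert '.' so that each vowel takes at most one onset consonant,
        matching the Persian (C)V(C)(C) syllable structure."""
        boundaries = []
        for i, ch in enumerate(word):
            if ch not in VOWELS or i == 0:
                continue
            # Start the syllable at the onset consonant if there is one,
            # otherwise directly at the vowel.
            start = i - 1 if word[i - 1] not in VOWELS else i
            if start > 0:
                boundaries.append(start)
        pieces, prev = [], 0
        for b in boundaries:
            pieces.append(word[prev:b])
            prev = b
        pieces.append(word[prev:])
        return ".".join(pieces)

    print(syllabify("jârue"))     # jâ.ru.e
    print(syllabify("espering"))  # es.pe.ring
    print(syllabify("kuesšen"))   # ku.es.šen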

Vowel   Example Word   Phonemes                  Persian Phoneme   Romanized Persian   Perso-Arabic
AA      odd            AA D                      â                 âd                  X@
AE      at             AE T                      a                 at                  H@
AH      hut            HH AH T                   â                 hât                 HAë
AO      ought          AO T                      o                 ot                  Hð@
AW      cow            K AW                      â                 kâv                 ðA¿
AY      hide           HH AY D                   ây                hâyd                YK Aë
Table 1. Some Vowels from CMU Pronunciation Dictionary with Examples

The second statement defines a replacement rule [3] that represents the syllabification process. The operator @> ensures that the shortest possible strings (of the form C* V C*) are selected in left to right direction and identified as syllables, which are separated by a dot.
Table-4 includes examples that illustrate the input/output of this step.

4 Discussion and Evaluation

We have introduced a rule-based transcription of English to PA-Script. Earlier work [2, 8, 6, 7] mainly relies on statistical methods.
Our method produces correct transcriptions for most of the data-set randomly selected from CMUPD. Quantitative evaluation of the method is in progress. The performance of the system is dependent on the availability of syllabified English words, and future improvements would require use of statistical methods for automatically handling words that do not exist in the dictionary. Some early experiments [14] based on CMUPD show a success rate of 71.6% in automatic grapheme to phoneme conversion of English words not present in CMUPD. Further development would also require integration of automatic syllabification of English [12] into the system.

References

[1] M. S. Adib-Soltâni. An Introduction to Persian Orthography (in Persian). Amir Kabir Publishing House, Tehrân, 2000.
[2] Y. Al-Onaizan and K. Knight. Machine transliteration of names in Arabic text. In ACL Workshop on Computational Approaches to Semitic Languages, 2002.
[3] K. R. Beesley and L. Karttunen. Finite State Morphology. CSLI Publications, 2003.
[4] Carnegie Mellon University. CMU pronunciation dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 2008.
[5] Farhangestan. Dastur e Khatt e Farsi (Persian Orthography), volume Supplement No. 7. Persian Academy, Tehran, 2003.
[6] J. Johanson. Transcription of names written in Farsi into English. In A. Farghaly and K. Megerdoomian, editors, Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, pages 74–80, 2007.
[7] S. Karimi, A. Turpin, and F. Scholer. English to Persian transliteration. In Lecture Notes in Computer Science, volume 4209, pages 255–266. Springer, 2006.
[8] M. M. Kashani, F. Popowich, and A. Sarkar. Automatic transliteration of proper nouns from Arabic to English. In A. Farghaly and K. Megerdoomian, editors, Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, pages 81–87, 2007.
[9] R. R. Z. Malek. Qavâed e Emlâ ye Fârsi. Golâb, 2001.
[10] J. Maleki. A Romanized Transcription for Persian. In Proceedings of the Natural Language Processing Track (INFOS2008), Cairo, 2008.
[11] J. Maleki and L. Ahrenberg. Converting Romanized Persian to the Arabic Writing System Using Syllabification. In Proceedings of LREC 2008, Marrakech, 2008.
[12] Y. Marchand, C. R. Adsett, and R. I. Damper. Automatic Syllabification in English: A Comparison of Different Algorithms. Language and Speech, 52(1):1–27, 2009.
[13] S. Neysari. A Study on Persian Orthography (in Persian). Sâzmân e Câp o Entešârât, 1996.
[14] S. Stymne. Private communication. Linköping, 2010.

/i/ in syllable   Word Initial   Segment Initial   Segment Medial   Segment Final   Intra-Word Isolated
V, VC, VCC        (Perso-Arabic graphemes and example words, in Arabic script)
CVC, CVCC         (Perso-Arabic graphemes and example words, in Arabic script)
CV                (Perso-Arabic graphemes and example words, in Arabic script)
Table 2. Mapping /i/ to P-Script Graphemes

English   Onset/Coda   Transcription   Example              Clusters

/šr/      Onset        /šer/           shrink→šerink        /šr/
/sC1/     Onset        /esC1/          school→eskul         /sp, st, sk, sm, sn, sl/
/šC2/     Onset        /ešC2/          schmock→ešmâk        /šp, št, šk, šm, šn, šl/
/C3C1/    Onset        /C3eC1/         trunk→terânk         /pr, pl, bl, br, .../
/sCw/     Onset        /esCu/          squash→eskuâš        /skw/
/sCy/     Onset        /esCiy/         student→estiyudent   /spy, sty/
/sCC1/    Onset        /esCeC1/        spring→espering      /spl, spr, str, skr/
/C1Cs/    Coda         /C1Ces/         corps→korpes         /lps, rps, rts, rks/
/CCCC/    Coda         /CCeCeC/        prompts→perâmpetes

Table 3. Epenthesis in consonant cluster transcription. C1 stands for all consonants except /w/ and /y/. C2 stands for all consonants except /w/, /y/ and /r/. C3 stands for all consonants except /s/ and /š/.
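A few of the onset patterns in Table-3 can be approximated with ordered rewrite rules. The sketch below is an illustration only (the system's step 3 is implemented in Lisp, not shown here); it covers the /sCC1/, /sC1/ and generic two-consonant onsets, with character classes loosely following the caption's definitions of C and C1.

    import re

    # Assumed character classes (simplified): C = any consonant, C1 = any
    # consonant except /w/ and /y/, following the caption of Table-3.
    C  = "[^aeiouâ]"
    C1 = "[^aeiouâwy]"

    ONSET_RULES = [
        (re.compile(rf"^s({C})({C1})"), r"es\1e\2"),  # /sCC1/ -> /esCeC1/  spring -> espering
        (re.compile(rf"^s({C1})"),      r"es\1"),      # /sC1/  -> /esC1/    skul   -> eskul
        (re.compile(rf"^({C})({C1})"),  r"\1e\2"),     # /C C1/ -> /CeC1/    trânk  -> terânk
    ]

    def repair_onset(word: str) -> str:
        """Apply the first matching epenthesis rule to a word-initial cluster."""
        for pattern, repl in ONSET_RULES:
            if pattern.match(word):
                return pattern.sub(repl, word, count=1)
        return word

    print(repair_onset("spring"))  # espering
    print(repair_onset("skul"))    # eskul
    print(repair_onset("trânk"))   # terânk

Rule order matters here: the s-initial patterns must be tried before the generic two-consonant pattern, mirroring the way the table distinguishes its cluster classes.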

English Word   CMU Pronunciation             Dabire Romanization   Syllabification   PA-Script
GEORGE         JH AO1 R JH                   jorj                  jorj              h. Pñk.
BUSH           B UH1 SH                      buš                   buš               €ñK.
BIOGEN         B AY1 OW0 JH EH2 N            bâyojen               bâ.yo.jen         ák. ñK AK.
LOUISE         L UW0 IY1 Z                   luiz                  lu.iz             QKñË
LOUISIANA      L UW0 IY2 Z IY0 AE1 N AH0     luizianâ              lu.i.zi.a.nâ      AJK QKñË
INDOSUEZ       IH1 N D OW0 S UW0 EY1 Z       indosuez              in.do.su.ez       QKñƒ ðYJK @
SPRITE         S P R AY1 T                   esperâyt              es.pe.râyt        IK @Qƒ@

Table 4. Examples showing some of the steps in the transliteration

COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Altantawy, Yassine Benajiba
Center for Computational Learning Systems
475 Riverside Drive, Suite 850
New York, NY 10115
Columbia University
{mdiab,habash,rambow,mtantawy,ybenajiba}@ccls.columbia.edu

Abstract
In this paper, we describe COLABA, a large effort to create resources and processing tools for Dialectal Arabic Blogs. We describe
the objectives of the project, the process flow and the interaction between the different components. We briefly describe the manual
annotation effort and the resources created. Finally, we sketch how these resources and tools are put together to create DIRA, a term-
expansion tool for information retrieval over dialectal Arabic collections using Modern Standard Arabic queries.

1. Introduction

The Arabic language is a collection of historically related variants. Arabic dialects, collectively henceforth Dialectal Arabic (DA), are the day to day vernaculars spoken in the Arab world. They live side by side with Modern Standard Arabic (MSA). As spoken varieties of Arabic, they differ from MSA on all levels of linguistic representation, from phonology, morphology and lexicon to syntax, semantics, and pragmatic language use. The most extreme differences are on phonological and morphological levels.
The language of education in the Arab world is MSA. DA is perceived as a lower form of expression in the Arab world, and therefore not granted the status of MSA, which has implications on the way DA is used in daily written venues. On the other hand, being the spoken language, the native tongue of millions, DA has earned the status of living languages in linguistic studies; thus we see the emergence of serious efforts to study the patterns and regularities in these linguistic varieties of Arabic (Brustad, 2000; Holes, 2004; Bateson, 1967; Erwin, 1963; Cowell, 1964; Rice and Sa’id, 1979; Abdel-Massih et al., 1979). To date most of these studies have been field studies or theoretical in nature with limited annotated data. In current statistical Natural Language Processing (NLP) there is an inherent need for large-scale annotated resources for a language. For DA, there have been some limited focused efforts (Kilany et al., 2002; Maamouri et al., 2004; Maamouri et al., 2006); however, overall, the absence of large annotated resources continues to create a pronounced bottleneck for processing and building robust tools and applications.
DA is a pervasive form of the Arabic language, especially given the ubiquity of the web. DA is emerging as the language of informal communication online, in emails, blogs, discussion forums, chats, SMS, etc., as these are media that are closer to the spoken form of language. These genres pose significant challenges to NLP in general for any language including English. The challenge arises from the fact that the language is less controlled and more speech-like, while many of the textually oriented NLP techniques are tailored to processing edited text. The problem is compounded for Arabic precisely because of the use of DA in these genres. In fact, applying NLP tools designed for MSA directly to DA yields significantly lower performance, making it imperative to direct the research to building resources and dedicated tools for DA processing.
DA lacks large amounts of consistent data due to two factors: a lack of orthographic standards for the dialects, and a lack of overall Arabic content on the web, let alone DA content. These lead to a severe deficiency in the availability of computational annotations for DA data. The project presented here – Cross Lingual Arabic Blog Alerts (COLABA) – aims at addressing some of these gaps by building large-scale annotated DA resources as well as DA processing tools.1
This paper is organized as follows. Section 2. gives a high level description of the COLABA project and reviews the project objectives. Section 3. discusses the annotated resources being created. Section 4. reviews the tools created for the annotation process as well as for the processing of the content of the DA data. Finally, Section 5. showcases how we are synthesizing the resources and tools created for DA for one targeted application.

2. The COLABA Project

COLABA is a multi-site partnership project. This paper, however, focuses only on the Columbia University contributions to the overall project.
COLABA is an initiative to process Arabic social media data such as blogs, discussion forums, chats, etc. Given that the language of such social media is typically DA, one of the main objectives of COLABA is to illustrate the significant impact of the use of dedicated resources for the processing of DA on NLP applications. Accordingly, together with our partners on COLABA, we chose Information Retrieval (IR) as the main testbed application for our ability to process DA.
Given a query in MSA, using the resources and processes created under the COLABA project, the IR system is able to retrieve relevant DA blog data in addition to MSA data/blogs, thus allowing the user access to as much Arabic

1 We do not address the issue of augmenting Arabic web content in this work.
content (in the inclusive sense of MSA and DA) as possi- tuition is that the more words in the blogs that are not
ble. The IR system may be viewed as a cross lingual/cross analyzed or recognized by a MSA morphological an-
dialectal IR system due to the significant linguistic differ- alyzer, the more dialectal the blog. It is worth noting
ences between the dialects and MSA. We do not describe that at this point we only identify that words are not
the details of the IR system or evaluate it here; although we MSA and we make the simplifying assumption that
allude to it throughout the paper. they are DA. This process results in an initial ranking
There are several crucial components needed in order for of the blog data in terms of dialectness.
this objective to be realized. The COLABA IR sys-
tem should be able to take an MSA query and convert 3. Content Clean-Up. The content of the highly ranked
it/translate it, or its component words to DA or alterna- dialectal blogs is sent for an initial round of manual
tively convert all DA documents in the search collection clean up handling speech effects and typographical er-
to MSA before searching on them with the MSA query. In rors (typos) (see Section3.2.). Additionally, one of the
COLABA, we resort to the first solution. Namely, given challenging aspects of processing blog data is the se-
MSA query terms, we process them and convert them to vere lack of punctuation. Hence, we add a step for
DA. This is performed using our DIRA system described sentence boundary insertion as part of the cleaning up
in Section 5.. DIRA takes in an MSA query term(s) and process (see Section 3.3.). The full guidelines will be
translates it/(them) to their corresponding equivalent DA presented in a future publication.
terms. In order for DIRA to perform such an operation it
requires two resources: a lexicon of MSA-DA term corre- 4. Second Ranking of Blogs and Dialectalness De-
spondences, and a robust morphological analyzer/generator tection. The resulting cleaned up blogs are passed
that can handle the different varieties of Arabic. The pro- through the DI pipeline again. However, this time,
cess of creating the needed lexicon of term correspondences we need to identify the actual lexical items and add
is described in detail in Section 3.. The morphological an- them to our lexical resources with their relevant infor-
alyzer/generator, MAGEAD, is described in detail in Sec- mation. In this stage, in addition to identifying the
tion 4.3.. dialectal unigrams using the DI pipeline as described
For evaluation, we need to harvest large amounts of data in step 2, we identify out of vocabulary bigrams and
from the web. We create sets of queries in domains of in- trigrams allowing us to add entries to our created re-
terest and dialects of interest to COLABA. The URLs gen- sources for words that look like MSA words (i.e. cog-
erally serve as good indicators of the dialect of a website; nates and faux amis that already exist in our lexica,
however, given the fluidity of the content and variety in di- yet are specified only as MSA). This process renders
alectal usage in different social media, we decided to per- a second ranking for the blog documents and allows
form dialect identification on the lexical level. us to hone in on the most dialectal words in an ef-
Moreover, knowing the dialect of the lexical items in a doc- ficient manner. This process is further elaborated in
ument helps narrow down the search space in the under- Section 4.2..
lying lexica for the morphological analyzer/generator. Ac-
5. Content Annotation. The content of the blogs that
cordingly, we will also describe the process of dialect an-
are most dialectal are sent for further content annota-
notation for the data.
tion. The highest ranking blogs undergo full word-by-
The current focus of the project is on blogs spanning four
word dialect annotation as described in Section 3.5..
different dialects: Egyptian (EGY), Iraqi (IRQ), Levantine
Based on step 4, the most frequent surface words that
(LEV), and (a much smaller effort on) Moroccan (MOR).
are deemed dialectal are added to our underlying lex-
Our focus has been on harvesting blogs covering 3 do-
ical resources. Adding an entry to our resources en-
mains: social issues, religion and politics.
tails rendering it in its lemma form since our lexical
Once the web blog data is harvested as described in Sec-
database uses lemmas as its entry forms. We create the
tion 3.1., it is subjected to several processes before it is
underlying lemma (process described in Section 3.6.)
ready to be used with our tools, namely MAGEAD and
and its associated morphological details as described
DIRA. The annotation steps are as follows:
in Section 3.7.. Crucially, we tailor the morphologi-
1. Meta-linguistic Clean Up. The raw data is cleaned cal information to the needs of MAGEAD. The choice
from html mark up, advertisements, spam, encoding of surface words to be annotated is ranked based on
issues, and so on. Meta-linguistic information such as the word’s frequency and its absence from the MSA
date and time of post, poster identity information and resources. Hence the surface forms are ranked as
such is preserved for use in later stages. follows: unknown frequent words, unknown words,
then known words that participate in infrequent bi-
2. Initial Ranking of the Blogs. The sheer amount of grams/trigrams compared to MSA bigrams/trigrams.
data harvested is huge; therefore, we need to select All the DA data is rendered into a Colaba Conven-
blogs that have the most dialectal content so as to tional Orthography (CCO) described in Section 3.4..
maximally address the gap between MSA and DA re- Annotators are required to use the CCO for all their
sources. To that end, we apply a simple DA identifi- content annotations.
cation (DI) pipeline to the blog document collection
ranking them by the level of dialectal content. The DI To efficiently clean up the harvested data and annotate its
pipeline is described in detail in Section 4.2.. The in- content, we needed to create an easy to use user interface
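The initial ranking of blogs by dialectal content (step 2 above, refined later in Section 4.2.) reduces to counting how many word types receive no analysis from an MSA morphological analyzer. The sketch below is only an illustration of that idea; the analyzer object and its analyses(word) method are hypothetical stand-ins and do not reflect BAMA's actual interface.

    # Illustrative dialectness ranking: score = fraction of word types with no
    # MSA analysis.  `analyzer` is a stand-in for an MSA morphological analyzer.
    def dialectness_score(tokens, analyzer):
        types = set(tokens)
        unanalyzed = [w for w in types if not analyzer.analyses(w)]
        return len(unanalyzed) / max(len(types), 1), unanalyzed

    def rank_blogs(blogs, analyzer):
        """Return blog ids sorted from most to least dialectal.
        `blogs` maps a blog id to its list of tokens."""
        scored = []
        for blog_id, tokens in blogs.items():
            score, _ = dialectness_score(tokens, analyzer)
            scored.append((score, blog_id))
        return [blog_id for score, blog_id in sorted(scored, reverse=True)]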
with an underlying complex database repository that orga- for such a task is not straight forward. Thus, we simpli-
nizes the data and makes it readily available for further re- fied the task to the narrow identification of the following
search. The annotation tool is described in Section 4.1.. categories:

• MSA with non-standard orthography, e.g., è Yë



3. Resource Creation
hðh̄ ‘this’ becomes è Yë hðh, and Yg. A‚ÖÏ @ AlmsAjð
Resource creation for COLABA is semi automatic. As
mentioned earlier, there is a need for a large collection of ‘mosques’ becomes Yg. A‚ÖÏ @ AlmsAjd.
data to test out the COLABA IR system. The data would
• Speech Effects (SE) are typical elongation we see
ideally have a large collection of blogs in the different rele-
in blog data used for emphasis such as èPðððñ»
vant dialects in the domains of interest, annotated with the 
relevant levels of linguistic knowledge such as degree of kwwwwrh ‘ball’ is rendered èPñ» kwrh̄.
dialectness and a lexicon that has coverage of the lexical • Missing/Added Spaces (MS/AS) are cases where there
items in the collection. Accordingly, the blog data is har- is obviously a missing space between two or more
vested using a set of identified URLs as well as queries that words that should have been rendered with a space.
are geared towards the domains of interest in the dialect. 
For example, in EGY, éK AKQ.ËA‚ʾ
 JÓ mtklšAlbrtÂnh̄
3.1. Data Harvesting 
‘don’t eat the orange’ is turned into éK AKQ.Ë@ Ê¾
 JÓ
Apart from identifying a set of URLs in each of the rele- mtklš AlbrtÂnh̄. Note that in this dialectal example,
vant dialects, we designed a set of DA queries per dialect we do not require the annotator to render the word
to harvest large quantities of DA data from the web. These 
for orange éK AKQ.Ë@ AlbrtÂnh̄ in its MSA form, namely,
queries were generated by our annotators with no restric- éËA® KQË@ AlbrtqAlh̄.
tions on orthographies, in fact, we gave the explicit request .
that they provide multiple variable alternative orthogra-
3.3. Sentence Boundary Detection
phies where possible. The different dialects come with their
unique challenges due to regional variations which impact In blogs, sentence boundaries are often not marked explic-
the way people would orthographically represent different itly with punctuation. In this task, annotators are required to
pronunciations. For example, DA words with MSA cog- insert boundaries between sentences. We define a sentence

nates whose written form contains the † q2 (Qaf) consonant in our guidelines as a syntactically and semantically coher-
 ent unit in language. Every sentence has to have at least
may be spelled etymologically (as † q) or phonologically
a main predicate that makes up a main clause. The predi-
as one of many local variants: ¼ k, @ Â or À G. cate could be a verb, or in the case of verb-less sentences,
We collected 40 dialectal queries from each of our 25 an- the predicate could be a nominal, adjectival or a preposi-
notators specifically asking them when possible to identify tional phrase. Table 2 illustrates a blog excerpt as it occurs
further regional variations. In our annotations in general, naturally on the web followed by sentence boundaries ex-
we make the gross simplifying assumption that Levantine plicitly inserted with a carriage return splitting the line in
(Syrian, Lebanese, Palestinian and Jordanian) Arabic is a three sentences.
single dialect. However, for the process of query gen-
eration, we asked annotators to identify sub-dialects. So 3.4. COLABA Conventional Orthography
some of our queries are explicitly marked as Levantine- Orthography is a way of writing language using letters and
Palestinian or Levantine-Syrian for instance. Moreover, symbols. MSA has a standard orthography using the Ara-
we asked the annotators to provide queries that have verbs bic script. Arabic dialects, on the other hand, do not have
where possible. We also asked them to focus on queries a standard orthographic system. As such, a variety of ap-
related to the three domains of interest: politics, religion proximations (phonological/lexical/etymological) are often
and social issues. All queries were generated in DA using pursued; and they are applied using Arabic script as well as
Arabic script, bearing in mind the lack of orthographic stan- Roman/other scripts. In an attempt to conventionalize the
dards. The annotators were also asked to provide an MSA orthography, we define a phonological scheme which we
translation equivalent for the query and an English trans- refer to as the COLABA Conventional Orthography (CCO).
lation equivalent. Table 1 illustrates some of the queries This convention is faithful to the dialectal pronunciation as
generated. much as possible regardless of the way a word is typically
written. This scheme preserves and explicitly represents all
3.2. Typographical Clean Up the sounds in the word including the vowels. For example,
Blog data is known to be a challenging genre for any lan- H. AK. bAb ‘door’ is rendered as be:b in CCO for LEV (specif-
guage from a textual NLP perspective since it is more akin ically Lebanese) but as ba:b for EGY.3 The full guidelines
to spoken language. Spelling errors in MSA (when used) will be detailed in a future publication.
abound in such genres which include speech effects. The
problem is compounded for Arabic since there are no DA 3
Most CCO symbols have English-like/HSB-like values, e.g.,
orthographic standards. Accordingly, devising guidelines
H. b or Ð m. Exceptions include T ( H θ), D ( X ð), c ( € š), R ( ¨ γ),
2
All Arabic transliterations are provided in the Habash-Soudi- 7 ( h H), 3 ( ¨ ς), and 2 ( Z ’). CCO uses ‘.’ to indicate empha-
Buckwalter (HSB) transliteration scheme (Habash et al., 2007). sis/velarization, e.g., t. ( T).
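Speech effects such as letter elongation (kwwwwrh for kwrh 'ball') lend themselves to automatic flagging before the manual clean-up pass. The regex sketch below is a simplification I am adding for illustration: it collapses any character repeated three or more times, so it is best used to suggest candidate corrections to annotators rather than to apply them blindly.

    import re

    ELONGATION = re.compile(r"(.)\1{2,}")   # any character repeated 3+ times

    def flag_speech_effects(token: str):
        """Return (normalized_token, changed): long repetitions collapsed to a
        single letter, intended as a pre-annotation suggestion only."""
        normalized = ELONGATION.sub(r"\1", token)
        return normalized, normalized != token

    print(flag_speech_effects("kwwwwrh"))   # ('kwrh', True)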
DA Query DA MSA English

èQëA£ ù®K . †C¢Ë@ EGY
èQëA£ iJ“@ †C¢Ë@
.
 divorce became very common
ÕºÊJm.k @ h@P IRQ ÕºË ø ðP@ ¬ñƒ I will tell you a story
éJK. Q.¯ ¨ PðX h@P LEV éJK. @ Q.¯ úÍ@ @P ñ¯ I.ë X He went directly to visit his father’s tomb
ñƒ@Q¯ XAƒ È@P AÓ MOR YJk. ©“ð ú¯ È@P B he is still in good shape

Table 1: Sample DA queries used for harvesting blog data

Input text
èXñÓ éÊ¿ ø Qå @ ñm' 
. .  « ø YK. ð XBð I.Jk. ð h. ð QK@ ø YK. è@P ñJ»Xð QJ‚k. AÓ Yg@ ø YK.
After manual sentence boundary detection
è@P ñJ»Xð QJ‚k. AÓ Yg@ ø YK.
XBð I.Jk. ð h. ð QK@ ø YK.
èXñÓ éÊ¿ ø Qå @ ñm' 
. .  « ø YK. ð
Table 2: LEV blog excerpt with sentence boundaries identified.

• CCO explicitly indicates the pronounced short vow- 3.5. Dialect Annotation
els and consonant doubling, which are expressed in Our goal is to annotate all the words in running text with
Arabic script with optional diacritics. Accordingly, their degree of dialectalness. In our conception, for the
there is no explicit marking for the sukuun diacritic purposes of COLABA we think of MSA as a variant di-
which we find in Arabic script. For example, the CCO alect; hence, we take it to be the default case for the Arabic
for I. »QÓ mrkb in EGY could be markib ‘boat’ or mi- words in the blogs. We define a dialectal scale with respect
rakkib ‘something put together/causing to ride’ or mu- to orthography, morphology and lexicon. We do not han-
rakkab ‘complex’. dle phrasal level or segment level annotation at this stage
• Clitic boundaries are marked with a +. This is an of our annotation, we strictly abide by a word level annota-
attempt at bridging the gap between phonology and tion.4 The annotators are required to provide the CCO rep-
morphology. We consider the following affixations resentation (in Section 3.4.) for all the words in the blog.
as clitics: conjunctions, prepositions, future particles, If a word as it appears in the original blog maintains its
progressive particles, negative particles, definite arti- meaning and orthography as in MSA then it is considered
cles, negative circumfixes, and attached pronouns. For the default MSA for dialect annotation purposes, however
example, in EGY CCO ÐCƒð wslAm ‘and peace’ is if it is pronounced in its context dialectically then its CCO
 . JºK AÓ mAyktbš ‘he doesn’t representation will reflect the dialectal pronunciation, e.g.
rendered we+sala:m and 
write’ is rendered ma+yiktib+c.
I.JºK, yktb ‘he writes’ is considered MSA from a dialect
annotation perspective, but in an EGY context its CCO rep-
• We use the ^ symbol to indicate the presence of the resentation is rendered yiktib rather than the MSA CCO of
Ta Marbuta (feminine marker) morpheme or of the yaktub.
Tanween (nunation) morpheme (marker of indefinite- Word dialectness is annotated according to a 5-point scale

ness). For example, éJ.JºÓ mktbh̄ ‘library’ is rendered building on previous efforts by Habash et al. (2008):
 JºJK. bi+yiktib
in CCO as maktaba^ (EGY). Another example is AJÊÔ« • WL1: MSA with dialect morphology I .
ςmlyAã ‘practically’, which is rendered in CCO as ‘he is writing’, I 
. JºJë ha+yiktib ‘he will write’
3amaliyyan^.
• WL2: MSA faux amis where the words look MSA but
CCO is comparable to previous efforts on creating re- are semantically used dialectically such as Ñ« 3am a
sources for Arabic dialects (Maamouri et al., 2004; Kilany LEV progressive particle meaning ‘in the state of’ or
et al., 2002). However, unlike Maamouri et al. (2004), MSA ‘uncle’
CCO is not defined as an Arabic script dialectal orthogra-
phy. CCO is in the middle between the morphophonemic • WL3: Dialect lexeme with MSA morphology such as
and phonetic representations used in Kilany et al. (2002) É«Qƒ sa+yiz3al ‘he will be upset’
for Egyptian Arabic. CCO is quite different from com-
• WL4: Dialect lexeme where the word is simply a di-
monly used transliteration schemes for Arabic in NLP such  mic ‘not’
alectal word such as the negative particle Ó
as Buckwalter transliteration in that CCO (unlike Buckwal-
ter) is not bijective with Arabic standard orthography.
For the rest of this section, we will use CCO in place of the 4
Annotators are aware of multiword expressions and they note
HSB transliteration except when indicated. them when encountered.
• WL5: Dialect lexeme with a consistent systematic identify the various POS based on form, meaning, and

phonological variation from MSA, e.g., LEV éKCK grammatical function illustrated using numerous examples.

tala:te^ ‘three’ versus éKCK Tala:Ta^. The set of POS tags are as follows: (Common) Noun,
Proper Noun, Adjective, Verb, Adverb, Pronoun, Preposi-
In addition, we specify another six word categories that are tion, Demonstrative, Interrogative, Number, and Quantifier.
of relevance to the annotation task on the word level: For- We require the annotators to provide a detailed morphologi-
eign Word (ñKCJk., jila:to, ‘gelato ice cream’), Borrowed cal profile for three of the POS tags mentioned above: Verb,
Word ( YK@ ½K ð, wi:k 2end, ‘weekend’), Arabic Named En- Noun and Adjective. For this task, our main goal is to iden-
tify irregular morphological behavior. They transcribe all
. AKX ðQÔ«, 3amr dya:b, ‘Amr Diab’), Foreign Named
tity ( H
their data entries in the CCO representation only as defined
Entity ( QKPA¿ ùÒJk., jimi kartar, ‘Jimmy Carter’), Typo (fur-
in Section 3.4.. We use the Arabic script below mainly for
ther typographical errors that are not caught in the first illustration in the following examples.
round of manual clean-up), and in case they don’t know
the word, they are instructed to annotate it as unknown. • Verb Lemma: In addition to the basic 3rd person
masculine singular (3MS) active perfective form of
3.6. Lemma Creation
This task is performed for a subset of the words in the
the dialectal verb lemma, e.g., H . Qå cirib ‘he drank’
(EGY), the annotators are required to enter: (i) the
blogs. We focus our efforts first on the cases where an MSA
morphological analyzer fails at rendering any analysis for
3MS active imperfective H . Qå„ yicrab; (ii) the 3MS
a given word in a blog. We are aware that our sampling
passive perfective is H . Qå„@ incarab; (iii) the 3MS
ignores the faux amis cases with MSA as described in Sec- passive imperfective H. Qå„JK yincirib; and (iv) and the
tion 3.5.. Thus, for each chosen/sampled dialectal surface masculine singular imperative H . Qå @ icrab.
word used in an example usage from the blog, the annotator
is required to provide a lemma, an MSA equivalent, an En- • Noun Lemma: The annotators are required to en-
glish equivalent, and a dialect ID. All the dialectal entries ter the feminine singular form of the noun if avail-
are expected to be entered in the CCO schema as defined in able. They are explicitly asked not to veer too much
Section 3.4.. away from the morphological form of the lemma, so
for example, they are not supposed to put Iƒ  sit
We define a lemma (citation form) as the basic entry form
of a word into a lexical resource. The lemma represents ‘woman/lady’ as the feminine form of Ég. @P ra:gil
the semantic core, the most important part of the word that ‘man’. The annotators are asked to specify the ratio-
carries its meaning. In case of nouns and adjectives, the nality/humanness of the noun which interacts in Ara-
lemma is the definite masculine singular form (without the bic with morphosyntactic agreement. Additional op-
explicit definite article). And in case of verbs, the lemma is tional word forms to provide are any broken plurals,
the 3rd person masculine singular perfective active voice. mass count plural collectives, and plurals of plurals,
All lemmas are clitic-free. e.g rigga:la^ and riga:l ‘men’ are both broken plurals
A dialectal surface word may have multiple underlying of ra:gil ‘man’.
lemmas depending on the example usages we present to the
annotators. For example, the word éJ.»QÓ mrkbh occurs in • Adjective Lemma: For adjectives, the annotators pro-
vide the feminine singular form and any broken plu-
two examples in our data: 1. éK YK AK. éJ.»QÓ ú×Aƒ sa:mi mi-
rals, e.g. the adjective Èð @ 2awwel ‘first [masc.sing]’
rakkib+uh be+2ide:+h ‘Sami built it with his own hands’
has the corresponding EGY lemma mirakkib ‘build’; and has the feminine singular form úÍð @ 2u:la and the bro-
  @ñk@P éËAg
2. éJÓ éJ.»QÓ @ð Q‚
 QË@ ir+rigga:la^ ra:7u yictiru
. ken plural ÉK@ð @ 2awa:2il.
markib+uh minn+uh ‘The men went to buy his boat from
him’ with the corresponding lemma markib ‘boat’. The an-
4. Tools for COLABA
notators are asked to explicitly associate each of the created
lemmas with one or more of the presented corresponding In order to process and manage the large amounts of data
usage examples. at hand, we needed to create a set of tools to streamline the
annotation process, prioritize the harvested data for manual
3.7. Morphological Profile Creation annotation, then use the created resources for MAGEAD.
Finally, we further define a morphological profile for the
entered lemmas created in Section 3.6.. A computation- 4.1. Annotation Interface
ally oriented morphological profile is needed to complete Our annotation interface serves as the portal which annota-
the necessary tools relevant for the morphological analyzer tors use to annotate the data. It also serves as the repository
MAGEAD (see Section 4.3.). We ask the annotators to se- for the data, the annotations and management of the anno-
lect (they are given a list of choices) the relevant part-of- tators. The annotation interface application runs on a web
speech tag (POS) for a given lemma as it is used in the server because it is the easiest and most efficient way to al-
blogs. For some of the POS tags, the annotators are re- low different annotators to work remotely, by entering their
quested to provide further morphological specifications. annotations into a central database. It also manages the an-
In our guidelines, we define coarse level POS tags by pro- notators tasks and tracks their activities efficiently. For a
viding the annotators with detailed diagnostics on how to more detailed description of the interface see (Benajiba and
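Taken together, the lemma creation and morphological profile annotations of Sections 3.6. and 3.7. populate a structured lexical record per dialectal lemma. The dictionary below is a hypothetical rendering of the EGY verb entry discussed above (cirib 'he drank'); the field names and the MSA gloss shown are my own illustration and do not reflect the actual COLABA database schema.

    # Hypothetical shape of a COLABA lexical entry after lemma and profile
    # annotation; field names are illustrative, not the project's schema.
    entry = {
        "lemma_cco": "cirib",            # 3MS perfective active, in CCO
        "dialect": "EGY",
        "pos": "Verb",
        "msa_equivalent": "šarib",       # illustrative MSA gloss
        "english_gloss": "drink",
        "verb_forms": {
            "imperfective_3ms": "yicrab",
            "passive_perfective_3ms": "incarab",
            "passive_imperfective_3ms": "yincirib",
            "imperative_ms": "icrab",
        },
        "examples": ["..."],             # links to the blog usages it was seen in
    }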
Diab, 2010). For efficiency and security purposes, the an- 4.2. DA Identification Pipeline
notation application uses two different servers. In the first We developed a simple module to determine the degree to
one, we allocate all the html files and dynamic web pages. which a text includes DA words. Specifically, given Ara-
We use PHP to handle the dynamic part of the application bic text as input, we were interested in determining how
which includes the interaction with the database. The sec- many words are not MSA. The main idea is to use an MSA
ond server is a database server that runs on PostgreSQL.5 morphological analyzer, Buckwalter Arabic Morphological
Our database comprises 22 relational databases that are cat- Analyzer (BAMA) (Buckwalter, 2004), to analyze the input
egorized into tables for: text. If BAMA is able to generate a morphological analysis
• Basic information that is necessary for different mod- for an input word, then we consider that word MSA.
ules of the application. These tables are also signif- As a result, we have a conservative assessment of the di-
icantly useful to ease the maintenance and update of alectness of an input text. A major source of potential errors
the application. are names which are not in BAMA.
We assessed our pipeline on sample blog posts from our
• User permissions: We have various types of users with
harvested data. In an EGY blog post6 19% of the word
different permissions and associated privileges. These
types failed BAMA analysis. These words are mainly DA
tables allow the application to easily check the permis-
words with few named entities. Similar experiments were
sions of a user for every possible action.
conducted on IRQ,7 LEV,8 and MOR9 blog posts yielding
• Annotation information: This is the core table cat- 13.5%, 8% and 26% of non-MSA word types, respectively.
egory of our database. Its tables save the annota- It is worth noting the high percentage of out of vocabulary
tion information entered by each annotator. They also words for the Moroccan thread compared to the other di-
save additional information such as the amount of time alects. Also, by comparison, the low number of misses for
taken by an annotator to finish an annotation task. Levantine. This may be attributed to the fact that BAMA
For our application, we define three types of users, hence covers some Levantine words due to the LDC’s effort on
three views (see Figure 1): the Levantine Treebank (Maamouri et al., 2006).
We further analyzed BAMA-missed word types from a 30K
1. Annotator. An Annotator can perform an annota- word blog collection. We took a sample of 100 words from
tion task, check the number of his/her completed an- the 2,036 missed words. We found that 35% are dialectal
notations, and compare his/her speed and efficiency words and that 30% are named entities. The rest are MSA
against other annotators. An annotator can only work word that are handled by BAMA. We further analyzed two
on one dialect by definition since they are required to 100 string samples of least frequent bigrams and trigrams of
possess native knowledge it. An annotator might be word types (measured against an MSA language model) in
involved in more than one annotation task. the 30K word collection. We found that 50% of all bigrams
2. Lead Annotator. A Lead annotator (i) manages the an- and 25% of trigrams involved at least one dialectal word.
notators’ accounts, (ii) assigns a number of task units The percentages of named entities for bigrams and trigrams
to the annotators, and, (iii) checks the speed and work in our sample sets are 19% and 43%, respectively.
quality of the annotators. Leads also do the tasks
themselves creating a gold annotation for comparison 4.3. MAGEAD
purposes among the annotations carried out by the an- M AGEAD is a morphological analyzer and generator for
notators. A lead is an expert in only one dialect and the Arabic language family, by which we mean both MSA
thus s/he can only intervene for the annotations related and DA. For a fuller discussion of M AGEAD (including an
to that dialect. evaluation), see (Habash et al., 2005; Habash and Rambow,
2006; Altantawy et al., 2010). For an excellent discussion
3. Administrator. An Administrator (i) manages the of related work, see (Al-Sughaiyer and Al-Kharashi, 2004).
Leads’ accounts, (ii) manages the annotators’ ac-
M AGEAD relates (bidirectionally) a lexeme and a set of lin-
counts, (iii) transfers the data from text files to the
guistic features to a surface word form through a sequence
database, (iv) purges the annotated data from the data
of transformations. In a generation perspective, the features
base to xml files, and (v) produces reports such as
are translated to abstract morphemes which are then or-
inter-annotator agreement statistics, number of blogs
dered, and expressed as concrete morphemes. The concrete
annotated, etc.
templatic morphemes are interdigitated and affixes added,
The website uses modern JavaScript libraries in order to finally morphological and phonological rewrite rules are
provide highly dynamic graphical user interfaces (GUI). applied. In this section, we discuss our organization of lin-
Such GUIs facilitate the annotator’s job leading to signifi- guistic knowledge, and give some examples; a more com-
cant gain in performance speed by (i) maximizing the num- plete discussion of the organization of linguistic knowledge
ber of annotations that can be performed by a mouse click in M AGEAD can be found in (Habash et al., 2005).
rather than a keyboard entry and by (ii) using color cod-
ing for fast checks. Each of the GUIs which compose our 6
http://wanna-b-a-bride.blogspot.com/2009/09/blog-
web applications has been carefully checked to be consis- post_29.html
7
tent with the annotation guidelines. http://archive.hawaaworld.com/showthread.php?t=606067&page=76
8
http://www.shabablek.com/vb/t40156.html
5 9
http://www.postgresql.org/ http://forum.oujdacity.net/topic-t5743.html
Figure 1: Servers and views organization.

Lexeme and Features Morphological analyses are rep- DA/MSA independent. Although as more Arabic variants
resented in terms of a lexeme and features. We define the are added, some modifications may be needed. Our current
lexeme to be a triple consisting of a root, a morphological MBC hierarchy specification for both MSA and Levantine,
behavior class (MBC), and a meaning index. We do not which covers only the verbs, comprises 66 classes, of which
deal with issues relating to word sense here and therefore 25 are abstract, i.e., only used for organizing the inheritance
do not further discuss the meaning index. It is through this hierarchy and never instantiated in a lexeme.
view of the lexeme (which incorporates productive deriva- MAGEAD Morphemes To keep the MBC hierarchy
tional morphology without making claims about semantic variant-independent, we have also chosen a variant-
predictability) that we can have both a lexeme-based repre- independent representation of the morphemes that the MBC
sentation, and operate without a lexicon (as we may need hierarchy maps to. We refer to these morphemes as abstract
to do when dealing with a dialect). In fact, because lex- morphemes (AMs). The AMs are then ordered into the
emes have internal structure, we can hypothesize lexemes surface order of the corresponding concrete morphemes.
on the fly without having to make wild guesses (we know The ordering of AMs is specified in a variant-independent
the pattern, it is only the root that we are guessing). Our context-free grammar. At this point, our example (1) looks
evaluation shows that this approach does not overgenerate. like this:
We use as our example the surface form HQëX  P@ Aizda-
harat (Azdhrt without diacritics) “she/it flourished". The (2) [Root:zhr][PAT_PV:VIII]
M AGEAD lexeme-and-features representation of this word [VOC_PV:VIII-act] + [SUBJSUF_PV:3FS]
form is as follows:
Note that the root, pattern, and vocalism are not ordered
(1) Root:zhr MBC:verb-VIII POS:V PER:3 GEN:F with respect to each other, they are simply juxtaposed.
NUM:SG ASPECT:PERF The ‘+’ sign indicates the ordering of affixival morphemes.
Only now are the AMs translated to concrete morphemes
Morphological Behavior Class An MBC maps sets (CMs), which are concatenated in the specified order. Our
of linguistic feature-value pairs to sets of abstract mor- example becomes:
phemes. For example, MBC verb-VIII maps the feature-
value pair ASPECT:PERF to the abstract root morpheme (3) <zhr,V1tV2V3,iaa> +at
[PAT_PV:VIII], which in MSA corresponds to the concrete
root morpheme V1tV2V3, while the MBC verb-II maps AS- Simple interdigitation of root, pattern and vocalism then
PECT:PERF to the abstract root morpheme [PAT_PV:II], yields the form iztahar+at.
which in MSA corresponds to the concrete root morpheme MAGEAD Rules We have two types of rules. Mor-
1V22V3. We define MBCs using a hierarchical representa- phophonemic/phonological rules map from the morphemic
tion with non-monotonic inheritance. The hierarchy allows representation to the phonological and orthographic repre-
us to specify only once those feature-to-morpheme map- sentations. For MSA, we have 69 rules of this type. Ortho-
pings for all MBCs which share them. For example, the graphic rules rewrite only the orthographic representation.
root node of our MBC hierarchy is a word, and all Arabic These include, for example, rules for using the gemination
words share certain mappings, such as that from the lin- shadda (consonant doubling diacritic). For Levantine, we
guistic feature conj:w to the clitic w+. This means that have 53 such rules.
all Arabic words can take a cliticized conjunction. Sim- For our example, we get /izdaharat/ at the phonological
ilarly, the object pronominal clitics are the same for all level. Using standard MSA diacritized orthography, our
transitive verbs, no matter what their templatic pattern is. example becomes Aizdaharat (in transliteration). Remov-
We have developed a specification language for expressing  P@
ing the diacritics turns this into the more familiar HQëX
MBC hierarchies in a concise manner. Our hypothesis is Azdhrt. Note that in analysis mode, we hypothesize all pos-
that the MBC hierarchy is Arabic variant-independent, i.e. sible diacritics (a finite number, even in combination) and
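The interdigitation step that turns the abstract representation <zhr, V1tV2V3, iaa> into iztahar can be pictured as slotting root consonants and vocalism vowels into the pattern's positions. The toy function below is only an illustration of that idea for this one example; MAGEAD does this with multi-tape finite-state machinery, and the subsequent assimilation of the Form VIII t that yields izdaharat is handled by its rewrite rules, which are not reproduced here.

    # Toy interdigitation of root, pattern and vocalism, following the MAGEAD
    # example zhr + V1tV2V3 + iaa.  Digits in the pattern index root radicals,
    # 'V' slots consume the vocalism, other symbols are copied through.
    def interdigitate(root: str, pattern: str, vocalism: str) -> str:
        vowels = iter(vocalism)
        out = []
        for ch in pattern:
            if ch.isdigit():
                out.append(root[int(ch) - 1])   # 1-based radical index
            elif ch == "V":
                out.append(next(vowels))
            else:
                out.append(ch)                   # pattern material, e.g. the Form VIII t
        return "".join(out)

    stem = interdigitate("zhr", "V1tV2V3", "iaa")
    print(stem + "at")   # iztaharat; later rewrite rules produce izdaharat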
perform the analysis on the resulting multi-path automaton. types of surface forms for the search engine (the contextual
We follow (Kiraz, 2000) in using a multi-tape representa- material is left unchanged):
tion. We extend the analysis of Kiraz by introducing a fifth
tier. The five tiers are used as follows: Tier 1: pattern and • Mode 1: MSA inflected forms. For example, the
affixational morphemes; Tier 2: root; Tier 3: vocalism; Tier MSA query term iJ.“ @ ÂSbH ‘he became’ is expanded
4: phonological representation; Tier 5: orthographic repre-
sentation. In the generation direction, tiers 1 through 3 are to several MSA forms including AJj.“ @ ÂSbHnA ‘we
always input tiers. Tier 4 is first an output tier, and subse- became’, iJ.’ƒ sySbH ‘he will become’, etc.
quently an input tier. Tier 5 is always an output tier.
We implemented our multi-tape finite state automata as a • Mode 2: MSA inflected with dialectal morphemes.
layer on top of the AT&T two-tape finite state transducers It is common in DA to borrow an MSA verb and in-
(Mohri et al., 1998). We defined a specification language flect it using dialectal morphology; we refer to this
for the higher multi-tape level, the new M ORPHTOOLS for- phenomenon as intra-word code switching. For exam-
mat. Specification in the M ORPHTOOLS format of different
ple, the MSA query term iJ.“ @ ÂSbH can be expanded
types of information such as rules or context-free gram-
mars for morpheme ordering are compiled to the appro- into iJ.’Jë hySbH ‘he will become’ and @ñjJ.’Jë hyS-
priate L EXTOOLS format (an NLP-oriented extension of bHwA ‘they will become’.
the AT&T toolkit for finite-state machines, (Sproat, 1995)).
For reasons of space, we omit a further discussion of M OR - • Mode 3: MSA lemma translated to a dialectal lemma,
PHTOOLS . For details, see (Habash et al., 2005). and then inflected with dialectal morphemes. For ex-
From MSA to Levantine and Egyptian We modified ample, the MSA query term iJ.“ @ ÂSbH can be ex-
M AGEAD so that it accepts Levantine rather than MSA 
verbs. Our effort concentrated on the orthographic repre-
panded into EGY ù®K. bqý ‘he became’ and ù®J . Jë hy-
bqý ‘he will become’.
sentation; to simplify our task, we used a diacritic-free or-
thography for Levantine developed at the Linguistic Data
Currently, DIRA handles EGY and LEV; with the exis-
Consortium (Maamouri et al., 2006). Changes were done
tence of more resources for additional dialects, they will
only to the representations of linguistic knowledge, not to
be added. The DIRA system architecture is shown in Fig-
the processing engine. We modified the MBC hierarchy,
ure 2. After submitting an MSA query to DIRA, the verb is
but only minor changes were needed. The AM ordering
extracted out of its context and sent to the MSA verb lemma
can be read off from examples in a fairly straightforward
detector, which is responsible for analyzing an MSA verb
manner; the introduction of an indirect object AM, since
(using MAGEAD in the analysis direction) and computing
it cliticizes to the verb in dialect, would, for example, re-
its lemma (using MAGEAD in the generation direction).
quire an extension to the ordering specification. The map-
The next steps depend on the chosen dialects and modes.
ping from AMs to CMs, which is variant-specific, can be
If translation to one or more dialects is required, the in-
obtained easily from a linguistically trained (near-)native
put lemma is translated to the dialects (Mode 3). Then,
speaker or from a grammar handbook. Finally, the rules,
the MAGEAD analyzer is run on the lemma (MSA or DA,
which again can be variant-specific, require either a good
if translated) to determine the underlying morphemes (root
morpho-phonological treatise for the dialect, a linguisti-
and pattern), which are then used to generate all inflected
cally trained (near-)native speaker, or extensive access to
forms using MAGEAD (again, which forms are generated
an informant. In our case, the entire conversion from MSA
depends on the mode). Finally, the generated forms are
to Levantine was performed by a native speaker linguist in
re-injected in the original query context (duplicates are re-
about six hours. A similar but more limited effort was done
moved).
to extend the Levantine system to Egyptian by introducing
the Egyptian concrete morpheme for the future marker +ë
h+ ‘will’. 6. Conclusions and Future Work
We presented COLABA, a large effort to create resources
5. Resource Integration & Use: DIRA
and processing tools for Dialectal Arabic. We briefly de-
DIRA (Dialectal Information Retrieval for Arabic) is a scribed the objectives of the project and the various types
component in an information retrieval (IR) system for Ara- of resources and tools created under it. We plan to continue
bic. It integrates the different resources created above in its working on improving the resources and tools created so
pipeline. As mentioned before, one of the main problems of far and extending them to handle more dialects and more
searching Arabic text is the diglossic nature of the Arabic types of dialectal data. We are also considering branching
speaking world. Though MSA is used in formal contexts on into application areas other than IR that can benefit from
the Internet, e.g., in news reports, DA is dominant in user- the created resources, in particular, machine translation and
generated data such as weblogs and web discussion forums. language learning.
Furthermore, the fact that Arabic is a morphologically rich
language only adds problems for IR systems. DIRA ad-
dresses both of these issues. DIRA is basically a query-
Acknowledgments
term expansion module. It takes an MSA verb (and possi- This work has been mostly funded by ACXIOM Corpora-
bly some contextual material) as input and generates three tion.
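DIRA's three expansion modes compose morphological analysis, optional lemma translation, and generation. The sketch below shows that control flow only; analyze, generate, and translate_lemma are hypothetical stand-ins for MAGEAD (analysis and generation directions) and the MSA-DA lexicon, not real APIs, and the morphology keyword is likewise assumed.

    # Control-flow sketch of DIRA-style query-term expansion (Modes 1-3).
    def expand_term(msa_verb, dialects, modes, analyze, generate, translate_lemma):
        lemma = analyze(msa_verb).lemma               # MSA verb lemma detection
        expansions = set()
        if 1 in modes:                                 # Mode 1: MSA inflected forms
            expansions.update(generate(lemma, morphology="MSA"))
        if 2 in modes:                                 # Mode 2: MSA lemma, dialect morphemes
            for da in dialects:
                expansions.update(generate(lemma, morphology=da))
        if 3 in modes:                                 # Mode 3: dialect lemma, dialect morphemes
            for da in dialects:
                for da_lemma in translate_lemma(lemma, da):
                    expansions.update(generate(da_lemma, morphology=da))
        return sorted(expansions)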
Figure 2: DIRA system architecture

A Linguistic Search Tool for Semitic Languages

Alon Itai
Knowledge Center for Processing Hebrew
Computer Science Department
Technion, Haifa, Israel
E-mail: itai@cs.technion.ac.il

Abstract
The paper discusses searching a corpus for linguistic patterns. Semitic languages have complex morphology and ambiguous writing
systems. We explore the properties of Semitic Languages that challenge linguistic search and describe how we used the Corpus
Workbench (CWB) to enable linguistic searches in Hebrew corpora.

1. Introduction
As linguistics matures, so the methods it uses turn towards the empirical. It is no longer enough to introspect to gather linguistic insight: data is required. While most search engines look for words, linguists are interested in grammatical patterns and in the usage of words within these patterns. For example, searching for the adjective "record" followed by a noun should yield "record highs" but not "record songs"; searching for the verb to eat (in any inflection) followed by a noun should yield sentences such as "John ate dinner." but not "Mary ate an apple" (the verb is followed by an article, not a noun).

To answer these needs, several systems have been constructed. Such systems take a corpus and preprocess it to enable linguistic searches. We argue that general purpose search tools are not suitable for Semitic languages unless special measures are taken. We then show how to use one such tool to enable linguistic searches in Semitic languages, with Modern Hebrew as a test case.

2. Semitic Languages
Semitic languages pose interesting challenges to linguistic search engines. The rich morphology entails that each word contains, in addition to the lemma, a large number of morphological and syntactic features: part of speech, number, gender, case. Nouns also inflect for status (absolute or construct) and possessive. Verbs inflect for person, tense, voice and accusative. Thus one may want to search for a plural masculine noun, followed by a plural verb in past tense with a first person singular accusative inflection.

Additional problems arise from the writing system. Some prepositions and conjunctions are attached to the word as prefixes. For example, in Hebrew the word bbit1 may only be analyzed as the preposition b (in) + the noun bit (house), whereas the word bit, which also starts with a b, can be analyzed only as the noun bit, since the remainder, it, is not a Hebrew word. Thus to find a preposition one needs to perform a morphological analysis of the word to decide whether the first letter is a preposition or part of the lemma. Hence, in order to extract useful information, the text has to be morphologically analyzed first.

1 We use the following transliteration: ‫ תשרקצפעסנמלכיטחזוהדגבא‬abgdhwzxTiklmnsypcqršt

All this leads to a high degree of morphological ambiguity. Ambiguity is increased since the writing systems of Arabic and Hebrew omit most of the vowels. In a running Hebrew text a word has 2.2-2.8 different analyses on average. (The number of analyses depends on the corpus and the morphological analyzer: if the analyzer distinguishes between more analyses and if it uses a larger lexicon, it will find more analyses.)

Ideally one would wish to use manually tagged corpora, i.e., corpora where the correct analysis of each word was manually chosen. However, since it is expensive to manually tag a large corpus, the size of such corpora is limited and many interesting linguistic phenomena will not be represented. Thus, one may either use automatically tagged corpora or use an ambiguous corpus and retrieve a sentence if any one of its possible analyses satisfies the query. We preferred the latter approach because of the high error rate of programs that attempt to find the right analysis in context. (An error rate of 5% per word entails that a 20-word sentence has probability of only (1 - 0.05)^20 ≈ e^-1 of being analyzed entirely correctly.) Moreover, since these systems use machine learning (HMM or SVM) (Bar Haim et al. 2005, Diab et al. 2004, Habash et al. 2005), they prefer the more common structure, and thus rare linguistic structures are more likely to be incorrectly tagged. However, these are exactly the phenomena a corpus linguist would like to search for. Consequently, to successfully perform linguistic searches one cannot rely on automatic morphological disambiguation, and it is better to allow all possible analyses and retrieve a sentence even if only one of the analyses satisfies the query.
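A worked version of the error-rate approximation above, using only the 5% per-word figure stated in the text and assuming errors on different words are independent:

P(all 20 words correct) = (1 - 0.05)^{20} = 0.95^{20} \approx 0.358 \approx e^{-1}
P(at least one error)   = 1 - 0.95^{20} \approx 0.64

so roughly two out of three 20-word sentences would contain at least one tagging error.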
3. CWB
CWB – the Corpus Workbench – is a tool created at the University of Stuttgart for searching corpora. The tool enables linguistically motivated searches; for example, one may search for a single word, say "interesting". The query language consists of Boolean combinations of regular expressions, which use the POSIX EGREP syntax, e.g. the query
"interest(s|(ed|ing)(ly)?)?"
yields a search for any of the words interest, interests, interested, interesting, interestedly, interestingly.

One can also search for the lemma: "lemma=go" should yield sentences containing the words go, goes, going, went and gone. The search can be focused on part of speech, "POS=VERB". CWB deals with incomplete specifications by using regular expressions. For example, a verb can be subcategorized as VBG (present/past) and VGN (participle). The query [pos="VB.*"] matches both forms and may be used to match all parts of speech that start with the letters VB. ("." matches any single character and "*" after a pattern indicates 0 or more repetitions, thus ".*" matches any string of length 0 or more.) Finally, a query may consist of several words; thus ["boy"][POS=VERB] yields all sentences that contain the word boy followed by a verb.

To accommodate linguistic searches, the corpus needs to be tagged with the appropriate data (such as lemma and POS). The system then loads the tagged corpus to create an index. To that end the corpus should be reformatted in a special format.

CWB has been used for a variety of languages. It also supports UTF-8, thus allowing easy processing of non-Latin alphabets.

4. Creating an Index
In principle we adopted the CWB solution to partial and multiple analyses, i.e., using regular expressions for partial matches. We created a composite POS consisting of the concatenation of all subfields of the analysis. For example, the complete morphological analysis of hšxqnim "the (male) players" is "NOUN-masculine-plural-absolute-definite"; we encode all this information as the POS of the word, [pos="šxqnim-NOUN-masculine-plural-absolute-definite"]: the lemma is šxqnim, the main POS is noun, the gender masculine, the number plural, the status absolute, and the prefix h indicates that the word is definite. We included the lemma, since each analysis might have a different lemma.

To accommodate multiple analyses, we concatenate all the analyses (separated by ":"). For example,
mxiibim
:mxaiib-ADJECTIVE-masculine-plural-abs-indef
:xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef
:xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present:PREFIX-m-preposition
:xiib-NOUN-masculine-plural-abs-indefinite
:PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef:

The analyses are:
1. The adjective mxiib, gender masculine, number plural, status absolute; the word is indefinite.
2. A participle of the verb xiib, whose binyan (verb inflection pattern) is Pi'el and whose root is xwb; the person is unspecified, gender masculine, number plural, the type of participle is noun, the status absolute and the word is indefinite.
3. A verb whose root is xwb, binyan Pi'el, person unspecified, number plural and tense present.
4. The noun xiib, prefixed by the preposition m.
5. The adjective xiib, prefixed by the preposition m.

Thus one can retrieve the word by any one of the queries by POS:
[POS=".*-ADJECTIVE-.*"], [POS=".*-PARTICIPLE-.*"], [POS=".*-VERB-.*"], [POS=".*-NOUN-.*"].
However, one may also specify additional properties by using a pattern that matches subfields:
[POS=".*PREFIX-[^:]*preposition[^:]*-NOUN-.*"]
indicating that we are searching for a noun that is prefixed by a preposition. The sequence [^:]* denotes any sequence of 0 or more characters that does not contain ":" and is used to skip over unspecified sub-fields. Since the different analyses of a word are separated by ":" and ":" cannot appear within an analysis, the query cannot be satisfied by matching one part of the query against one analysis and the remainder of the query against a subsequent analysis.

To create an index from a corpus, we first run the morphological analyzer of MILA (Itai and Wintner 2008), which creates XML files containing all the morphological analyses for each word. We developed a program to transform the XML files to the above format, which conforms to CWB's index format. Thus we were able to create CWB files.

Our architecture enables some Boolean combinations. Suppose we wanted to search for a two-word noun-adjective expression in which noun and adjective agree in number. We could require that the first word be a singular noun and the second word a singular adjective, or that the first word be a plural noun and the second word a plural adjective. The query
([pos=".*NOUN-singular-.*"] [pos=".*ADJECTIVE-singular-.*"])
|([pos=".*NOUN-plural-.*"] [pos=".*ADJECTIVE-plural-.*"])
However, one must be careful to avoid queries of the type
[pos=".*NOUN.*" & pos=".*-singular-.*"]
since then we might return a word that has one analysis as a plural noun and another analysis as a singular verb.

5. Performance
To test the performance of the system we uploaded a file of 814,147 words, with a total of 1,564,324 analyses, i.e., 2.36 analyses per word. Table 1 shows a sample of queries and their performance. The more general the queries, the more time they required. However, the running time for these queries is reasonable. If the running time is linear in the size of the corpus, CWB should be able to support queries to 100 million word corpora.

One problem we encountered is that of space. The index of the 814,147 word file required 25.2 MB. Thus each word requires about 31 bytes, and a 100 million word corpus would require a 3.09 Gigabyte index file.

6. Writing Queries
Even though it is possible to write queries in the above format, we feel that it is unwieldy. First, the format is complicated and one may easily err. More importantly, in order to write a query one must be familiar with all the features of each POS and with the order in which they appear in the index. This is extremely user-unfriendly and we do not believe many people would be able to use such a system.

To overcome this problem, we are in the process of creating a GUI which will show for each POS the appropriate subfields; once a subfield is chosen, a menu will show all possible values of that subfield. Unspecified subfields will be filled by placeholders. The graphic query will then be translated to a CWB query and the results of this query will be presented to the user (a sketch of such a translation is given after Table 1). We believe that the GUI will also be helpful for queries in languages that already use the CWB format.

Regular Expression Time (sec) Output File (KB)

[pos=".*-MODAL-.*"] [pos=":‫היה‬-.*"][pos=".*-VERB-[^:]*-infinitive:.*"]; 0.117 13

[word="‫[]"על‬word="‫[]"מנת‬pos=".*-VERB-[^:]*-infinitive:.*"]; 0.038 28

[word="‫[]"על‬word="‫[]"מנת‬pos=".*PREFIX-‫ש‬.*"]; 0.025 5

[pos=".*:‫הלך‬-[^:]*-present:.*"] [pos=".*-VERB-[^:]*-infinitive:.*"]; 0.099 2

[word="‫[]"בית‬pos=":‫ספר‬.*"]; 0.017 7

[word="‫[]"בית‬pos=".*:‫^[ספר‬:]*-SUFFIX-possessive-.*"]; 0.014 1

"‫;"כותב‬ 0.009

[pos=".*:[^:]*-VERB-[^:]*:.*"]; 0.569

".*"; 1.854

[pos=".*"]; 1.85

".*"; [pos=".*"]; 3.677

All the previous regular expressions concatenated 7.961

("‫[ | "כותב‬pos=".*:[^:]*-VERB-[^:]*:.*"] | ".*" | [pos=".*"] ); 2.061

([pos=".*:[^:]*-VERB-[^:]*:.*" & word="‫ & "כותב‬word=".*" & pos=".*"]); 0.168

Table 1: Example queries and their performance.
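As a rough illustration of the GUI-to-query translation described in Section 6, here is a minimal sketch. The per-POS field order, the NOUN layout and the placeholder choice are assumptions made for illustration only; they do not necessarily reflect the actual index layout.

# Minimal sketch: build a CWB token pattern over the composite "pos" attribute
# from the subfields a user has picked in the planned GUI.
FIELD_ORDER = {
    "NOUN": ["lemma", "pos", "gender", "number", "status", "definiteness"],  # assumed order
}

def build_token_pattern(pos, chosen):
    parts = []
    for field in FIELD_ORDER[pos]:
        if field == "pos":
            parts.append(pos)
        else:
            # "[^:]*" skips an unspecified subfield without crossing the ":" that
            # separates one analysis from the next.
            parts.append(chosen.get(field, "[^:]*"))
    return '[pos=".*' + "-".join(parts) + '-.*"]'

# A user who selected only number = plural and definiteness = definite:
print(build_token_pattern("NOUN", {"number": "plural", "definiteness": "definite"}))
# -> [pos=".*[^:]*-NOUN-[^:]*-plural-[^:]*-definite-.*"]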

7. Conclusion
Until now, linguistic searches have been oriented towards Western languages. Semitic languages exhibit more complex patterns, which at first sight might seem to require designing entirely new tools. We have shown how to reuse an existing tool to efficiently conduct sophisticated searches.

The interface of current systems is UNIX based. This might be acceptable when the linguistic features are simple; however, for complex features, it is virtually impossible to memorize all the possibilities and render the queries properly. Thus a special GUI is necessary.

8. Acknowledgements
It is a pleasure to thank Ulrich Heid, Serge Heiden and Andrew Hardie who helped us use CWB. Last and foremost I wish to thank Gassan Tabajah whose technical assistance was invaluable.

9. References
Official Web page of CWB: http://cwb.sourceforge.net/
Bar Haim, R., Sima'an, K. and Winter, Y. (2005). Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew. In ACL Workshop on Computational Approaches to Semitic Languages.
Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium catalog number LDC2002L49, ISBN 1-58563-257-0.
Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium catalog number LDC2004L02, ISBN 1-58563-324-0.
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In Papers in Computational Lexicography (COMPLEX '94), pp. 22--32, Budapest, Hungary.
Christ, O. and Schulze, B. M. (1996). Ein flexibles und modulares Anfragesystem für Textcorpora. In H. Feldweg and E. W. Hinrichs (eds.), Lexikon und Text, pp. 121--133. Max Niemeyer Verlag, Tübingen.
Diab, M., Hacioglu, K. and Jurafsky, D. (2004). Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In HLT-NAACL: Short Papers, pp. 149--152.
Habash, N. and Rambow, O. (2005). Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 573--580, Ann Arbor.
Hajič, J. (2000). Morphological tagging: data vs. dictionaries. In Proceedings of NAACL-ANLP, pp. 94--101, Seattle, Washington.
Hajič, J. and Hladká, B. (1998). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL 1998, pp. 483--490, Montreal, Canada.
Itai, A. and Wintner, S. (2008). Language Resources for Hebrew. Language Resources and Evaluation, 42, pp. 75--98.
Lee, Y-S. et al. (2003). Language model based Arabic word segmentation. In ACL 2003, pp. 399--406.
Segal, E. (2001). Hebrew morphological analyzer for Hebrew undotted texts. M.Sc. thesis, Computer Science Department, Technion, Haifa, Israel.

Algerian Arabic Speech Database (ALGASD):
Description and Research Applications

G. Droua-Hamdani 1, S. A. Selouani 2, M. Boudraa 3


1 Speech Processing Laboratory (TAP), CRSTDLA, Algiers, Algeria.
2 LARIHS Laboratory. University of Moncton, Canada.
3 Speech Communication Laboratory, USTHB, Algiers, Algeria.
gh.droua@post.com, selouani@umcs.ca, mk.boudraa@yahoo.fr

Abstract
This paper presents the Algerian Speech Database (ALGASD) for Standard Arabic and related research applications. The project involves 300 Algerian native speakers selected statistically from 11 regions of the country; these areas are assumed to represent the principal variations of pronunciation observed across the population.
ALGASD takes into consideration features such as the gender, age and education level of every speaker. The basic text used to build the database consists of 200 phonetically balanced sentences, and the number of recordings reaches 1080 read sentences.
ALGASD provides an interesting set of perspectives for speech science applications, such as automatic speech recognition, acoustic-phonetic analysis and prosodic research. The voice bank has already been used in several studies, such as rhythm analysis of Algerian speakers, and in the training and testing phases of speech recognition systems.

1. Introduction
The use of databases in the development of "human-machine" applications, such as Text-To-Speech and Speech Recognition systems, is very important. Over the last decades, many different databases have been created. They can be:
- multilingual, constituting important projects containing several languages (Van den Heuvel & Galounov & Tropf, 1998; Schultz & Waibel, 1998; Roach & Vicsi, 1996; Chan & al., 1995; Siemund & al., 2000), or monolingual, limited to only one language (Timit, 1990; Vonwiller & al., 1995; ELRA Ref120);
- devoted to official or dialectal languages (Gopalakrishna & al., 2005);
- reserved or not for a restricted domain, such as telephony (Petrovska & al., 1998; Zherng & al., 2002);
- dealing with continuous speech, read texts, etc. (Siemund & al., 2000; ELRA Ref120; Gopalakrishna & al., 2005).

In comparison with the multitude of oral corpora realized for European and Asian languages, corpora dedicated to the Arabic language and its dialects are less frequent. To our knowledge there exist: the LDC corpus of spontaneous telephone speech produced by Egyptian, Syrian, Palestinian and Jordanian speakers (LDC), ELRA's corpora of Standard Arabic read by Moroccan speakers [LDC], the oral corpus of GlobalPhone (Siemund & al., 2000), the Nemlar corpus recorded from radio stations (Choukri & Hamid & Paulsson, 2005) and finally SAAVB for the Saudi accent (Mohamed & Alghamdi & Muzaffar, 2007).

2. Standard Arabic in Algeria's Languages
The ALGASD project consists of the conception and realization of an Algerian voice bank with Standard Arabic as its substratum.
Situated in the north of Africa, Algeria extends over a vast territory of 2,380,000 km2 occupied by about 34.8 million inhabitants, the majority of whom are concentrated in the north.
Algeria's official language is Standard Arabic (SA). It is used for all administrative tasks (government, media, etc.). It is taught for approximately 13 years to children from 6 to 17 years old across three academic levels: primary, middle and secondary school. However, written SA differs very substantially from the Algerian spoken languages (mother tongues). Indeed, approximately 72% of the population speaks Darija, the Algerian Arabic dialects, in daily life, and 28% have a second mother tongue called Tamazight, which is a Berber language.
Algerian dialects are variants of SA stemming from ethnic and geographical influences and from those of colonial occupiers such as Spanish, French, Turkish and Italian. While Algiers Darija is influenced by both Berber and Turkish, the Constantine dialect is affected by Italian, Oran by Spanish, Tlemcen by Andalusian Arabic, etc. As a result, within Algerian Arabic itself there are significant local variations (in pronunciation, grammar, etc.) observed from town to town, even when towns are near each other (Taleb, 1997; Marçais, 1954; Caubet, 2001).

These two native languages (Darija and Tamazight) thus constitute the principal means of oral communication between Algerians. In addition, a third language used by some Algerians is French, which has no official status but is still widely used by government, culture,

media (newspapers), universities, etc. (Arezki, 2008; Cheriguen, 1997).

3. Corpus Design
The text material of ALGASD is built from 200 Arabic Phonetically Balanced Sentences (APBS) (Boudraa & Boudraa & Guerin, 1998), from which we conceived three types of corpora, each aiming to provide specific acoustic-phonetic knowledge. The Common Corpus (Cc) is used to list a maximum of dialectal variations of pronunciation observed among Algerians; it is composed of two utterances of the APBS read by all speakers. The Reserved Corpus (Cr) brings together all the phonetic oppositions existing in the Arabic language; it comprises 30 APBS sentences divided into 10 texts of 3 sentences, shared between groups of speakers. In order to increase the occurrences of some consonants, we sometimes broke the phonetic balance. The Individual Corpus (Ci) is constituted of the 168 remaining sentences, which are used to gather a maximum of contextual allophones.

To elaborate ALGASD, we selected 300 Algerian speakers from 11 regions of the country which map the most important variations of pronunciation between inhabitants. All participants are native speakers and grew up in or near the localities selected for this research. According to the most recent census of inhabitants available on the ONS web site (ONS), we distributed all speakers statistically between these areas with regard to the real number and gender of inhabitants of each region (Table 1).

Regions / Female / Male / Total speakers per region
R1 Algiers: 40 (50%) / 40 (50%) / 80 (27%)
R2 Tizi Ouzou: 17 (50%) / 17 (50%) / 34 (11%)
R3 Medea: 13 (52%) / 12 (48%) / 25 (8%)
R4 Constantine: 13 (52%) / 12 (48%) / 25 (8%)
R5 Jijel: 09 (50%) / 09 (50%) / 18 (6%)
R6 Annaba: 09 (52%) / 08 (48%) / 17 (6%)
R7 Oran: 19 (50%) / 19 (50%) / 38 (13%)
R8 Tlemcen: 13 (50%) / 13 (50%) / 26 (9%)
R9 Bechar: 04 (52%) / 03 (48%) / 07 (2%)
R10 El Oued: 08 (50%) / 08 (50%) / 16 (5%)
R11 Ghardaïa: 07 (50%) / 07 (50%) / 14 (5%)
Total (11 regions): 152 (51%) / 148 (49%) / 300 (100%)

Table 1: Speakers' distribution in ALGASD

4. ALGASD Features
The speaker profile used in the database takes into consideration the age and education level of every speaker. For these two features we defined, respectively, three categories: (18-30 / 30-45 / +45) and (Middle / Graduate / Post-Graduate).
Recordings were made in quiet environments well known to the speakers, and the same recording conditions were respected for all regions. We selected the best readings, deleted all sentences which contained hesitations, and re-recorded utterances which were not spoken clearly or correctly, or which were too soft or too loud. The average duration of the sentences is about 2.8 seconds and the speaking rate is normal. The sound files are in wave format, coded on 16 bits and sampled at 16 kHz (a minimal format check is sketched at the end of Section 5).

5. Recordings
Recordings proceeded as follows. The Cr texts were distributed periodically, 3 texts at a time, over the 11 regions. In the beginning, we assigned these 3 texts to 3 speakers (2 male / 1 female) per region, except for R9, which was given only 2 texts for 2 speakers (1 male / 1 female). We then augmented the number of recordings by increasing the number of speakers for each region (Table 2); the total number of speakers and recordings thus reached 86 and 258 sound files respectively.
The Cc text material was read by all 300 speakers of ALGASD, yielding 600 recordings. As regards the Ci texts, we realized 2 different sub-sets of recordings: the first one contains 32 utterances read by all speakers of the Cr corpus; the second is constituted of 136 sentences statistically distributed between 136 other speakers across all regions. Two sentences remained from this operation; we added them to the R9 texts because that region contained the smallest number of speakers.

Region / M / F / Recordings
R1: 12 / 11 / 69
R2: 5 / 5 / 30
R3: 4 / 3 / 21
R4: 4 / 3 / 21
R5: 3 / 2 / 15
R6: 3 / 2 / 15
R7: 6 / 5 / 33
R8: 4 / 3 / 21
R9: 1 / 1 / 6
R10: 3 / 2 / 15
R11: 2 / 2 / 12
Total (11 regions, 86 speakers): 47 / 39 / 258

Table 2: Recordings of the Cr corpus

In conclusion, 28% of the speakers read 6 sentences, 45% read 3 sentences and 26% read only 2. The total number of ALGASD recordings reached 1080 (Table 3).
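Given the release format stated in Section 4 (WAV files, 16-bit samples at 16 kHz), downstream tools can sanity-check a file before analysis or recognition experiments. The following is a minimal sketch using Python's standard wave module; the file name is hypothetical, and the channel count is only reported, not checked, since it is not stated in the paper.

import wave

def check_algasd_wav(path, expected_rate=16000, expected_sampwidth_bytes=2):
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        sampwidth = w.getsampwidth()      # bytes per sample: 2 bytes = 16 bits
        channels = w.getnchannels()
        duration = w.getnframes() / float(rate)
    ok = (rate == expected_rate) and (sampwidth == expected_sampwidth_bytes)
    print(f"{path}: {rate} Hz, {8 * sampwidth}-bit, {channels} channel(s), {duration:.2f} s "
          f"-> {'OK' if ok else 'unexpected format'}")
    return ok

check_algasd_wav("R1_speaker01_sentence001.wav")  # hypothetical file name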

Corpora / N° utterances / Speakers / Total recordings
Cc: 2 / 300 / 600 (55.5%)
Cr: 30 / 86 / 258 (24.0%)
Ci: 168 / 222 / 222 (20.5%)
TOTAL: 200 / 300 / 1080 (100%)

Table 3: Total corpora and speakers of ALGASD

6. Research Applications of the ALGASD Speech Corpus
The ALGASD corpus is characterized by many aspects: a high quality of recordings, a large number of speakers, and speaker features which reflect many differences due to region, age, gender, education level and dialect variety. All these characteristics provide an interesting set of perspectives for speech science applications, such as automatic speech recognition, acoustic-phonetic analysis, perceptual experiments to study the classification of the different regional varieties of SA spoken within Algeria, prosodic studies such as rhythm, comparison of Algerian SA with the Arabic of Maghreb or eastern countries, etc.

The ALGASD database has already been used in several studies: a statistical study of qualitative and quantitative vocalic variations according to the education levels of Algiers speakers (Droua-Hamadani & Selouani & Boudraa & Boudraa, 2009); the location of Algerian Standard Arabic rhythm among stressed languages (to appear); and the impact of education levels on duration and rhythm of Algerian Modern Standard Arabic (to appear). By respecting some recommendations in the selection and distribution of both sound material and speakers, we built from ALGASD the two corpora required to train and test a speech recognition system for Algerian Standard Arabic (Droua-Hamadani & Selouani & Boudraa & Boudraa, 2009).

7. References
Arezki, A. (2008). Le rôle et la place du français dans le système éducatif algérien. Revue du Réseau des Observatoires du Français Contemporain en Afrique, N° 23, pp. 21-31.
Boudraa, M. & Boudraa, B. & Guerin, B. (1998). Twenty Lists of Ten Arabic Sentences for Assessment. ACUSTICA Acta-acustica, Vol. 84, 211-214.
Caubet, D. (2001). Questionnaire de dialectologie du Maghreb (d'après les travaux de W. Marçais, M. Cohen, G.S. Colin, J. Cantineau, D. Cohen, Ph. Marçais, S. Levy, etc.). Estudios de dialectología norteafricana y andalusí, pp. 73-92.
Chan, D. & al. (1995). EUROM - a Spoken Language Resource for the EU. Eurospeech 9. Proceedings of the 4th European Conference on Speech Communication and Technology. Madrid, Spain.
Cheriguen, F. (1997). Politique linguistique en Algérie. Mots, Les langages du politique, n° 52, pp. 62-74.
Choukri, K., Hamid, S. & Paulsson, N. (2005). Specification of the Arabic Broadcast News Speech Corpus NEMLAR: http://www.nemlar.org.
Droua-Hamadani, G. & al. (2009). ALGASD PROJECT: Statistical Study of Vocalic Variations according to Education Levels of Algiers Speakers. Intonational Variation in Arabic Conference IVA09, York (England).
Droua-Hamadani, G. & Selouani, S.A. & Boudraa, M. (2009). ALGASD Algerian voice bank project: ALGASD's adaptation for continuous speech recognition systems. The 5th International Conference on Computer Science Practice in Arabic (CSPA '09, AICCSA09-IEEE), Rabat (Morocco).
Gopalakrishna, A. & al. (2005). Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems. Proceedings of the International Conference on Speech and Computer (SPECOM), Patras, Greece.
Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu.
Marçais, P. (1954). Textes arabes de Djidjelli. Presses universitaires de France.
Mohamed, A., Alghamdi, M. & Muzaffar, Z. (2007). Speaker Verification Based on Saudi Accented Arabic Database. International Symposium on Signal Processing and its Applications, in conjunction with the International Conference on Information Sciences, Signal Processing and its Applications. Sharjah, United Arab Emirates.
National Office of Statistics (ONS): http://www.ONS.dz.
Petrovska, D. & al. (1998). POLYCOST: A Telephone-Speech Database for Speaker Recognition. Proceedings of RLA2C ("Speaker Recognition and its Commercial and Forensic Applications"), Avignon, France, pp. 211-214. (http://circhp.epfl.ch/polycost).
REF120 corpus, available from the European Language Resources Association: http://www.icp.grenet.fr/ELRA.
Roach, P. & Vicsi, K. (1996). BABEL: An Eastern European Multi-Language Database. COST249 meeting, Zurich.
Schultz, T. & Waibel, A. (1998). Das Projekt GlobalPhone: Multilinguale Spracherkennung. In Computers, Linguistics, and Phonetics between Language and Speech, Bernhard Schröder et al. (Eds.), Springer, Berlin. Proceedings of the 4th Conference on NLP - Konvens-98, Bonn, Germany.

Siemund, R. & al. (2000). SPEECON – Speech Data for


Consumer Devices. Proceedings of LREC 2000.

Taleb Ibrahimi, K. (1997) Les Algériens Et Leur(s)


Langue(S), Eléments pour une approche
sociolinguistique de la société algérienne. Les editions
EL HIKMA. Deuxième Edition.

Texas Instrument and Massachusetts Institute of


Technology corpus (TIMIT) (1990): Acoustic-
Phonetic Continuous Speech Corpus. DMI.

Tseng, C. & Lee, W. & Huang, F. (2003). Collecting


Mandarin Speech Databases for Prosody
Investigations. The Oriental COCOSDA. Singapore.

Van den Heuvel, H. & Galounov, V. & Tropf, H.S


(1998). The SPEECHDAT (E) project: Creating
speech databases for eastern European languages.
Proceedings Workshop on Speech Database
Development for Central and Eastern European
Languages. Granada, Spain.

Vonwiller, J. & al (1995). Speaker and Material Selection


for the Australian National Database of Spoken
Language. Journal of Quantitative Linguistics, 2: 177-
211.

Zherng, T.F & al (2002). Collection of a Chinese


Spontaneous Telephone Speech Corpus and Proposal
of Robust Rules for Robust Natural Language Parsing.
Joint International Conference of SNLP- O-
COCOSDA, Hua Hin, Thailand.

Integrating Annotated Spoken Maltese Data into Corpora
of Written Maltese
Alexandra Vella †*, Flavia Chetcuti†, Sarah Grech†, Michael Spagnol‡
University of Malta†, University of Cologne*, University of Konstanz‡
alexandra.vella@um.edu.mt, fchetcuti@hotmail.com, sgrec01@um.edu.mt, michael.spagnol@uni-konstanz.de

Abstract
Spoken data features to a lesser extent in corpora available for languages than do written data. This paper addresses this issue by
presenting work carried out to date on the development of a corpus of spoken Maltese. It outlines the standards for the PRAAT
annotation of Maltese data at the orthographic level, and reports on preliminary work on the annotation of Maltese prosody and
development of ToBI-style standards for Maltese. Procedures being developed for exporting PRAAT TextGrid information for the
purposes of incorporation into a predominantly written corpus of Maltese are then discussed. The paper also demonstrates how
characteristics of speech notoriously difficult to deal with have been tackled and how the exported output from the PRAAT annotations
can be enhanced through the representation also of phenomena, sometimes referred to as “normal disfluencies”, which include “filled
pauses” and other vocalisations of a quasi-lexical nature having various functions of a discourse-management type such as
“backchannelling”.

1. Introduction
Annotation of spoken, as compared to written data, tends to feature to a lesser extent in corpora available for languages. For instance, the British National Corpus, "a 100 million word collection of spoken and written language from a wide range of sources, designed to represent a wide cross-section of British English" (British National Corpus), contains only about 10% of spoken data. One reason for the lesser inclusion of spoken data into corpora is the greater degree of pre-processing work which needs to be done to it before it can be included, as text, into a corpus (Gibbon et al., 1997).

The purpose of this contribution is threefold. Firstly, it reports on the current state and continuing development of the corpus of spoken Maltese and its annotation, as well as on the development of standards and guidelines for this (Vella et al., 2008). Availability of such standards and guidelines should enable the training of annotators and allow for more spoken data to be prepared for inclusion in corpora of Maltese. Second, it demonstrates what procedures need to be developed in order for the PRAAT (Boersma & Weenick) TextGrid annotations available to date to be converted into "running text". Such conversion is important since it will allow incorporation of the annotations carried out into other corpora of Maltese developed in the context of projects such as MaltiLex and MLRS (Dalli, 2001; Rosner et al., 1998; Rosner et al., 2000; Rosner, 2009). Third, it outlines continuing linguistic analysis of features of (quasi-)spontaneous speech known to be particularly difficult to deal with. The features in question include intonation, but also phenomena of the sort sometimes referred to as "normal disfluencies", which include, amongst others, "repetitions", "repairs" and "filled pauses" (Cruttenden, 1997), as well as other features which have been shown to serve various functions of a discourse-management type such as "backchannelling" (original use due to Yvnge, 1970). Work on analysis of these features is ongoing (Vella et al., 2009).

2. Work to date
Work on the annotation of spoken Maltese carried out within the context of two projects, MalToBI and SPAN, has produced preliminary guidelines for the annotation of spoken data from Maltese together with a small amount of annotated data. Annotation is being done using PRAAT and has to date concentrated on quasi-spontaneous Map Task dialogue data from the MalToBI corpus (Vella & Farrugia, 2006). The available annotations have a structure involving different types of information included in separate TIERS. The tiers in the current annotation are the following:
1. SP(eaker)1 and SP(eaker)2;
2. Br(eak)-Pa(use)-O(verlap)s;
3. T(arget)I(tem)s;
4. F(illed)P(ause)s;
5. MISC(ellaneous).

Three further tiers are also included in the TextGrids for data for which prosodic annotation has been carried out. These are as follows:
6. Tone;
7. Prom(inence);
8. Functions.

Standards for prosodic annotation are still undergoing development. Standards and guidelines for orthographic annotation of spoken data, by contrast, are at an advanced stage of development. A detailed description of the standards and guidelines which have been used in producing the annotations available to date is given in Section 3 below. Procedures which have started being developed to allow incorporation of the annotations of the spoken data into corpora consisting mainly of written Maltese are then discussed in Section 4. Lastly, Section 5 describes and discusses how we annotated features of speech which are particularly difficult to handle in that they have no direct and/or obvious "written" correlate. Such features include prosody (which will only be discussed briefly here), but also phenomena sometimes referred to as "normal disfluencies" (see Cruttenden, 1997 and above). Particular attention will be given to vocalisations sometimes classified as "filled pauses", as

well as to "backchannels" or features involved in providing feedback to other participants in a dialogue. Some concluding remarks are provided in Section 6.

3. Structure of the annotations
As mentioned in Section 2, information of different types is included in separate tiers in the PRAAT annotations carried out. A sample of a very short extract from one recording, together with the associated annotation, is shown in Figure 1 below.

[Figure 1: Sample excerpt from MC_CB_C1 – a PRAAT screenshot showing a short stretch of annotated dialogue, with the speaker, break/pause/overlap, target item and filled-pause tiers time-aligned to the waveform.]

3.1 Annotation tiers
The standards used in carrying out the annotations are summarised below in Subsections 3.1.1 – 3.1.4. Subsections 3.1.1 – 3.1.3 deal with the SP1 and SP2, Br-Pa-Os and FPs tiers respectively, while Subsection 3.1.4 deals with the remaining TIs and MISC tiers, as well as, briefly, with the prosodic Tone, Prom and Functions tiers.

3.1.1. SP1 and SP2 tiers
The word-by-word annotation makes use of standard orthography, including the new spelling rules published by Il-Kunsill Nazzjonali tal-Ilsien Malti in 2008, and all Maltese characters use Unicode codification (see Akkademja tal-Malti, 2004; Kunsill tal-Malti, 2008). In a number of cases, however, there is some variation with respect to regular standard orthography, as it is considered important for the word-by-word annotation to provide as close a record as possible to what was actually said. Thus, for example, in cases of elision of different sorts, a convention similar to that used in standard orthography (e.g. tazz'ilma for tazza ilma 'a glass of water'), that is the use of an apostrophe, is extended to include initial elision, e.g. 'iġifieri for jiġifieri 'that is to say'.1 There are also instances of insertions. In such cases, inserted segments are added to the transcription in square brackets (e.g. nagħmlu [i]l-proġett 'we [will] perform the task').

Capitalisation follows punctuation rules in Maltese. The first letter of target items (on which, see also Subsection 3.1.4) is always capitalised. When lexical stress in target items is misplaced, the syllable which has been stressed is capitalised in the annotation, e.g. the expected position of stress in the proper noun PERgola in the target item Hotel Pergola 'Hotel Pergola' is antepenultimate; capitalisation in the TextGrid annotation PerGOla indicates that stress was assigned penultimately by the speaker in this instance.2

Sentential punctuation marks such as question marks ( ? ) and full-stops ( . ) are included in the annotation, and are generally used in line with punctuation conventions rather than to indicate a fall or rise in pitch. Final punctuation marks such as exclamation marks ( ! ), ellipsis ( ... ), quotation marks ( ' ', " " ), etc., by contrast, have not been included in the annotation. The punctuation marks in this group are intended to indicate, in written text, the presence of elements typical of speech. Specifically, the exclamation mark indicates use of intonation of a particularly "marked" kind, whilst ellipsis often indicates a pause in speech or an unfinished sentence. Both these elements are catered for in tiers other than the SP1 and SP2 tiers (see Subsections 3.1.4 and 3.1.2 respectively). Quotation marks in written texts often indicate direct, as opposed to indirect speech or narrative, not a relevant factor with respect to the annotation standards being discussed given that the texts in question consist solely of speech. Hyphens ( - ), accents ( ` ) and apostrophes ( ' ) are used in the normal way as for written Maltese. Note that apostrophes are also used to indicate elision, as noted above. Internal punctuation marks such as dashes ( – ), semi-colons ( ; ), colons ( : ), and particularly commas ( , ), an important element of punctuation in written texts, are avoided although also catered for in the annotations (see Subsection 3.1.2). Such punctuation marks sometimes coincide with the location of phrase boundaries of different sorts, but do not always do so. Their use is not as clearly regulated as is that of other punctuation marks, and therefore would give poor results in terms of inter-transcriber reliability.

Phrasal units involving a determiner and a noun or adjective (e.g. il-bajja 'the bay', il-kbir 'big, lit. the big', etc.), as well as units with a particle plus determiner and noun or adjective (fid-direzzjoni 'in the direction (of)', tar-re 'of the king', etc.) are segmented together. Simple particles, on the contrary, are segmented as expressions separate from the word they precede (e.g. ta' Mejju 'of May', fi Triq Ermola 'in Ermola Street', etc.). Additional conventions used which are at odds with standard punctuation are question marks at the beginning of a word to mark a dubious or unclear expression (e.g. ?Iwa/Imma 'yes/but'), asterisks immediately before a word to mark an ungrammatical or non-existent expression in the language (e.g. il-bajja *tar-Ray lit. 'Ray's the bay') as

1 Where possible, examples provided are taken from the Map Task annotations carried out.
2 Where an indication of lexical stress is necessary in the above, the syllable in question is shown in bold capitals.

used in linguistics to indicate unacceptability, and slashes on both sides of a word to indicate non-Maltese words (e.g. /anticlockwise/).

3.1.2. Br-Pa-Os tier
The Br-Pa-Os tier is used to indicate the presence of breaks and pauses, as well as that of overlap, in the dialogues. Examples of the distinctions made, together with a description of specific characteristics in each case, are given in Table 1 below.

Examples | Characteristics
Triq Mar<...> Br Triq Mannarino. 'Mar<...> Street Br Mannarino Street.' | False start or repair; truncation and correction; unexpected
Għaddi minn bej' Sqaq il-Merill Br u Triq Marmarà. 'Go between Merill Alley Br and Marmarà Street.' | Intra-turn break within constituent; unexpected
Għaddi Br minn bejniethom. 'Go Br between them.' | Intra-turn break before adverbial; less unexpected
Mela Br tibda mill-Bajja ta' Ray. 'So Br begin at Ray's Bay.' | Similar to comma; break before main clause; expected
u Triq Marmarà. Pa Ibqa' tielgħa. 'and Marmarà Street. Pa Keep walking upwards.' | Intra-speaker; full-stop break across sentences; expected
SP1: Ibqa' tielgħa. Pa-C SP2: Sewwa. 'Keep walking upwards. Pa-C Good.' | Inter-speaker; full-stop break across sentences; expected
SP2: M-hm? SP1: Għaddi minn bejn Sqaq il-... 'M-hm? O Go between Merill Alley...' | Inter-speaker overlap

Table 1: Examples of Br-Pa-O distinctions made

Differentiation between breaks and pauses is based on a broad distinction between intra-sentence gaps, labelled Break, and inter-sentence ones, labelled Pause. Transcribers were instructed to allow intonation, as well as their intuitions, to inform their decisions. The distinction between Break and Pause correlates, roughly speaking, with the comma vs. full-stop distinction made in writing. Unexpected intra-speaker mid-turn pauses associated with "normal disfluency"-type phenomena mentioned earlier (see Sections 1 and 2) are however also labelled as Break. Within-speaker pauses across sentences are distinguished from those across speakers by means of the label Pause vs. Pause-C(hange). A study of both the distribution and durational characteristics of breaks vs. pauses is planned. Such a study is expected to throw light on the nature of different types of phonological boundaries and related boundary strength in Maltese (but see also Section 3.1.4 below).

3.1.3. The FPs tier
This tier is a very important part of the annotations. It is used to note the position of any "non-standard forms" transcribed in the SP1 and SP2 tiers, such forms being roughly defined as "forms not usually found in a dictionary". The FPs tier is extremely useful to the phonetician using PRAAT as her/his main analysis tool since it increases searchability (but see also Subsection 3.1.4).

One of the difficulties encountered in the annotation of "non-standard forms" is that such forms have no clearly recognisable "standard" representation, something which can prove problematic even to established writers. In fact, of the forms whose occurrence is noted in the FP tier, only six, namely "e", "eħe", "eqq", "ew", "ħeqq" and "ta", are listed in Aquilina's (1987/1990) dictionary. In addition, other researchers refer to different forms or to similar forms having functions that seem to be different to the ones in the data analysed (see, e.g., Borg & Azzopardi-Alexander, 1997; Mifsud & Borg, 1997).

Forms of the sort whose occurrence is noted in this tier are typically found in spontaneous speech and include phenomena such as "repetitions", "repairs" and "filled pauses" (mentioned earlier and reported in Cruttenden, 1997). Such phenomena often serve clear functions and have their own specific characteristics, phonetic, including prosodic, as well as otherwise (see, e.g., Shriberg, 1999).

As things stand at present, in fact, the FPs tier conflates into one relatively undifferentiated group a number of different phenomena.3 Preliminary analysis of elements included in this tier reported by Vella et al. (2009) makes possible a distinction between "real" filled pauses (FPs) and other phenomena – the latter will also be discussed in this Subsection.

Although consensus amongst researchers on what exactly "counts" as an FP is limited, linguists usually agree that FPs are discourse elements which, rather than contributing information, "fill" silences resulting from pauses of various sorts. They also agree that such elements can contribute meaning and/or communicative function but do not always do so, and that they often have a role to play in the organisation of discourse.

Preliminary analysis of the forms flagged in the FPs tier in the data annotated included a durational, distributional and phonetic (particularly prosodic) study of the forms. One outcome of the durational study is that it has made possible the standardisation of annotation guidelines for the various "non-standard forms" found in the data. To give one example, the original annotations include three "different" forms: e, ee and eee. The durational study carried out however suggests that the different labels do not in fact correlate with a difference in the duration of the entities in question: instances transcribed as eee are not in fact longer than instances transcribed as ee, which are not,

3 It is possible that the FP tier will in fact be renamed in subsequent annotation work.

in turn, longer than instances transcribed as e: the one label eee is therefore being suggested as the "standard" for all occurrences of this type of FP and the original annotations have been amended accordingly. References below are to the labels as amended rather than to the original labels.

The analysis mentioned above has led to the identification of a number of "real" FPs in Maltese, similar to FPs described for other languages. These are eee, mmm and emm, all of which contain the element/s [e] and [m]. The analysis also noted two other "forms" of FPs annotated in the data, namely ehh and ehm. While the latter may be phonetic variants of the above-mentioned eee and emm, instantiations of ehh and ehm in the data annotated are significantly longer than their eee and emm counterparts. They also appear to have an element of "glottalisation" not normally characteristic of instances of eee and emm.4

The distributional analysis of the "real" FPs eee, mmm and emm suggests that, overall, there is a very high tendency for silence to occur to the left, to the right, or on both sides of these FPs. A slightly greater tendency for these kinds of FPs (particularly eee and mmm) to occur following a silence, rather than preceding one, is exhibited. Analysis of the intonation of the "real" FPs identified is still ongoing; however, a preliminary characterisation of the intonation of such forms is one involving a long period of level pitch around the middle of the speaker's pitch range (see also Vella et al., 2009).

A number of phenomena other than "real" FPs also occur in the data. The most important of these is a highly frequent class of forms involving "quasi-lexical" vocalisations such as m-hm and eħe/aħa/ija/iwa, which tend to have clear meanings (perhaps similar to Cruttenden's 1997:175 "intonational idioms"). The form m-hm is particularly worthy of note. This was originally transcribed as mhm in the data annotated. The main reason for the use of the hyphen in the amended annotations is that this form is very different phonetically from the "real" FPs described above in that it is a two-syllable vocalisation having a specific intonational form consisting of a "stylised" rise in pitch from relatively low, level F0 on the first syllable, to higher, but still level F0 on the second syllable. The hyphenated form m-hm was thought to better mirror the characteristics of this vocalisation, thus rendering the orthographic annotation more immediately transparent to the reader.

M-hm parallels neatly with informal renderings of iva 'yes' such as ija and iwa, as well as with the more frequent eħe, in having a significant "backchannelling" function (see Savino & Vella, forthcoming). Two further short expressions are annotated in the FP tier: ta and ew. The former, very common in everyday conversation, but only found four times in the data annotated, is described by Aquilina as "short for taf, you know", the latter as an "occasional variant of jew" (1990:1382; 1987:290). Although the status of both these vocalisations as "quasi-lexical" is unclear, they are mentioned here since they may share with m-hm the function of backchannelling mentioned earlier.

A third class of forms, in this case vocalisations which seem to be similar to the "ideophones and interjections" category listed by Borg & Azzopardi-Alexander (1997), also occur in the data. The latter include forms which have been annotated in our data as follows: fff, eqq, ħeqq and ttt. The latter is described by Borg & Azzopardi-Alexander (1997:338) as an "alveolar click ttt [!]...commonly used to express lack of agreement with an interlocutor" and in fact it is this use that is attested in the data annotated, rather than a use indicating disapproval, often transcribed orthographically as tsk, and involving repetition of the click (see Mifsud & Borg, 1997).

3.1.4. Other tiers
The TIs tier indicates the presence, within the text, of Map Task target items. Target items included in the Map Task allow comparability across speakers and contain solely sonorant elements to allow for better pitch tracking, e.g. l-AmoRIN in Sqaq l-Amorin 'Budgerigar Street', l-Ewwel ta' MEJju in Triq l-Ewwel ta' Mejju 'First of May Street' and Amery in Triq Amery 'Amery Street'. Target items were carefully selected to represent different syllable structure and stress possibilities in Maltese (see Vella & Farrugia 2006). The TIs tier is extremely useful to the phonetician using PRAAT as her/his main analysis tool since it increases searchability. It is not of great importance to the computational linguist, however, and will therefore not be considered further here.

As the tier name suggests, MISC contains miscellaneous information of different sorts. One example of a feature which gets recorded in the MISC tier is the case of inaudibly released final stops. Words ending in such plosives have been transcribed in standard orthography, a note also being added in the MISC tier to say that a final plosive had been inaudibly released. Cases of vowel coalescence, particularly at word boundaries, are also transcribed as in standard orthography, a note once more being inserted in the MISC tier to this effect. Other features noted in this tier include unexpected vowel quality realisations and idiosyncratic pronunciations including unusual use of aspiration, devoicing etc.; various "normal dysfluency"-type phenomena such as interruptions, abandoning of words, trailing off, unclear stretches of speech; also voice quality features such as creak and non-linguistic elements such as noise.

The contents of this tier are transcriber-dependent to an extent which is not fully desirable. Some general observations on dealing with features such as those in the MISC tier will be made in Section 4.

4 It should be noted however that there may be an idiosyncratic element to these particular forms given that all instances of ehhs and ehms noted in the data come from the same speaker.
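The durational comparisons discussed in Subsections 3.1.2 and 3.1.3 (breaks vs. pauses, and the e/ee/eee labels) can be computed directly from the interval times once they are exported from the TextGrids. The following is a minimal sketch which assumes the intervals have already been dumped to a simple tab-separated file of tier, label, start and end times; the file name and column layout are assumptions, not the project's actual export format.

import csv
from collections import defaultdict

def mean_durations(path, tier_name):
    # Group interval durations by label for one tier of the exported annotation.
    durations = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 4:
                continue
            tier, label, start, end = row
            if tier == tier_name and label.strip():
                durations[label].append(float(end) - float(start))
    return {label: sum(d) / len(d) for label, d in durations.items()}

# Hypothetical export file produced from the PRAAT TextGrids:
for label, mean in sorted(mean_durations("MC_CB_C1_intervals.tsv", "Br-Pa-Os").items()):
    print(f"{label}\tmean duration {mean:.3f} s")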

Only rudimentary guidelines are available to date in the case of the three prosodic tiers Tone, Prom and Functions, for which the following outline will suffice. The annotation in these tiers is intended as a means of furthering research on Maltese prosody: issues such as the relationship between perceived stress and intonation, and the nature of the intonation patterns typical of Maltese and their distribution relative to discourse functions such as those involved in initiating a conversation, issuing an instruction, etc.

An important aim of this analysis is that of developing an adaptation for Maltese of the Tone and Break Indices (ToBI) framework for the annotation of prosody, in the tradition of recent work in this area (see for example Silverman et al., 1992). Such an adaptation will impact on the development of standards for a Break Indices component, something which it is hoped the study of phonological boundaries and boundary strength mentioned above in 3.1.2 will in fact input into. A B(reak) I(ndices) tier does not in fact yet feature in the annotation carried out. Annotation of data using preliminary ToBI-style standards for Maltese based on the analysis of Maltese intonation carried out within the Autosegmental-Metrical framework (Pierrehumbert 1990; Ladd, 2008) by Vella (see, for example, 1995, 2003, 2007, 2009a, 2009b) should also contribute to further consolidation of the phonological analysis of Maltese prosody.

Typically, annotation begins in the Prom tier, to identify perceived prominence of accented and/or stressed syllables. With a stretch of speech thus highlighted as important in some way, related intonation patterns on the Tone tier, as well as discourse features on the Functions tier, can then be annotated. The decision to include a Prom tier is based on work by Grabe (2001) on the annotation of the IViE (Intonational Variation in English) corpus, which specifies how pitch movement is "anchored" to syllables marked as prominent. The Tone tier then describes tones in terms of the way these link to identified prominent syllables and boundaries. This feature of the annotations should prove useful in the case of Maltese since a distinction between different degrees of prominence at the Prom tier may make it possible to account not only for more common "pitch accent"-type phenomena (see, e.g., Bolinger, 1958), but also for phenomena of the so-called "phrase accent"-type identified for Maltese (see Vella, 2003; following Grice, Arvaniti & Ladd 2000). Annotation at the level of prosody is currently underway.

A further tier, the Functions tier, contains information relating to discourse features as detailed by Carletta et al. (1995) in the coding of the HCRC Map Task Corpus. Their system describes typical features of turn-taking in conversation such as "initiating moves" like INSTRUCT or EXPLAIN and "response moves" like ACKNOWLEDGE or CLARIFY.

3.3 Alignment of tiers
As mentioned earlier in this Section (see Subsection 3.1.1), an important feature of PRAAT-style annotations is that involving the time-alignment of the waveform information to information in other tiers. Thus, the orthographic annotation of the spoken data goes hand in hand with the word-by-word segmentation in such a way that also allows information such as the starting time and ending time, and consequently the duration, of each "segment" to be captured. Thus, the information in the SP1 and SP2 tiers in particular, but more generally that in the separate tiers of the annotation, involves time-alignment either of particular intervals or of particular points in the waveform to the information in the other tiers of the annotation. This information, which is viewed in PRAAT as shown in Figure 1, is an extremely useful feature for the purposes of analysis. However, it poses a number of problems when it comes to incorporating the information from PRAAT TextGrid annotations into a corpus composed mainly of texts of a written form, and it is this issue which will be discussed in the next Section.

4. Preparing SPeech ANnotations for integration into corpora of Maltese
As mentioned above, TextGrid annotations, whilst useful to phoneticians, do not necessarily allow for straightforward incorporation into corpora consisting mainly of written texts. The annotations, though stored in .txt format, contain information which is as such "redundant" for corpus linguistics. The TextGrid information relevant to the short excerpt shown in Figure 1 has been extracted from the relevant TextGrid and presented in the Appendix. The information entered into each interval labelled is listed together with an indication of start time and end time by tier. In the case of point tiers – such as the Tone tier in our annotations, which is however not illustrated in Figure 1 – any label inputted into the TextGrid is listed together with its position in time.

Having produced the TextGrid annotations using PRAAT, it was considered necessary to establish procedures for exporting the information of relevance for the incorporation of samples of spoken Maltese in a predominantly written corpus of Maltese. The desired outcome is machine-readable text containing not only the orthographic transcriptions relevant to the contributions of the speakers in the dialogue in the form of a playscript, but also any information from the PRAAT annotations which would be useful for processing the spoken "texts" in line with principles established also for the written ones. For ease of reference, a conventional playscript-type transcript of the excerpt shown in Figure 1 is given below:

SP1: M-hm.
SP2: M-hm?
SP1: (Overlapping) Għaddi minn bej' Sqaq il-Merill...u Triq Marmarà. Ibqa' tielgħa.
(Go between Merill Alley...and Marmarà Street. Keep moving upwards.)

87/119
Using a PRAAT script called GetLabelsOfIntervals_WithPauses (Lennes), it is possible to reduce the information shown in the Appendix as in Table 2 below:

Speaker   Words in sequence   Pause duration
                              (0.23 s)
SP1:      M-hm.
                              (0.22 s)
SP2:      M-hm?
                              (-0.11 s)
SP1:      Għaddi
          Minn
          bej'
          Sqaq
          il-Merill
                              (0.20 s)
          U
          Triq
          Marmarà.
                              (0.24 s)
          Ibqa'
          tielgħa.

Table 2: Contents of .txt file following extraction of information from PRAAT TextGrid
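The step from time-aligned intervals to the words-in-sequence listing of Table 2 can be illustrated with a short script. The following is a minimal Python sketch rather than the Lennes PRAAT script itself; it assumes the labelled intervals of the speaker tiers have already been read into (speaker, word, start, end) tuples (the values below are taken from the Appendix), and it computes the gap between consecutive intervals, a positive gap corresponding to a pause and a negative one to an overlap.

# Minimal sketch (not the Lennes PRAAT script): derive a Table 2-style
# listing of words and pause/overlap durations from labelled intervals.
# The input format is an assumption: (speaker, word, xmin, xmax) tuples
# taken from the speaker tiers of a TextGrid, with empty labels removed.

intervals = [
    ("SP1", "M-hm.", 18.943, 19.261),
    ("SP2", "M-hm?", 19.480, 19.757),
    ("SP1", "Għaddi", 19.643, 19.894),
    ("SP1", "minn", 19.894, 20.065),
]

def table_rows(intervals):
    rows = []
    previous_speaker, previous_end = None, None
    for speaker, word, start, end in sorted(intervals, key=lambda item: item[2]):
        if previous_end is not None:
            gap = start - previous_end
            # positive gap = silent pause, negative gap = overlapping speech;
            # contiguous intervals (gap of roughly zero) produce no pause row
            if abs(gap) > 0.001:
                rows.append(("", "", f"({gap:.2f} s)"))
        # repeat the speaker label only when the turn changes
        label = f"{speaker}:" if speaker != previous_speaker else ""
        rows.append((label, word, ""))
        previous_speaker, previous_end = speaker, end
    return rows

for speaker, word, pause in table_rows(intervals):
    print(f"{speaker:<5} {word:<12} {pause}")

Contiguous intervals within a fluent stretch yield a gap of approximately zero and therefore no pause row, which reproduces the behaviour seen in Table 2.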
Exporting selected information from the PRAAT TextGrids as shown above makes it possible for some important features of the annotation carried out to be retained in text which, unlike TextGrids in their raw form, is easy to incorporate into a corpus composed mainly of written texts. Although improvements on this preliminary attempt at exporting the data can be envisaged, the output of the script used can already be seen to contain a number of useful features. One of these is the fact that corpus-processing tasks such as paragraph (the spoken equivalent of which would be the utterance) and sentence splitting, as well as tokenisation, would seem to be relatively straightforward tasks given the word-by-word segmentation and annotation in the original format. POS tagging would need to proceed on the same lines as in the case of written texts.

A second very important feature of the output of the script is that it captures information on both pause and overlap, a very significant feature of speech which is completely absent from written texts. These two features are recorded in the right hand column of the output, a positive value indicating a pause, a negative value overlap. It should be relatively straightforward for a script to be developed which will allow for such information to be converted into the appropriate tags.
5. The encoding of features of spoken Maltese

Given conversion of PRAAT annotations in a way similar to that described in Section 4 above, there remain few obstacles to overcome. Assuming some kind of mark-up similar to that used in the BNC, and an element <u> (utterance) corresponding to the written text <p> (paragraph) element, grouping a sequence of <s> (sentence) elements (BNC User Reference Guide), the short dialogue above could be encoded as follows:

<u who="SP1">
<s...>
<pause dur=0.23s>
<w..."M-hm">
<c c5="PUN">.</c>
<u who="SP2">
<w..."M-hm">
<c c5="PUN">?</c>
<u who="SP1">
<s...>
<overlap dur=-0.11s>
<w..."Għaddi">
<w..."minn">
<w..."bejn">
<w..."Sqaq">
<w..."il-Merill">
<pause dur=0.20s>
<w..."u">
<w..."Triq">
<w..."Marmarà">
<c c5="PUN">.</c>
<s...>
<pause dur=0.24s>
<w..."Ibqa'">
<w..."tielgħa">
<c c5="PUN">.</c>

The above demonstrates that the output of the script used here already goes a long way towards helping us accomplish our purpose.
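To make the conversion step concrete, the short Python sketch below turns rows of the Table 2-style export into markup of the kind just shown. It is illustrative only: the element names (<u>, <w>, <c>, <pause>, <overlap>) follow the example above, but the attribute inventory and the treatment of <s> boundaries are simplified assumptions rather than a finalised encoding scheme for spoken Maltese.

# Illustrative sketch: convert rows of the Table 2-style export into
# BNC-like markup. Element and attribute choices mirror the example
# above but are not a finalised encoding for spoken Maltese.

rows = [
    ("SP1", "M-hm.", None),
    ("SP2", "M-hm?", 0.22),      # pause (in seconds) before this word
    ("SP1", "Għaddi", -0.11),    # negative value = overlap
    ("SP1", "minn", None),
]

def encode(rows):
    lines, current_speaker = [], None
    for speaker, word, gap in rows:
        if speaker != current_speaker:
            lines.append(f'<u who="{speaker}">')
            current_speaker = speaker
        if gap is not None:
            tag = "overlap" if gap < 0 else "pause"
            lines.append(f'  <{tag} dur={gap}s>')
        # strip sentence-final punctuation into a separate <c> element
        if word[-1] in ".?!":
            lines.append(f'  <w>{word[:-1]}</w>')
            lines.append(f'  <c c5="PUN">{word[-1]}</c>')
        else:
            lines.append(f'  <w>{word}</w>')
    return "\n".join(lines)

print(encode(rows))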
Some of the elements in the text, e.g. use of <...> to indicate elision could easily be adapted for the purposes of automatic tagging – the BNC suggests the use of an element <trunc> in such cases. Information relating to other elements such as <unclear> (entered in the MISC tier in the current TextGrid annotations), could also be retrieved, albeit possibly in a less straightforward fashion.

One element of particular interest in the context of this paper is the element <vocal>. The BNC User Reference Guide describes this as: "(Vocalized semi-lexical) any vocalized but not necessarily lexical phenomenon for example voiced pauses, non-lexical backchannels, etc.". One of the outcomes of the project SPAN is in fact a categorisation of different types of vocalisations as follows (see Vella et al., 2009):

1. "real" FPs such as eee, mmm and emm having an actual pause as their counterpart;
2. non-lexical vocalisations such as m-hm which parallel with quasi-lexical vocalisations such as eħe and with lexical words such as iva; and
3. "paralinguistic vocalisations" such as fff, ħeqq, ttt etc.

Given that, in all cases, the elements in these categories consist of a relatively closed set of items – which after all could be added to in cases of new items being identified – such elements should be relatively easy to identify on the basis of their orthographic rendering in the annotations, although one would need to assume proper training of transcribers to established standards and guidelines. It is being suggested here that vocalisations involving some element of meaning would be better tagged as "words", thus leaving the element "vocalisations" as a means of recording events of a purely non-linguistic nature. A full categorisation of different events of this sort, as well as of other vocalisations of a "quasi-lexical" nature still awaits research.
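Because these inventories are closed sets, a first-pass identification can be a simple lookup. The sketch below is a hedged Python illustration: the three sets contain only the examples cited above, not the full inventories compiled in the SPAN project, and the decision to treat meaning-bearing vocalisations as "words" simply follows the suggestion made in the preceding paragraph.

# Illustrative only: tag tokens against small closed sets of vocalisations.
# The sets below contain just the examples cited in the text; the full
# SPAN inventories would be larger and orthographically standardised.

FILLED_PAUSES = {"eee", "mmm", "emm"}
NON_LEXICAL = {"m-hm", "eħe"}            # treated as "words" (they carry meaning)
PARALINGUISTIC = {"fff", "ħeqq", "ttt"}  # purely non-linguistic events

def tag_token(token):
    t = token.lower().strip(".?!")
    if t in FILLED_PAUSES:
        return ("vocal", "filled pause")
    if t in PARALINGUISTIC:
        return ("vocal", "paralinguistic")
    if t in NON_LEXICAL:
        return ("w", "non-lexical backchannel")
    return ("w", "lexical")

for token in ["M-hm.", "emm", "ħeqq", "Għaddi"]:
    print(token, "->", tag_token(token))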
6. Conclusion

In conclusion, standards and guidelines for the orthographic annotation, as well as preliminary standards for the annotation of prosody, of spoken Maltese, are in place. Exportation of TextGrid information to a format more readily incorporable into corpora of written data is also doable. However, possibilities for automating or semi-automating procedures of conversion need to be explored. Lastly, improved knowledge of the workings of these relatively-less-described features of Maltese should serve not only to improve the quality of HLTs such as Text-to-Speech systems for Maltese, but also to improve methodologies for the evaluation of such HLTs.

7. Acknowledgements

We would like to thank the University of Malta's Research Fund Committee (vote numbers 73-759 and 31-418) for making funding for the projects SPAN (1) and SPAN (2) available for the years starting January 2007 and January 2008.

8. References

Akkademja tal-Malti. (2004). Tagħrif fuq il-Kitba Maltija II. Malta: Klabb Kotba Maltin.
Aquilina, J. (1987/1990). Maltese-English Dictionary. Volumes 1 & 2. Malta: Midsea Books Ltd.
Boersma, P., Weenick, D. (2008). PRAAT: doing phonetics by computer. (Version 5.0.08). http://www.praat.org visited 11-Feb-08.
Bolinger, D. (1958). A theory of pitch accent in English. Word, 14, pp. 109--149.
Borg, A., Azzopardi-Alexander, M. (1997). Maltese. [Descriptive Grammars]. London/New York: Routledge.
British National Corpus, http://www.natcorp.ox.ac.uk/corpus/index.xml visited 20-Feb-10.
Carletta, J., Isard, A., Kowtko, J., Doherty-Sneddon, G., Anderson, A. (1995). The coding of dialogue structure in a corpus. In J.A. Andernach, S.P. van de Burat & G.F. van der Hoeven (Eds.), Proceedings of the Twentieth Workshop on Language Technology: corpus-based approaches to dialogue modelling, pp. 25--34.
Cruttenden, A. (1997). Intonation. 2nd edition. Cambridge: Cambridge University Press.
Dalli, A. (2001). Interoperable extensible linguistic databases. In Proceedings of the IRCS Workshop on Linguistics Databases, University of Pennsylvania, Philadelphia, pp. 74--81.
Gibbon, D., Moore, R., Winski, R. (1997). Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter.
Grabe, E. (2001). The IViE Labelling Guide. (Version 3). http://www.phon.ox.ac.uk/files/apps/IViE//guide.html# visited 12-Apr-10.
Grice, M., Ladd, D.R.L., Arvaniti, A. (2000). On the place of phrase accents in intonational phonology. Phonology, 17, pp. 143--185.
HCRC Map Task Corpus, http://www.hcrc.ed.ac.uk/maptask/ visited 10-Nov-09.
Kunsill tal-Malti, http://kunsilltalmalti.gov.mt/filebank/documents/Decizjonijiet1_25.07.08.pdf visited 15-Apr-10.
Ladd, D.R.L. (2008). Intonational Phonology. 2nd edition. Cambridge: Cambridge University Press.
Lennes, M. (2010). Mietta's Praat scripts, GetLabelsOfIntervals_WithPauses script. http://www.helsinki.fi/~lennes/praat-scripts/ visited 18-Jan-10.
Mifsud, M., Borg, A. (1997). Fuq l-Għatba tal-Malti. [Threshold Level in Maltese]. Strasbourg: Council of Europe Publishing.
Pierrehumbert, J. (1980). The Phonetics and Phonology of English Intonation. Ph.D. thesis, MIT.
Rosner, M., Caruana, J., Fabri, R. (1998). MaltiLex: a computational lexicon for Maltese. In Proceedings of the COLING-ACL Workshop on Computational Approaches to Semitic Languages. Morristown, NJ: Association for Computational Linguistics, pp. 97--101.
Rosner, M., Caruana, J., Fabri, R., Loughraieb, M., Montebello, M., Galea, D., Mangion, G. (2000). Linguistic and computational aspects of MaltiLex. In Proceedings of ATLAS: The Arabic Translation and Localization Symposium. PLACE, pp. 2--9.
Rosner, M. (2009). Electronic language resources for Maltese. In B. Comrie, R. Fabri, E. Hume, M. Mifsud & M. Vanhove (Eds.), Introducing Maltese Linguistics. Amsterdam: John Benjamins, pp. 251--276.
Savino, M., Vella, A. (Forthcoming). Intonational backchannelling strategies in Italian and Maltese Map Task dialogues.
Shriberg, E.E. (1999). Phonetic consequences of speech disfluency. In Proceedings of the 13th International Congress of Phonetic Sciences 1999, San Francisco, CA, pp. 619--622.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. In Proceedings of the 1992 International Conference on Spoken Language Processing. Banff, Canada, pp. 867--870.
Vella, A. (1995). Prosodic Structure and Intonation in Maltese and its Influence on Maltese English. Unpublished Ph.D thesis, University of Edinburgh.
Vella, A. (2003). Phrase accents in Maltese: distribution and realisation. In Proceedings of the 15th International Congress of Phonetic Sciences. Barcelona, pp. 1775--1778.
Vella, A. (2007). The phonetics and phonology of wh-question intonation in Maltese. In Proceedings of the 16th International Congress of Phonetic Sciences. Saarbrücken, pp. 1285--1288.
Vella, A. (2009a). Maltese intonation and focus structure. In R. Fabri (Ed.), Maltese Linguistics: A Snapshot. In Memory of Joseph A. Cremona. [Il-Lingwa Tagħna Vol. 1]. Bochum: Niemeyer, pp. 63--92.
Vella, A. (2009b). On Maltese prosody. In B. Comrie, R. Fabri, E. Hume, M. Mifsud & M. Vanhove (Eds.), Introducing Maltese Linguistics. Amsterdam: John Benjamins, pp. 47--68.
Vella, A., Farrugia, P-J. (2006). MalToBI – building an annotated corpus of spoken Maltese. In Proceedings of Speech Prosody 2006, Dresden.
Vella, A., Chetcuti, F., Grech, S., Spagnol, M. (2008). SPeech ANnotation: developing guidelines for spoken corpora. Paper given at the 1st International Conference of Maltese Linguistics, Bremen.
Vella, A., Chetcuti, F., Grech, S., Spagnol, M. (2009). Interspeaker variation in the type and prosody of filled pauses in Maltese. Paper given at the 2nd International Conference of Maltese Linguistics, Bremen.
Yngve, V. (1970). On getting a word in edgewise. Papers from the Sixth Regional Meeting, Chicago Linguistic Society, Chicago, pp. 567--577.

Appendix

TextGrid information for sample excerpt in Figure 1. The sequences of left and right angled brackets indicate the positions at which information from the different
tiers of the original TextGrid was removed in order to make it possible for time-aligned information from the various tiers to be presented here.

File type = "ooTextFile" text = "Triq" xmin = 24.988121382607528 name = "TIs"


Object class = "TextGrid" intervals [34]: xmax = 25.060897544296406 <<<<<<<<<<>>>>>>>>>>
xmin = 21.521897831205 text = "" points [7]:
xmin = 0 xmax = 22.091024710212217 intervals [16]: time = 20.231585573720867
xmax = 225.7226530612245 text = "Marmar\a`." xmin = 25.060897544296406 mark = "TB8"
tiers? <exists> intervals [35]: xmax = 25.539234518515816 points [8]:
size = 6 xmin = 22.091024710212217 text = "Triq" time = 21.030082586632584
item []: xmax = 22.335863155310836 intervals [17]: mark = "TE8"
item [1]: text = "" xmin = 25.539234518515816 points [9]:
class = "IntervalTier" intervals [36]: xmax = 26.066892216144574 time = 21.338326610341518
name = "SP1" xmin = 22.335863155310836 text = "" mark = "TB17"
<<<<<<<<<<>>>>>>>>>> xmax = 22.584860481468173 intervals [18]: points [10]:
intervals [24]: text = "Ibqa'" xmin = 26.066892216144574 time = 22.091024710212217
xmin = 18.943357990906726 intervals [37]: xmax = 26.63328825195686 mark = "TE17"
xmax = 19.261373640829614 xmin = 22.584860481468173 text = "Marmara`." <<<<<<<<<<>>>>>>>>>>
text = "M-hm." xmax = 22.997308645192785 <<<<<<<<<<>>>>>>>>>> item [5]:
intervals [25]: text = "tielgħa." item [3]: class = "IntervalTier"
xmin = 19.261373640829614 <<<<<<<<<<>>>>>>>>>> class = "IntervalTier" name = "FPs"
xmax = 19.643059096210124 item [2]: name = "Br-Pa-Os" <<<<<<<<<<>>>>>>>>>>
text = "" class = "IntervalTier" <<<<<<<<<<>>>>>>>>>> intervals [2]:
intervals [26]: name = "SP2" intervals [21]: xmin = 18.943357990906726
xmin = 19.643059096210124 <<<<<<<<<<>>>>>>>>>> xmin = 18.71206799854182 xmax = 19.261373640829614
xmax = 19.89371820123613 intervals [8]: xmax = 18.943357990906726 text = "FP"
text = "Għaddi" xmin = 18.71206799854182 text = "Pause-C" intervals [3]:
intervals [27]: xmax = 19.48013067794322 intervals [22]: xmin = 19.261373640829614
xmin = 19.89371820123613 text = "" xmin = 18.943357990906726 xmax = 19.48013067794322
xmax = 20.065239076749062 intervals [9]: xmax = 19.643059096210124 text = ""
text = "minn" xmin = 19.48013067794322 text = " " intervals [4]:
intervals [28]: xmax = 19.756995053040125 intervals [23]: xmin = 19.48013067794322
xmin = 20.065239076749062 text = "M-hm?" xmin = 19.643059096210124 xmax = 19.756995053040125
xmax = 20.231585573720867 intervals [10]: xmax = 19.756995053040125 text = "FP"
text = "bej'" xmin = 19.756995053040125 text = "Overlap" <<<<<<<<<<>>>>>>>>>>
intervals [29]: xmax = 23.648962491299642 intervals [24]: item [6]:
xmin = 20.231585573720867 text = "" xmin = 19.756995053040125 class = "IntervalTier"
xmax = 20.49279384475206 intervals [11]: xmax = 21.030082586632584 name = "MISC"
text = "Sqaq" xmin = 23.648962491299642 text = "" <<<<<<<<<<>>>>>>>>>>
intervals [30]: xmax = 24.163953016171252 intervals [25]: intervals [4]:
xmin = 20.49279384475206 text = "Sqaq" xmin = 21.030082586632584 xmin = 19.261373640829614
xmax = 21.030082586632584 intervals [12]: xmax = 21.225530013079816 xmax = 19.48013067794322
text = "il-Merill" xmin = 24.163953016171252 text = "Break" text = "Aspiration”
intervals [31]: xmax = 24.812517158406998 intervals [26]:
xmin = 21.030082586632584 text = "il-Merill" xmin = 21.225530013079816
xmax = 21.225530013079816 intervals [13]: xmax = 22.091024710212217
text = "" xmin = 24.812517158406998 text = ""
intervals [32]: xmax = 24.9184775982589 intervals [27]:
xmin = 21.225530013079816 text = "" xmin = 22.091024710212217
xmax = 21.338326610341518 intervals [14]: xmax = 22.335863155310836
text = "u" xmin = 24.9184775982589 text = "Pause"
intervals [33]: xmax = 24.988121382607528 <<<<<<<<<<>>>>>>>>>>
xmin = 21.338326610341518 text = "u" item [4]:
xmax = 21.521897831205 intervals [15]: class = "TextTier"

A Web Application for Dialectal Arabic Text Annotation
Yassine Benajiba and Mona Diab
Center for Computational Learning Systems
Columbia University, NY, NY 10115
{ybenajiba,mdiab}@ccls.columbia.edu
Abstract
Design and implementation of an application which allows many annotators to annotate data and enter the information into a central
database is not a trivial task. Such an application has to guarantee a high level of security, consistent and robust back-ups for the underly-
ing database, and aid in increasing the speed and efficiency of the annotation by providing the annotators with intuitive GUIs. Moreover
it needs to ensure that the data is stored with a minimal amount of redundancy in order to simultaneously save all the information while
not losing on speed. In this paper, we describe a web application which is used to annotate many Dialectal Arabic texts. It aims at
optimizing speed, accuracy and efficiency while maintaining the security and integrity of the data.

1. Introduction jazeera1 where the speaker explains the situation of women


Arabic is spoken by more than 300 million people in the who are film makers in the Arab world. The DA word se-
world, most of them live in Arab countries. However quences are circled in red and the rest of the text is in MSA.
the form of the spoken language varies distinctly from the In (Habash et al., 2008), the authors show that in broadcast
written standard form. This phenomenon is referred to as conversation, DA data represents 72.3% of the text. In we-
diglossia (Ferguson, 1959). The spoken form is dialectal blogs, the amount of DA is even higher depending on the
Arabic (DA) while the standard form is modern standard domain/topic of discussion where the entire text could be
Arabic (MSA). MSA is the language of education in the written in DA.
Arab world and it is the language used in formal settings, Language used in social media pose a challenge for NLP
people vary in their degree of proficiency in MSA, how- tools in general in any language due the difference in genre.
ever it is not the native tongue of any Arab. MSA is shared Social media language is more akin to speech in nature and
across the Arab world. DA, on the other hand, is the every- people tend to be more loose in their writing standards. The
day language used in spoken communication and is emerg- challenge arises from the fact that the language is less con-
ing as the form of Arabic used in web communications (so- trolled and more speech like where many of the textually
cial media) such as blogs, emails, chats and SMS. DA is a oriented NLP techniques are tailored to processing edited
pervasive form of the Arabic language, especially given the text. The problem is exacerbated for Arabic writing found
ubiquity of the web. on the web because of the use of DA in these genres. DA
DA varies significantly from region to region and it varies writing lacks orthographic standards, on top of the other
also within a single country/city depending on so many fac- typical problems associated with web media language in
tors including education, social class, gender and religion. general of typographical errors and lack of punctuation.
But of more relevance to our object of study, from a natural Figure 2 shows a fully DA text taken from an Arabic
language processing (NLP) perspective, DA varieties vary weblog2 .
significantly from MSA which poses a serious impediment
for processing DA with tools designed for MSA. The fact is Our Cross Lingual Arabic Blog Alerts (COLABA) project
that most of the robust tools designed for the processing of aims at addressing these gaps both on the resource creation
Arabic to date are tailored to MSA due to the abundance of level and the building of DA processing tools. In order to
resources for that variant of Arabic. In fact, applying NLP achieve this goal, the very first phase consists of gathering
tools designed for MSA directly to DA yields significantly the necessary data to model:
lower performance (Habash et al., 2008; Benajiba et al., • Orthographic cleaning and punctuation restoration
2008) making it imperative to direct research to building (mainly sentence splitting);
resources and dedicated tools for DA processing.
• Dialect Annotation;
DA lack large amounts of consistent data due to several fac-
tors: the lack of orthographic standards for the dialects, the • Lemma Creation;
lack of overall Arabic content on the web, let alone DA con- • Morphological Profile Creation.
tent. Accordingly there is a severe deficiency in the avail-
ability of computational annotations of DA data. Across all these tasks, we have designed a new phonetic
Any serious attempt at processing real Arabic has to ac- scheme to render the DA in a conventionalized internal
count for the dialects. Even broadcast news which is sup- orthographic form details of which are listed in (Diab
posed to be MSA has non-trivial DA infiltrations. In broad- et al., 2010b). We believe that creating a repository of
cast news and talk shows, for instance, speakers tend to
1
code switch between MSA and DA quite frequently. Fig- http://www.aljazeera.net/
2
ure 1 illustrates an example taken from a talk show on Al- http : //www.paldf.net/f orum/
Figure 1: An illustrating example of MSA - DA code switching.

Figure 2: An illustrating example of a DA text on a weblog.

consistent annotated resources allows for the building of allow the lead annotators to assign different tasks to
applications such as Information Retrieval, Information different annotators at different times, help them trace
Extraction and Statistical Machine Translation on the DA the annotations already accomplished, and should al-
data. In this project, we have targeted four Arabic Dialects, low them to give illustrative constructive feedback
namely: Egyptian, Iraqi, Levantine, and Moroccan. And from within the tool with regards to the annotation
the harvested data is on the order of half a million Arabic quality.
blogs. The DA data is harvested based on manually
created queries in the respective dialects as well as a list of Even though many of these annotation tools, such as
compiled dialect specific URLs. Once the data is harvested GATE(Damljanovic et al., 2008; Maynard, 2008; Aswani
it is automatically cleaned from metadata and the content and Gaizauskas, 2009), Annotea(Kahan et al., 2001) and
part is prepared for manual annotation. MnM(Vargas-Vera et al., 2002) among others, have proven
The application that we present in this paper, successful in serving their intended purposes, none of them
COLANN GUI, is designed and implemented in the was flexible enough for being tailored to the COLABA
framework of the COLABA project. goals.
COLANN GUI is the interface used by the annota- The remainder of this paper is organized as follows: We
tors to annotate the data with the relevant information. give an overview of the system in Section 2.; Section 3. il-
COLANN GUI uses two different servers for its front-end lustrates the detailed functionalities of the application; Sec-
and back-end components. It also allows many annotators tion 4. describes each of the annotation tasks handled by
to access the database remotely. It offers several views de- the application; We give further details about the database
pending on the type of user and the annotation task assigned in Section 5. and finally, some future directions are shared
to an annotator at any given time. The decision to develop in Section 6..
an annotation application in-house was taken after unsuc-
cessfully trying to find an off-the-shelf tool which can offer 2. Overall System View
the functionalities we are interested in. Some of these func- COLANN GUI is a web application. We have chosen
tionalities are: such a set up, in lieu of a desktop one, as it allows us to
• Task dependency management: Some of the annota- build a machine and platform independent application.
tion tasks are dependent on each other whereas oth- Moreover, the administrator (or super user) will have
ers are completely detached. It is pretty important in to handle only one central database that is multi-user
our tasks to be able to manage the annotation tasks compatible. Furthermore, the COLANN GUI is browser
in a way to keep track of each word in each sentence independent, i.e. all the scripts running in the background
and organize the information entered by the annota- are completely browser independent hence allowing all
tor efficiently. It is conceivable that the same word the complicated operations to run on the server side only.
could have different annotations assigned by different COLANN GUI uses PHP scripts to interact with the server
annotators in different tasks whereas most the avail- database, and uses JavaScripts to increase GUI interactivity.
able tools do not have the flexibility to be tailored is
such fashion; and Safety and security are essential issues to be thought of
when designing a web application. For safety considera-
• Annotators’ management: the tool should be able to tions, we employ a subversion network (SVN) and auto-
matic back-up servers. For security considerations we orga- 4. An annotator could check the speed of others (anony-
nize our application in two different servers, both of which mously and randomized) on a specific task once they
is behind several firewalls (see Figure 3). submit their own

3. COLANN GUI: A Web Application 5. View annotations shared with them by the Lead Anno-
tator
As an annotation tool, we have designed COLANN GUI
with three types of users in mind: Annotators, Lead
Annotators, and Super User. The design structure of 4. Annotation Tasks
COLANN GUI aims to ensure that each annotator is work- A detailed description of the annotation guidelines goes be-
ing on the right data at the right time. The Super User and yond the scope of this paper. The annotation guidelines are
Lead Annotator views allow for the handling of organiza- described in detail in (Diab et al., 2010b). We enumerate
tional tasks such as database manipulations, management the different annotation tasks which our application pro-
of the annotators as well as control of in/out data opera- vides. All the annotation tasks can only be performed by
tions. a user of category Annotator or Lead Annotator for the cre-
Accordingly, each of these different views is associated ation of the gold evaluation data. In all the tasks, the anno-
with different types of permissions which connect to the tator is asked to either save the annotation work, or submit
application. it. If saved they can go back and edit their annotation at a
later time. Once the work is submitted, they are not allowed
3.1. Super User View to go back and edit it. Moreover, the annotators always have
The Super User has the following functionalities: direct access to the relevant task guidelines from the web in-
terface by pressing on the information button provided with
1. Create, edit and delete tables in the database each task.
2. Create, edit and delete lead accounts The annotation tasks are described briefly as follows:

3. Create, edit and delete annotator accounts 1. Typo Identification and Classification and Sentence
Boundary Detection: The annotator is presented with
4. Check the status of the annotation tasks for each anno- the raw data as it is cleaned from the meta data but as it
tator would have been present on web. Blog data is known
5. Transfer the data which needs to be annotated from to have all kinds of speech effects and typos in addi-
text files to the database tion to a severe lack of punctuation.
Accordingly, the first step in content annotation is to
6. Generate reports and statistics on the underlying identify the typos and have them classified and fixed,
database in addition have sentence boundaries identified.
7. Write the annotated data into XML files The typos include: (i) gross misspellings: it is rec-
ognized that DA has no standard orthography, how-
3.2. Lead Annotator View ever many of the words are cognates/homographs with
The Lead Annotator view shares points 3 and 4 of the Su- MSA, the annotator is required to fix misspelling of
per User view. In addition, this view has the following ad- such words if they are misspelled for example Yg. A‚ÖÏ @
ditional functionalities: AlmsAj, “the mosques” would be fixed and re-entered
as Yg. A‚ÖÏ @ AlmsAjd; (ii) speech effects: which consists
1. Assign tasks to the annotators of rendering words such as “Goaaaal” to “Goal”; and
2. Check the annotations submitted by the annotators (iii) missing spaces. The annotator is also asked to
specify the kind of typo found. Figure 4 shows a case
3. Communicate annotation errors to the annotators where the annotator is fixing a “missing space” typo.
The following step is sentence boundary detection.
4. Create gold annotations for samples of the assignment
This step is crucial for many of the language tools
tasks for evaluation purposes. Their annotations are
which cannot handle very long sequences of text, e.g.
saved as those of a special annotator
syntactic parsers. In order to increase the speed and
5. Generate inter-annotator agreement reports and other efficiency of the annotation, we make it possible to
types of relevant statistics on the task and annotator indicate a sentence boundary by clicking on a word
levels in the running text. The sequence of words is simply
split at that click point. The annotator can also de-
3.3. Annotator View cide to merge two sequences of words by clicking at
The annotator view has the following functionalities: the beginning of a line and it automatically appends
the current line to the previous one. It is worth noting
1. Check status of his/her own annotations that all the tasks that follow depend on this step be-
2. Annotate the assigned units of data ing completed. Once this task is completed, the data is
sent to a coarse grained level of dialect identification
3. Check the overall stats of other annotators’ work for (DI) pipeline described in detail in (Diab et al., 2010a).
comparative purposes The result of this DI process is the identification of the
Figure 3: Servers and views organization.

Figure 4: Typo Identification and Fixing.

problem words and sequences that are not recognized scale. Finally, they are required to provide the pho-
by our MSA morphological analyzer, i.e. the words netic transcription of word as specified in our guide-
don’t exist in our underlying dictionaries.3 lines on rendering DA in the COLABA Conventional
Orthography (CCO).
2. Dialect annotation: For each word in the running text
(after the content cleaning step mentioned before), the The GUI at this point only allows the annotator to sub-
annotator is asked to specify its dialect(s) by picking mit his/her annotation work when all the words in the
from a drop down menu. Moreover they are requested text are annotated. The annotators are given the option
to choose the word’s level of dialectalness on a given to mark a word as unknown.

3 Another functionality that we have added in order to


It is important to note that we run the data through the mor-
phological analyzer as opposed to matching against the underly- help the annotators speed up their annotation in an ef-
ing dictionary due to the fact the design decision we made early on ficient way is a color coding system for similar words.
that our dictionaries will have lemmas and rules associated with If the annotator enters the possible dialects, relevant
them rather than exhaustively listing all possible morphological annotation, and the phonetic CCO transliteration for
forms which could easily be in the millions of entries. a surface word wi . The annotated words change color
to red. This allows the annotator to know which words us the ability to access them easily with their recurrent
have already been annotated by simply eye balling the examples which are in turn identified uniquely by sen-
words colored in red in the overall document undergo- tence number and document number. For instance, let
ing annotation. Second, the script will look for all the us consider all the sentences where the surface word
words in the text which have the same surface form as éJ.»QÓ, mrkbh, appears in our data. For illustration in
wi , i.e. all instances/occurrences of annotated wi , and this paper, we provide the English translation and the
it will color each of these instances in blue. The an- Buckwalter transliteration, however in the actual inter-
notator then, can simply skip annotating these words face the annotators only see the surface DA word and
if s/he judges them to have the same annotation as the associated examples in Arabic script as they occur in
original, so it ends up being a revision rather than a the data, but after being cleaned up from meta data,
new annotation. It is easy to understand how this sim- html mark up, typos are fixed and sentence boundaries
ple change of color coding can facilitate the annota- identified.
tion job and increase the efficiency of the annotation
process by an example. In a long Arabic blog text,
frequent function words such as Ó  , m$, “not”, will ... è@@@@@@@@AÖß éJ.»QÓ é<Ë@ð PAJÊK. úGñ¢ªK ñË Aë Aë Aë Aë
only need to be annotated once. Buckwalter Transliteration: hA hA hA hA lw
Figure 6 shows an illustrating example of the dialect yEtwny blyAr wAllh mrkbh ymAAAAAAAAh ...
annotation process via a screenshot of the task. English Gloss: hahaha even if they give me a billion
I wouldn’t ride it muuuuuum
3. Lemma Creation: In this task, the annotators are ¨A®KP@ I. ‚. àAJ.K X ø X@ð úÍ@ éJ.»QÓ ¬Qm.' @...
asked to provide the underlying lemma forms (cita- Buckwalter Transliteration: Anjrf mrkbh AlY
tion forms) for surface DA words. The lemmas con- wAdy *ybAn bsbb ArtfAE ...
stiture the dictionary entry forms in our lexical re- English Gloss: his boat drifted to the Dhibane valley
sources. The resource aims to have a large repository because of the increase of the level ...
of DA lemmas and their MSA and English equivalents éKñºÊJ.ÊJ¯ éJ.»QÓ ú×Aƒ èX
as well as DA example usages as observed in the blog
Buckwalter Transliteration: dh sAmy mrkbh
data in the COLABA project. Accordingly, the anno-
fylblkwnh
tator is provided with a surface DA word and instances
English Gloss: Samy has it set up in the balcony
of it’s usage from example sentences in the blogs and
they are required to provide the corresponding lemma,
MSA equivalent, English equivalent, gross dialect id. These sentences are shown to the annotator and s/he
Once they provide the lemma, they have to identify is asked to identify the number of lemmas for this
which example usage is associated with the lemma surface word. For instance, in the second example,
they created. All the lemma information is typed in we find sentences where mrkbh appears as “his boat”
using the CCO transcription scheme that COLABA and in the first example it appears as “ride”. Accord-
specifies. It is worth noting that this task is completely ingly in these examples, the annotator should indicate
independent from the Dialect Annotation task. Hence that there are three different lemmas for the surface
annotators could work directly on this task, i.e. after form mrkbh rendered in CCO transliteration scheme
fixing typos and sentence boundaries are identified and as rikib, markib, and merakib, respectively.
the DI process is run.
Figure 5 shows an illustrating example.
Accordingly, after the data undergoes the various
clean up steps mentioned earlier, the data goes through
the DI process as follows: 4. Morphological Profile Creation: Only for those words
which have been already annotated with the lemma in-
(a) Transliterate the Arabic script of the blogs into formation in the previous step do we proceed for fur-
the Buckwalter Transliteration scheme (Buck- ther annotation. For those lemmas in the database al-
walter, 2004) after the previous content clean ready, we add more detailed morphological informa-
up tasks of typo and sentence boundary han- tion. The annotator is shown one lemma at a time with
dling. This process also identifies the foreign a set of example sentences where the surface form of
word character encoding if they exist in the text; the lemma is used. Thereafter the annotator is asked to
(b) Use the DI pipeline to identify the DA words select a part of speech tag (POS-tag). Figure 7 shows
within each document; that when a POS-tag is selected the interface shows
the type of information required accordingly.
(c) Build a ranked list of all the surface DA words
observed in the input document set based on their
frequency of occurrence, while associating each In all these tasks, the application is always keeping track of
surface word with the sentences in which it oc- the time that took each annotator for each task unit. Such
curred in the document collection; information is necessary to compare speed and efficiency
among annotators and also for the annotators themselves
Thus, we have grouped the DA words by surface form to be able to compare themselves to the best and the worst
and used them as key entries in our database allowing across a task.
Figure 5: Illustrating example of lemma creation.

5. The Database

For this application we need a sophisticated database. We use the freely available Postgresql management system (http://www.postgresql.org/). The database system is able to save all the annotations which we have described in Section 3. But it also holds other information concerning the users and their annotation times. Our database contains 22 relational tables, which can be split into the following categories:

• Basic information: We have created a number of tables which save basic information that is necessary for all the tasks. By saving such information in the database our application becomes very easy to update and maintain. For instance, if we decide to add a new POS-tag we just have to add it in the appropriate table.

• Annotation: These tables are the core of the database. For each of the annotation tasks described in Section 4 they save all the information entered by the annotator while keeping the redundancy of the information at a minimum. For instance, for each sentence in the data, we want to save all the information entered about the dialect, the lemmas and the morphological information of the dialectal words while saving only one instance of the actual sentence in only one table and relating all the annotation records to it. Only by doing so are we able to save information about millions of words in the database while keeping it easily and efficiently accessible. Finally, we also save the time (in seconds) taken by the annotators to complete the annotation tasks. This information is necessary for inter-annotator speed comparison. (A schematic sketch of this design is given after this list.)

• Assignments: These tables hold information about how many task assignments have been assigned to each annotator and how many of them have already been annotated and submitted. This is directly related to the assignment task of the lead annotators described in Subsection 3.2.

• Users Permissions: When a user is created, we assign a certain category to her/him. This information is used by all the scripts to decide on user privileges.

• Connection: Whenever a user is connected, this information is communicated to the database. By doing so we are able to prevent a user from connecting from two machines at the same time.
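The 22-table schema itself is not spelled out in the paper, so the following Python sketch is only a schematic illustration of the design principle described under the Annotation category: the raw sentence is stored exactly once and per-word annotation records point back to it by key. The table and column names are hypothetical, and SQLite stands in for PostgreSQL purely so that the example is self-contained and runnable.

# Schematic sketch only: table and column names are hypothetical, not the
# actual COLANN_GUI schema. SQLite stands in for PostgreSQL so the example
# runs without a database server.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentence (              -- each raw sentence stored exactly once
    sentence_id INTEGER PRIMARY KEY,
    document_id INTEGER,
    text        TEXT
);
CREATE TABLE dialect_annotation (    -- one record per annotated word token
    annotation_id INTEGER PRIMARY KEY,
    sentence_id   INTEGER REFERENCES sentence(sentence_id),
    word_index    INTEGER,
    dialect       TEXT,
    cco_form      TEXT,              -- phonetic rendering in the CCO scheme
    annotator_id  INTEGER,
    seconds_spent REAL               -- used for inter-annotator speed reports
);
""")

conn.execute("INSERT INTO sentence VALUES (1, 42, 'dh sAmy mrkbh fylblkwnh')")
conn.execute(
    "INSERT INTO dialect_annotation VALUES (1, 1, 2, 'Egyptian', 'markib', 7, 12.5)"
)
for row in conn.execute("""
    SELECT s.text, a.word_index, a.dialect
    FROM dialect_annotation a JOIN sentence s USING (sentence_id)
"""):
    print(row)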
As mentioned in Section 1, our database is located on a separate server from the web server. This web server can send requests to the database server through an ssh tunnel which helps forward services between the two servers with encrypted data (see http://www.ssh.com/support/documentation/online/ssh/adminguide/32/Port Forwarding.html).

Figure 6: Illustrating example of dialect annotation.

Figure 7: Illustrating example of the requested information when an annotator chooses the POS-tag Noun or Verb.

6. Future Work

We are constantly updating our interface incorporating feedback from the annotators and lead annotators on the various tasks. The data that is annotated using our application is intended to build efficient models of four different dialects that cover all the major Arabic dialects. The models will be useful for several NLP applications:

• Automatic spelling correction;
• Automatic sentence boundary detection;
• Automatic dialect identification and annotation;
• Lemmatization and POS-tagging;
• Information Retrieval and Advanced search;
• Named Entity, foreign words and borrowed words detection.

However, it is not possible to aim at such advanced applications without a consistent annotation where the efficiency of the application which we describe in this paper plays a pivotal role.

Acknowledgments

This work has been funded by ACXIOM Corporation.
7. References
N. Aswani and R. Gaizauskas. 2009. Evolving a general
framework for text alignment: Case studies with two
south asian languages. In Proceedings of the Interna-
tional Conference on Machine Translation: Twenty-Five
Years On.
Y. Benajiba, M. Diab, and P. Rosso. 2008. Arabic named
entity recognition using optimized feature sets. In Pro-
ceedings of EMNLP’08, pages 284–293.
T. Buckwalter. 2004. Buckwalter Arabic Morphologi-
cal Analyzer Version 2.0. Linguistic Data Consortium,
University of Pennsylvania, 2002. LDC Catalog No.:
LDC2004L02, ISBN 1-58563-324-0.
D. Damljanovic, V. Tablan, and K. Bontcheva. 2008. A
Text-based Query Interface to owl Ontologies. In Pro-
ceedings of the 6th Language Resources and Evaluation
Conference (LREC).
M. Diab, N. Habash, O. Rambow, M. AlTantawy, and
Y. Benajiba. 2010a. COLABA: Arabic Dialect Annota-
tion and Processing. In Proceedings of the Language Re-
sources (LRs) and Human Language Technologies (HLT)
for Semitic Languages at LREC.
Mona Diab, Nizar Habash, Reem Faraj, and May Ahmar.
2010b. Guidelines for the Annotation of Dialectal Ara-
bic. In Proceedings of the Language Resources (LRs)
and Human Language Technologies (HLT) for Semitic
Languages at LREC.
C. A. Ferguson. 1959. Diglossia. Word, 15:325–340.
N. Habash, O. Rambow, M. Diab, and R. Kanjawi-Faraj.
2008. Guidelines for Annotation of Arabic Dialectness.
In Proceedings of the LREC Workshop on HLT & NLP
within the Arabic world.
J. Kahan, M.R. Koivunen, E. Prud’Hommeaux, and R.R.
Swick. 2001. Annotea: an open rdf infrastructure for
shared web annotations. In Proceedings of the WWW10
Conference.
D. Maynard. 2008. Benchmarking textual annotation tools
for the semantic web. In Proceedings of the 6th Lan-
guage Resources and Evaluation Conference (LREC).
M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni,
A. Stutt, and F. Ciravegna. 2002. Mnm: Ontology
driven semi-automatic and automatic support for seman-
tic markup. In Proceedings of the 13th International
Conference on Knowledge Engineering and Manage-
ment (EKAW).

Towards a Psycholinguistic Database for Modern Standard Arabic
Sami Boudelaaa, and William D. Marslen-Wilsona
a
MRC Cognition and Brain Sciences Unit
sami.boudelaa@mrc-cbu.cam.ac.uk, William.marslen-wilson@mrc-cbu.cam.ac.uk

Abstract
To date, there are no Arabic databases that provide distributional information about orthographically disambiguated words and
morphemes. Here we present ARALEX (Arabic Lexical database), a new tool providing type and token frequency counts for vowelled
Arabic surface words, stems, bigrams and trigrams. The database also provides type and token frequency for roots and word patterns.
Token frequency counts are based on an automatically annotated 40 million word corpus derived from different Arabic news papers,
while the type frequency counts are based on the Hans Wehr dictionary. This database is a valuable resource for researchers across
many fields. It is available for the community as a web interface and as a stand alone downloadable application on:
http://www.mrc-cbu.cam.ac.uk:8081/ARALEX.online/login.jsp

morpho-syntactic meaning of active perfective, to generate the


1. Introduction surface form [xatam] finish and with the pattern {fiEaal} to give
rise to the form [xitaam] termination. Although the meaning of a
Arabic has been gathering a lot of attention across
given surface form is not always componential, there is a
many fields both because of its socio-political significance and
reasonable amount of consistency (McCarthy, 1981).
because its linguistic characteristics present a sharp contrast
Stems like [xatam] and [xitaam] can be further augmented
with Indo-European languages. Important insights have been
with various inflectional affixes and enclitics. For instance the
gained through the study of Arabic phonology, syntax and
complex form [waxitaamuhaa] and its termination consists of
morphology (Ferguson, 1959; McCarthy, 1981; Plunkett &
the proclitic [wa] and, the surface form or stem [xitaam]
Nakisa, 1997; Boudelaa & Marslen-Wilson, 2010). These
termination, the ending [u] nominative marker, and the third
insights are largely limited however to theoretical linguistics.
person feminine singular possessive pronoun enclitic [ha]
Experimental research fields such as psycholinguistics,
her/its.
cognitive neuroscience and neurolinguistics on the other hand
are lagging behind in spite of their clear benefits for language The non-concatenative nature of the Arabic morphological
learning and language rehabilitation. One of the reasons for the system makes it an interesting subject of research for
scarcity of research in Arabic experimental and applied experimental and applied research disciplines. It raises
language fields is the lack of reliable databases that provide important questions with far reaching consequences both for
information about the distributional characteristics of cognitive and neurocognitive theories of language
representation and processing. For example, are the component
the words in the language. ARALEX, the database we describe morphemes of an Arabic word represented independently at a
here, promises to fill this gap by providing information about the cognitive and neural level? Can morphology as a domain of
frequency of words, morphemes, and letter combinations (i.e., knowledge affect language learning? What kind of
bigrams and trigrams) in Modern Standard Arabic (MSA). To neuro-cognitive challenges are raised by the process of reading
provide the basis for understanding the structure of ARALEX, unvowelled Arabic? Addressing these kinds of issues in the
and the choices we made to develop it, we first provide a brief context of Arabic can be promoted by a lexical database like
description of MSA highlighting its specific features that guided ARALEX that makes the design of well controlled experiments
the development of the ARALEX architecture easy and efficient.

2. Basic features of MSA 3. ARALEX architecture


MSA is a Semitic language characterized by a rich Any Arabic lexical resource that does not provide statistical
templatic morphology where effectively all content words (and information about roots and word patterns is bound to be of
most function words) are analyzable into a root and a word limited interest not only to psycholinguists and cognitive
pattern. The root is exclusively made up of consonants and is neuroscientists, but also to language learners and practitioners in
thought to convey a broad semantic meaning which will be general. For this reason we designed ARALEX to provide the
expressed to various degrees in the different forms featuring that typical token frequency information about surface forms,
particular root. By contrast, the word pattern is essentially made bigrams and trigrams along with information about the type and
up of vowels although a subset of consonants can be part of it. token frequencies of morphemes (i.e., roots and patterns).
The word pattern conveys morpho-syntactic information and ARALEX relies on two sources of information: (a) the Hans
defines the overall phonological structure of the word (Holes, Wehr Dictionary of Modern Arabic (Wehr, 1994) as used in the
1995; Versteegh, 1997). Unlike stems and affixes in dictionary of stems that comes with the Arabic Morphological
Indo-European languages, Arabic roots and word patterns are Analyzer (Buckwalter, 2002), and (b) a 40 million word corpus
interleaved within each other in a non-linear manner. For derived from various Arabic papers. The version of the stem
example the root {xtm} with the general meaning of finishing, dictionary we used consisted of 37, 494 different stems. These
can be interleaved with the pattern {faEal} 1 with the include native MSA words, assimilated and non-assimilated

1
We will be using the standard Buckwalter transliteration scheme throughout this article.

foreign words and proper Arabic and foreign nouns. For each stem we manually established the corresponding root and word pattern. This resulted in the identification of 6804 roots and 2329 word patterns. This dictionary is used as a look-up table to guide a deterministic parsing algorithm applied to each stem in the corpus.

The corpus consisted of 40 million words drawn from various Arabic online newspapers. The most challenging aspect of the corpus was the absence of vowel diacritics, which makes Arabic extremely ambiguous. In order to provide accurate frequency measures of surface forms and morphemes it was necessary to disambiguate the corpus by reinstating the missing vowels. Accordingly, we first stripped the corpus of its html tags, sliced it into manageable text files, and then submitted it to AraMorph (Buckwalter, 2002). For each word, defined as a string of letters with space on either side of it, AraMorph outputs (a) a vowelled solution of all the possible alternatives, (b) an exhaustive parse of the word into its component morphemes, and (c) a part-of-speech tag for each solution along with an English gloss.

To choose the correct solution for each word, we developed a novel automated technique using Support Vector Machines (Wilding, 2006). The output of this technique is a probability score reflecting the accuracy of the automatic vowelling, and an entropy score which measures the amount of uncertainty in the probability score (for more details the reader is referred to Wilding, 2006). In initial testing on 792 K words from the Arabic Treebank, the accuracy of this automatic vowelling procedure was 93% when case endings were not taken into account, and over 85% when case endings were included. When applied to the full corpus, the procedure was accurate 80% of the time for fully diacritized forms and 90% of the time for forms without case endings. These figures were further cross-validated against a randomly chosen 500 K words of automatically vowelled words that were also hand-annotated by a team of native Arabic speakers in Egypt (the hand annotation was conducted under the leadership of Dr Sameh Al-Ansary of Alexandria University, Egypt). The validation showed an overall accuracy of 77.9%, suggesting that the solutions chosen by the human annotators were also likely to be chosen by the automatic vowel diacritizer.

4. Combining the corpus and the dictionary

To provide type and token frequency counts, we combined the corpus and the dictionary into an integrated database (the integration of the corpus and the dictionary, and the development of the front-end interface for ARALEX, were done in collaboration with Ted Briscoe and Ben Medlock, in a contract with the iLexIR company). For every item in the corpus which has a stem in the dictionary, we determined the root and the word pattern using the dictionary as a deterministic look-up table. Around 0.44% of the corpus stems are not listed in the dictionary, and consequently we do not provide type frequency for such items but we do provide a token frequency. For the remaining 99.56% of the data we provide frequency counts for the orthographic form, the unpointed stem (i.e., the stem without vowels; we use the terms "unpointed", "unvowelled" and "non-diacritized" interchangeably), the pointed stem, the root, the word pattern, and the bigram and trigram frequency of the orthographic form, the root and the word pattern.

The orthographic form is defined as the graphic entity written with white space on either side of it. For instance the phrase وسيتطلب [wsytTlb] "and it will require" is an orthographic form. The unpointed stem is the output of AraMorph once the clitics and the affixes have been removed. In the example above the unpointed stem is [tTlb] while the pointed stem is [taTal~ab]. The root for this stem is {Tlb} and the pattern {tafaE~al}.

The token frequency statistics are computed from occurrence counts in the 40 million word corpus as the rate of occurrence per 1 million words of text, given by:

Freq(w) = occ(w) / (T / k)

where occ(w) is the number of occurrences of word w in the corpus, T is the total number of words in the corpus, and k = 1,000,000. The generation of token frequencies for orthographic forms consists simply in counting and normalizing the number of times each distinct orthographic form occurs in the corpus. Where the token frequencies of stems, roots and word patterns are concerned, the following procedure was followed: for each record in the corpus the pointed and unpointed stems are extracted, then their corresponding root and word pattern are located in the dictionary, and the occurrence of each of these four units (i.e., the pointed stem, the unpointed stem, the root, and the word pattern) is recorded. If a pointed stem is not found in the dictionary, the unpointed stem is used to match on dictionary entries without diacritics to get a set of pointed stem candidates. Then all corresponding roots and patterns for that set of stems are located and recorded, thus increasing recall at the cost of potentially decreasing precision.

The type frequencies of roots and patterns are simply raw counts and are extracted from the dictionary. Finally, the character n-gram frequencies (bigrams and trigrams) are computed from the 40 million word corpus for orthographic forms, roots, and word patterns as follows:

Freq(g) = occ(g) / (T / k)

where occ(g) is the number of occurrences of n-gram g in the corpus, T is the total number of n-grams in the corpus, and k = 1,000,000.

5. Aralex interface

Two interfaces are developed to support the use of ARALEX: a JSP/Java-based web interface, and a Java-based Command-Line Interface (CLI). Both are based on the Apache Lucene index tool (http://lucene.apache.org/java) and provide advanced query functionality with rapid response times.

The Web interface is aimed at the majority of users whose needs can be met by a set of predefined queries. It allows the user to query the database using either Buckwalter's transliteration scheme or Arabic script. Users can request the surface frequency for an orthographic form, a pointed stem, an unpointed stem, a root and a word pattern. They can also request the type frequency for roots and patterns, and the bigram and trigram frequencies for orthographic forms, roots and patterns. A list of items with specific characteristics can also be obtained. The output can be sorted by a search unit (e.g., orthographic form frequency or root type frequency) ordered in ascending or descending order. All the user needs to do is to enter a search term in the appropriate window, tick the appropriate boxes or indeed check all the boxes to have exhaustive information about the search string, and hit the search button.

The CLI offers a powerful, customizable method for querying ARALEX. The input to the CLI can be a single word or a text file,
allowing batch processing, and the output can be written into a file or displayed on the screen. To use the ARALEX CLI, the user needs to install Java JDK 5.0 or later, and Lucene 2.3.2 or later. An ARALEX command-line interface with Java class files and an ARALEX Lucene database index are also required and can be downloaded from the ARALEX website. Once these
components are available and the Lucene core JAR is on the
system classpath for the ARALEX CLI, the interface can be
invoked by the command java SearchDB. If successful, this
should display the input argument format, options, and field
names. At this stage the program requires the directory
containing the ARALEX index files to be specified. Invoking
the command java SearchDB index_dir, where index_dir is the
location of the database index, yields the prompt Enter query.
From now on, any valid Lucene query can be entered (for further
details refer to Boudelaa & Marslen-Wilson, 2010).
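For batch work, the interactive session described above can also be driven from a small wrapper script. The following Python sketch is a hedged example: it assumes only what is stated in the text (the java SearchDB index_dir invocation and the Enter query prompt reading queries from standard input); the index path is a placeholder, and field names and output format will depend on the actual installation.

# Hedged example of scripting the ARALEX CLI for a batch of queries.
# Assumes the SearchDB class files, the Lucene JAR and the index directory
# are set up as described in the text; the index path below is a placeholder.
import subprocess

INDEX_DIR = "/path/to/aralex_index"   # placeholder, not a real path

def run_queries(queries):
    # one query per line on stdin, mirroring the interactive "Enter query" prompt
    process = subprocess.run(
        ["java", "SearchDB", INDEX_DIR],
        input="\n".join(queries) + "\n",
        capture_output=True,
        text=True,
    )
    return process.stdout

if __name__ == "__main__":
    print(run_queries(["xatam", "xitaam"]))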

6. Conclusion
ARALEX is the first Arabic lexical database to provide
frequency information about vowelled words, morphemes and
letter and phoneme bigrams. It allows experimental researchers
to design well controlled experiments, and provides a valuable
source of information for natural language processing
development. It can also be used to derive basic and/or more
advanced vocabulary lists tailored to the needs of various
language learners.

7. Acknowledgements
This work is supported by a British Academy Large Research
Grant (LRG42466) and MRC (grants U.1055.04.002.00001.01).
The authors would like to thank Sameh Al-Ansary, Ted Briscoe,
Tim Buckwalter, Hubert Jin, Mohamed Maamouri, Fermin
Moscoso del Prado Martin, Dilworth B. Parkinson and Mark
Wilding for their help at different stages of the project.

8. References
Boudelaa, S., & Marslen-Wilson, W.D. (2010). ARALEX: A
lexical database for Modern Standard Arabic.
Behavioral Research Methods in press
Boudelaa, S., Pulvermüller, F., Hauk, O., Shtyrov, Y. &
Marslen-Wilson, W.D. (2010). Arabic morphology in the
neural language system. Journal of Cognitive
Neuroscience in press
Buckwalter, T. (2002). Buckwalter Arabic Morphological
Analyzer Version 1.0. Linguistic Data Consortium,
catalog number LDC2002L49, ISBN 1-58563-257-0.
Ferguson, C. A. (1959). Diglossia. Word, 15, 325-340.
Holes, C. (1995). Modern Arabic: Structures, functions and
varieties. London and New York: Longman.
Maamouri, M., & Bies, A. (2004). Developing an Arabic
Treebank: methods, guidelines, procedures, and tools.
Proceedings of the Workshop on Computational
Approaches to Arabic Script-based languages, p. 2-9.
Geneva, Switzerland.
McCarthy, J. J. (1981). A prosodic theory of non-concatenative
morphology. Linguistic Inquiry, 12, 373-418.
Plunkett, K. & Nakisa, R.C. (1997). A connectionist model of the
Arabic plural system. Language and Cognitive Processes,
12, 807-836.
Versteegh, K. (1997). The Arabic Language. Edinburgh University Press.
Wehr, H. (1994). Arabic-English Dictionary. Spoken Language Services, Inc., Ithaca, NY.
Wilding, M. (2006). Bootstrapping Arabic pointing and morphological structure. MPhil in Computer Speech, Language and Internet Technology. University of Cambridge, St Catherine's College.

Creating Arabic-English Parallel Word-Aligned Treebank Corpora at LDC
Stephen Grimes, Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma, Stephanie Strassel
Linguistic Data Consortium, University of Pennsylvania
3600 Market Street, Suite 810, Philadelphia, PA USA
E-mail: { sgrimes, xuansong, bies, skulick, xma, strassel}@ldc.upenn.edu

Abstract
This contribution describes an Arabic-English parallel word aligned treebank corpus from the Linguistic Data Consortium that is
currently under production. Herein we primarily focus on efforts required to assemble the package and instructions for using it. It was
crucial that word alignment be performed on tokens produced during treebanking to ensure cohesion and greater utility of the corpus.
Word alignment guidelines were enriched to allow for alignment of treebank tokens; in some cases more detailed word alignments are
now possible. We also discuss future annotation enhancements for Arabic-English word alignment.

1. Introduction

1.1 Parallel Treebanks
Multiple annotation of corpora is common in the development of computational linguistic language resources. Additional annotation increases potential information extraction from a given resource. For example, many existing parallel corpora have been developed into parallel treebanks, and for several language pairs there exist parallel treebank corpora. Parallel treebank corpora are parallel texts for which there exist manual parses for both languages (and possibly POS tags also). Examples include Czech-English (Hajic et al., 2001), English-German (Cyrus et al., 2003), English-Swedish (Ahrenburg, 2007), Swedish-Turkish (Megyesi et al., 2008), Arabic-English (Maamouri et al., 2005; Bies, 2006), and Chinese-English (Palmer et al., 2005; Bies et al., 2007). The latter corpora produced by LDC are of particular note due to their high data volume.

Parallel word-aligned treebank corpora appear to be rare, and their scarcity is likely due to their being very resource-intensive to create. The most prominent related corpus is called SMULTRON and is a parallel aligned treebank corpus for one thousand English, Swedish, and German sentences (Gustafson-Capkova et al., 2007). In SMULTRON, alignment is pairwise between each of the component languages, and annotation is permitted between syntactic categories, not exclusively between words.

1.2 Current Project
The present paper discusses key points in creating an Arabic-English parallel word-aligned treebank corpus. We have also included a brief description of this corpus in the LREC 2010 Language Resource Map.

As shown in Table 1, releases for this corpus began in 2009, and to date more than 325,000 words of Arabic and the corresponding English translation have been treebanked and word aligned. Each release includes data from one or more genres: newswire (NW), broadcast news transcripts (BN), or online web resources such as blogs (WB).

1.3 Organization of the Paper
The paper is structured as follows. Section 2 discusses development of Arabic and English treebanks. Section 3 discusses word alignment at LDC. Section 4 addresses issues faced in combining treebank and word alignment annotation. Section 5 has information about the corpus structure and how to use the data. Section 6 provides a critical analysis and discussion of future directions.

Release date   Genre   Words     Tokens    Sentences
4/9/2009       NW      9191      13145     382
9/21/2009      NW      182351    267520    7711
9/21/2009      BN      89213     115826    4824
10/24/2009     NW      16207     22544     611
10/24/2009     WB      6656      9478      288
1/29/2010      BN      9930      12629     705
1/29/2010      WB      12640     18660     565
Total                  326188    459802    15086
Table 1. Annotation volume as of May 2010. Figures reported for words and tokens refer to the Arabic source.

2. Development of Parallel Treebanks
The path towards construction of the resource under discussion could be considered to begin with the Arabic Treebank (ATB) corpus (Maamouri et al., 2005). Translation of the Arabic to English created parallel texts, and when the English-Arabic Translation Treebank (EATB) (Bies, 2006) is used in conjunction with the ATB, this serves as an English-Arabic parallel treebank corpus. Please refer to the documentation released with these corpora for additional discussion concerning construction, annotation guidelines, and quality control efforts that went into creating the individual treebanks.

In developing parallel treebanks, care must be taken to ensure sentence segments remain parallel from the original parallel corpus. Arabic sentences are often translated as multiple English sentences. Hence one Arabic tree may correspond to multiple English trees, and occasionally effort is required to enforce that sentence segments remain parallel. For a similar project involving an English-Chinese parallel word-aligned treebanked corpus, English and Chinese treebanking were performed independently at different locations, and the resulting corpora were only weakly parallel; an automatic sentence aligner was required to re-establish the parallel texts. We used Champollion, a lexicon-based sentence aligner, for robust alignment of the noisy data (Ma, 2006). Such a tool may be necessary for others creating parallel aligned
treebank corpora if the data inputs are not already sentence-wise parallel.

The Arabic Treebank (ATB) distinguishes between source and treebank tokens. While source tokens are generally whitespace-delimited words, the treebank tokens are produced using a combination of SAMA (Maamouri et al., 2009) for morphological analysis, selection from amongst alternative morphological analyses, and finally splitting of the source token into one or more treebank tokens based on clitic or pronoun boundaries.

For release as part of this corpus, the ATB and EATB are provided in Penn Treebank format (Bies et al., 1995). The trees are unmodified from the ATB/EATB releases except that the tokens were replaced with token IDs. This structure is discussed in greater detail in Section 5.

3. Word Alignment Annotation
At the LDC, word alignment is a manual annotation process that creates a mapping between words or tokens in parallel texts. While automatic or semi-automatic methods exist for producing alignments, we avoid these methods. Manual alignment serves as a gold standard for training automatic word alignment algorithms and for use in machine translation (cf. Melamed 2001; Véronis and Langlais 2000), and it is desirable that annotator decisions during manual alignment not be biased through use of partially pre-aligned tokens. It is felt that annotators may accept the automatic alignment and also lower annotator agreement at the same time.

Using higher-quality manual alignment data for training results in better machine translations. Fossum, Knight, and Abney (2008) showed that using Arabic and English parsers or statistical word alignment tools such as GIZA++ instead of gold standard annotations contributes to degradations in training data quality that significantly impact BLEU scores for machine translation. While automatic parsing and word aligning have their place in NLP toolkits, use of manually-annotated training data is always preferred if annotator resources are available.

3.1 Word Alignment Annotation Guidelines
LDC's word alignment guidelines are adapted from previous task specifications including those used in the BLINKER project (Melamed 1998a, 1998b). Single or multiple tokens (words, punctuation, clitics, etc.) may be aligned to a token in the opposite language, or a given token may be marked as not translated. Early LDC Arabic-English word alignment releases as part of the DARPA GALE program were generally based on whitespace tokenization.

Word alignment guidelines serve to increase annotator agreement, but different word alignment projects may have unique guidelines according to what is deemed translation equivalence. For example, are pronouns permitted to be aligned to proper nouns with which they are coindexed? Our point here is to encourage the corpus user to explore the alignment guidelines in detail to better understand the task.

3.3 Word Alignment and Tagging Tool
Word alignment is performed on unvocalized tokens rendered in Arabic script. LDC's word alignment tool allows annotators to simultaneously align tokens and tag them with metadata or semantic labels. A screenshot of the tool is shown in Figure 1.

The navigation panel on the right side of the software displays original (untokenized) source text to help annotators understand the context of surrounding sentences (which aids in, for example, anaphora resolution). Having untokenized source text also aids in resolving interpretation ambiguities that would arise if annotators could only see tokenized, unvocalized script.

3.3 Additional Tagging for Word Alignment
In addition to part-of-speech tags produced as part of treebank annotation, word alignment annotators have the option of adding certain language-specific tags to aid in disambiguation. A tagging task for Arabic-English has recently been added to the duties of word alignment annotators, and it is described as follows.

Unaligned words or phrases that have locally-related constituents to which to attach are tagged as "GLU" (i.e., "glue"). This indicates local word relations among dependency constituents. The following are some cases in which the GLU tag would be used:
- English subject pronouns omitted in Arabic.
- Unmatched verb "to be" for Arabic equational sentences.
- Unmatched pronouns and relative nouns when linked to their referents.
- Unmatched possessives ('s and ') when linked to their possessor.
- When a preposition in one language has no counterpart, the extra preposition attached to the object is marked GLU.
- Two or more prepositions in one language corresponding to one preposition in the other; the unmatched preposition would be tagged as GLU.

It is hoped that the presence of the GLU tag provides a clue to understanding the morphology better, and we will continue to explore using additional tags for this task.

4. Uniting Treebank and Word Alignment Annotation
This section describes efforts to join treebank and word alignment annotation.

4.1 Order of Annotation
The order of annotation in creating a parallel word-aligned treebank corpus is important. From the parallel corpus, the sentences can first be treebanked or word aligned. If word alignment were to proceed first, the tokens used for word alignment would serve as input to treebanking. However, treebank tokenization includes morphosyntactic analysis, and hence treebank tokenization is only determined manually during treebank annotation. For this reason, the preferred workflow is to only perform word-alignment annotation after experienced treebank annotators have fixed tokenization,

Figure 1. The PyQt-based tool used at LDC for word alignment annotation and tagging.

and it is this development trajectory we assume for the remainder of the paper.

4.2 Tokenization Modification
The word alignment guidelines were adapted so that annotation would be based on the treebank tokens instead of on source, whitespace tokens. As illustrated by the following examples, finer alignment distinctions may be made when pronouns are considered independent tokens. The examples below appear in the Buckwalter transliteration1 for convenience, but please note that the bilingual annotators work only with Arabic script.

Source: فزجّوه بالسجن
Transliterated: fa+ zaj~uwA +h b+ Alsijn
Morpheme gloss: and sent.3P him to jail
Gloss: "They sent him to jail."

Source: سيّارته معطّلة
Transliterated: say~Arap+h muEaT~alap
Morpheme gloss: car his broken
Gloss: "His car is broken."

1 We use the Buckwalter transliteration. Details are available at http://www.qamus.org/transliteration.htm.

In each case the "h" morpheme corresponding to third person singular is now considered an independent token and can be aligned to English "him" or "his" in the examples. Under previous Arabic-English word alignment guidelines, English "him" and "his" would have been aligned with the Arabic verb.

4.3 Empty Categories
In transitioning to word alignment on treebank tokens, all leaves of the syntax tree — including all Empty Categories — are considered to be tokens. This interpretation as tokens differs slightly from ATB- and EATB-defined treebank tokens, which do not include the Empty Category markers such as traces, empty complementizers, and null pro markers.

Our word alignment guidelines currently dictate that all Empty Category tokens are annotated as "not translated." One could imagine amending the guidelines to allow for the alignment of Empty Category markers to pronouns in the translation; this is not currently being practiced. The primary reason for including Empty Categories as tokens for word alignment is to ensure that, for each language, the number of tree leaves is identical to the number of word alignment tokens. This requirement somewhat simplifies the data validation process.

4.4 Data Validation
Validation of the data structures has both manual and automatic components.

4.4.1 Treebank validation
Throughout the Treebank pipelines, there are numerous stages and methods of sanity checks and content validation, to assure that annotations are coherent, correctly formatted, and consistent within and across annotation files, and to confirm that the resulting annotated text remains fully concordant with the original transcripts (for Arabic) or translations (for English), so that cross-referential integrity with the original data and with the English translations is maintained.

For both the Arabic Treebank and the English Treebank, quality control passes are performed to check for and correct errors of annotation in the trees. The CorpusSearch tool2 is used with a set of error-search queries created at LDC to locate and index a range of known likely annotation errors involving improper patterns of tree structures, node labels, and the correspondence between part-of-speech tags and tree structure. The errors found in this way are corrected manually in the treebank annotation files.

2 CorpusSearch is freely available at http://corpussearch.sourceforge.net

In addition, the Arabic Treebank (ATB) closely integrates the Standard Arabic Morphological Analyzer (SAMA) into both the annotation procedure and the integrity checking procedure. The interaction between SAMA and the Treebank is evaluated throughout the workflow, so that the link between the Treebank and SAMA is as consistent as possible and explicitly notated for each token.

For details on the integration between the ATB and SAMA, along with information about the various forms of the tokens that are provided, see Kulick, Bies and Maamouri (2010). For a general overview of the ATB pipeline, see Maamouri et al. (2010).

4.4.2 Word alignment validation
For word alignment, it is verified that all delivery files are well-formed. It is ensured that all tokens receive some type of word alignment annotation.

4.4.3 Validation of parallel word-aligned treebanks
To ensure consistency of the parallel aligned treebank, we verify that the set of tokens referenced by the treebank files coincides with the set of tokens appearing in the token and word alignment files. A schematic version of this check is sketched below, after the file inventory in Section 5.

5. Using the Corpus
This section provides information about the file format of the word-aligned treebanked data we are releasing. A typical release will contain seven files for each source document:
-- Arabic source, collected from newswire, television broadcast, or the web
-- English translation of the Arabic source
-- Tokenized Arabic, resulting from treebank annotation
-- Tokenized English, resulting from treebank annotation
-- Treebanked Arabic
-- Treebanked English
-- Word alignment file

The parallel treebank is a standoff annotation with multiple layers of annotations, with upper layer annotation referring to lower layer data (using character offsets). The diagram in Figure 2 shows the dependencies between files in the release.
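To illustrate the kind of cross-file consistency check described in Section 4.4.3, the sketch below compares the token IDs defined in a token file against the IDs referenced by the corresponding tree and word alignment files. The file layout assumed here (one token record per line, with the token ID as the first whitespace-separated field, and annotation files referencing IDs as plain integer fields) is a simplification for illustration only, not the exact LDC release format.

# Sketch of a token-ID consistency check over a standoff release.
# Assumed (simplified) layout: the token file lists one record per line with
# the token ID in the first field; tree and word-alignment files reference
# token IDs as whitespace-separated integer fields.
import sys

def defined_ids(token_file):
    """Token IDs defined in the token file (first field of each line)."""
    ids = set()
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0].isdigit():
                ids.add(fields[0])
    return ids

def referenced_ids(annotation_file):
    """Integer fields used in a tree or word-alignment file."""
    refs = set()
    with open(annotation_file, encoding="utf-8") as f:
        for line in f:
            refs.update(tok for tok in line.split() if tok.isdigit())
    return refs

def check(token_file, annotation_files):
    defined = defined_ids(token_file)
    for ann in annotation_files:
        missing = referenced_ids(ann) - defined
        status = "OK" if not missing else f"{len(missing)} undefined token IDs"
        print(f"{ann}: {status}")

if __name__ == "__main__":
    # e.g. python check_tokens.py doc.ar.tkn doc.ar.tree doc.wa
    check(sys.argv[1], sys.argv[2:])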
Figure 2. File structure illustration

The word alignment file and the Arabic and English tree files have token numbers which reference the Arabic and English token files. Within the token files, each token number for each sentence is expanded to give additional information. For each token in the English token files, the token number is listed, followed by a character range in the raw file to which the token corresponds, and then finally the token itself. For Arabic, multiple versions of each token are provided (unvocalized, vocalized, input string) and in multiple formats (Arabic script, Buckwalter transliteration).

We considered distributing the corpus in a single XML-based file. We felt the present structure has the following advantages:
-- the format of each type of file (raw, tokenized, tree, wa) is not modified and hence the same tools researchers wrote before can still be used;
-- the data are more easily manipulated; with XML it is necessary to fully parse the XML files for even trivial tasks;
-- it is easier and less error-prone to put the package together using separate files than using XML; and
-- separate files are more human readable.

6. Discussion
Annotator agreement for the Arabic-English word alignment task is approximately 85% after first pass annotation and higher after a quality round of annotation. In the future we plan to add additional morphosyntactic or semantic tags to the word alignment portion of the task.

We are also investigating methods for improving automatic and semi-supervised error detection. We wish to flag statistically unlikely alignments for human annotator review. Additionally, by incorporating phrase structure from treebank annotation, we might examine alignments which cross certain phrase boundaries.

7. Acknowledgements
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Figure 3. A view of Arabic (above) and English (below) word-aligned treebanks as displayed by TreeAligner3.
3 http://kitt.cl.uzh.ch/kitt/treealigner

8. References
Bies, A., Ferguson, M., Katz, K., and MacIntyre, R. (1995). Bracketing guidelines for treebank II style, Penn Treebank Project. University of Pennsylvania technical report.
Bies, A. (2006). English-Arabic Treebank v 1.0. LDC Cat. No.: LDC2006T10.
Bies, A., Palmer, M., Mott, J., and Warner, C. (2007). English Chinese Translation Treebank v 1.0. LDC Cat. No.: LDC2007T02.
Cyrus, L., Feddes, H., and Schumacher, F. (2003). Fuse – a multilayered parallel treebank. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories.
Fossum, V., Knight, K., and Abney, S. (2008). Using Syntax to Improve Word Alignment Precision for Syntax-Based Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 44-52, Columbus, June 2008. Association for Computational Linguistics.
Gustafson-Capkova, S., Samuelsson, Y., and Volk, M. (2007). SMULTRON (version 1.0) - The Stockholm MULtilingual parallel TReebank. Department of Linguistics, Stockholm University, Sweden.
Hajic, J., Hajicova, E., Pajas, P., Panevova, J., Sgall, P., and Vidova Hladka, B. (2001). The Prague Dependency Treebank 1.0 CDROM. LDC Cat. No. LDC2001T10.
Kulick, S., Bies, A., and Maamouri, M. (2010). Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank. In Proceedings of the Seventh
International Conference on Language Resources and
Evaluation (LREC 2010).
Ma, X. (2006). Champollion: A Robust Parallel Text
Sentence Aligner. In LREC 2006: Fifth International
Conference on Language Resources and Evaluation.
Maamouri, M., Bies, A., Buckwalter, T., and Jin, H.
(2005). Arabic Treebank: Part 1 v 3.0 (POS with full
vocalization + syntactic analysis). LDC Cat. No.:
LDC2005T02.
Maamouri, M., Bies, A., Kulick, S., Zaghouani, W., Graff,
D., and Ciul, M.. (2010). From Speech to Trees:
Applying Treebank Annotation to Arabic Broadcast
News. In Proceedings of the Seventh International
Conference on Language Resources and Evaluation
(LREC 2010).
Maamouri, M., Graff, D., Bouziri, B., Krouna, S., and
Kulick, S. (2009). LDC Standard Arabic
Morphological Analyzer (SAMA) v. 3.0. LDC Catalog
No.: LDC2009E44. Special GALE release to be
followed by a full LDC publication.
Megyesi, B., Dahlqvist, B., Pettersson, E., and Nivre, J.
(2008). Swedish-Turkish Parallel Treebank. In
Proceedings of Fifth International Conference on
Language Resources and Evaluation (LREC 2008).
Melamed, D.I. (2001). Empirical Methods for Exploiting
Parallel Texts. MIT Press.
Melamed, D.I. (1998a). Annotation Style Guide for the
Blinker Project. University of Pennsylvania (IRCS
Technical Report #98-06).
Melamed, D.I. (1998b). Manual Annotation of
Translational Equivalence: The Blinker Project.
University of Pennsylvania (IRCS Technical Report
#98-07).
Palmer, M., Chiou, F.-D., Xue, N., and Lee, T.-K. (2005)
LDC2005T01, Chinese Treebank 5.0.
Véronis, J. and Langlais, P. (2000). Evaluation of Parallel
Text Alignment Systems -- The ARCADE Project. In J.
Véronis (ed.) Parallel Text Processing, Text, Speech
and Language Technology. Dordrecht, The Netherlands:
Kluwer Academic Publishers, pp. 369-388.

Using English as a Pivot Language to Enhance Danish-Arabic Statistical
Machine Translation
Mossab Al-Hunaity, Bente Maegaard, Dorte Hansen
Center for Language Technology
University of Copenhagen
musab@hum.ku.dk,bmaegaard@hum.ku.dk,dorteh@hum.ku.dk

Abstract
We inspect two pivot strategies for a Danish-Arabic statistical machine translation (SMT) system: a phrase translation pivot strategy and a sentence translation pivot strategy. English is used as the pivot language. We develop two SMT systems, Danish-English and English-Arabic, using different English-Arabic and English-Danish data resources. Our final results show that the SMT system developed under the sentence translation pivot strategy outperforms the system developed under the phrase translation pivot strategy, especially when common parallel corpora are not available.

processing tools to enhance our baseline


1. Introduction performance. Although a parallel corpus is not
Developing a statistical machine translation (SMT) available for the Danish-Arabic pair, there are lots of
system between any two languages usually requires a parallel English-Arabic and English-Danish resources
common parallel corpus. This is used in training the available. This makes English as a pivot language
SMT in translating the source language to the target between Arabic and Danish a favorable choice. Still
language. Bilingual corpora are usually available for any language can be used as a pivot language. Our
widely spread language pairs like Arabic English experiments use two separate corpora for Danish–
Chinese, etc, but when trying to develop SMT English and English-Arabic SMT systems. Having
systems for languages pair like Arabic-Danish a English as a pivot Language we apply two different
bilingual corpus unfortunately doesn’t exist. The pivot strategies:
limited data resources make developing SMT for
Arabic-Danish a real challenge. To the best of our - Phrase translation pivot strategy.
knowledge, there has not been much direct work on - Sentence translation pivot strategy.
SMT for the Danish-Arabic language pair. Google
These methods are based on techniques developed
Translate which is a free web translation service
by Utiyama, Isahara (2007), but we apply these
provides the option for translation from Danish to
techniques with a different perspective. We use non
Arabic. Google Translate web service uses gigantic
parallel corpora as a source of training data and not
monolingual texts collected by its crawling engine to
corpora with common text. We develop two
build massive language models. Aligned bilingual
baselines: Danish-English system that is piped with
language resources collected through web makes it
another English-Arabic system to translate from
easy for Google to build SMT between any language
Danish into Arabic. Each system has different
pairs. Google performs better between languages pair
training corpus from the other. Corpora yet share or
which has huge common data resources like the case
intercross partially in domain. Languages nature
in English and Arabic or English and Chinese. For
represents another challenge for our baseline. Our
pairs like Arabic and Danish Google translation
System languages are from completely different
quality is quite less than other pairs. A possible
families which affect experiment results greatly.
explanation for that is the lack of common parallel
Another interesting factor is the training data
available resources which control the SMT learning
resources. Many previous efforts on SMT systems
and performance. In our work we don’t consider
with pivot language were carried on parallel corpora
language resources factor alone, but also we
where data was aligned on sentence level; languages
concentrate on language specific details like syntax
either were from the same nature like European
and morphology to tune SMT learning for our
languages Koehn (2009), or they shared a parallel
Danish-Arabic baseline. We also utilize text
data for the source pivot and target. For example

Habash and Hu (2009) used English as a pivot phrase translation strategy consistently outperformed
language between Chinese and Arabic where the the sentence translation strategy in their controlled
three languages in their system were based on the experiments. Habash and Hu (2009) used English as
a pivot language while translating from Arabic to
same text. Our work differs in that we train our two
Chinese. Their results showed that pivot strategy
systems on two different unrelated sets of data. This outperforms direct translation systems. Babych et al.
is due to the fact of scares parallel data resources (2007) used Russian language as a pivot from
between Danish and Arabic. Many pivot strategies Ukrainian to English. Their comparison showed that
are suggested in previous studies like the case with it is possible to achieve better translation quality with
Bertoldi et. al (2008) ,Utiyama, Isahara (2007) and pivot approach. Kumar et al. (2007) improved
Habash and Hu (2009). We choose to apply our Arabic-English MT by using available parallel data in
other languages. Their approach was to combine
experiments on two strategies; namely phrase
word alignment systems from multiple bridge
translation and sentence translation, due to the languages by multiplying posterior probability
available data resources and to hold more control on matrices. This approach requires parallel data for
experiments conditions. We plan to inspect further several languages, like the United Nations or
techniques on Danish Arabic SMT system in future European Parliament corpus. An approach based on
work. Our results show that using English as a pivot phrase table multiplication is discussed in Wu and
Wang (2007) .Phrase table is formed for the training
language is possible with partially comparable
process. Scores of the new phrase table are computed
corpora and produces reasonable results. We discover by combining corresponding translation probabilities
that sentence translation strategy outperforms phrase in the source-pivot and pivot-target phrase-tables.
translation strategy, especially when none parallel or They also focused on phrase pivoting. They proposed
common resources are available. We compare our a framework with two phrase tables: one extracted
experiments results with Google Translate to judge from a small amount of direct parallel data; and the
system performance. Finally we discuss future other extracted from large amounts of indirect data
with a third pivoting language. Their results were
research directions we find interesting to enhance our
compared with many different European language as
baseline performance. In the next section we well as Chinese-Japanese translation using English as
describe related work. Section 3 presents our system a pivoting language. Their results show that simple
description. In section 4 we describe our data and pivoting does not improve over direct MT. Utiyama
present our pivot experiments details. We present our and Isahara (2007) inspected many phrase pivoting
system performance results in section 5. Finally we strategies using three European languages (Spanish,
discuss our conclusions and future work in section 6. French and German). Their results showed that
pivoting does not work as well as direct translation.
Bertoldi et. al (2008) compare between various
2. Related Work approaches of PBSMT models with pivot languages.
There has been a lot of work on translation from Their experiments were on Chinese-Spanish
Danish to English Koehn (2009), and from Arabic to translation via disjoint or overlapped English as pivot
English Sadat and Habash( 2006) , Al-Onaizan and language. We believe that we are the first to explore
Papineni, (2006).Many efforts were spent to the Danish-Arabic language pair directly in MT. We
overcome the lack of parallel corpora with pivot also apply pivoting techniques on none parallel text
methods. For example, Resnik and Smith (2003) corpora.
developed a technique for mining the web to collect 3. System Description
parallel corpora for low-density language pairs.
Munteanu and Marcu (2005) extract parallel In our work we develop two base lines for each
sentences from large Chinese, Arabic, and English experiment, Danish English and English Arabic.
non-parallel newspaper corpora. Statistical machine Translation direction is from Danish to Arabic.
translation with pivot approach was investigated by Moses 1 package is used for training the base lines.
many researchers. For example Gispert and Mario The system partition the source sentence into phrases.
(2006) used Spanish as a bridge for their Catalan- Each phrase is translated into a target language
English translation. They compared two coupling
strategies: cascading of two translation systems phrase. We use GIZA++ Och and Ney (2003) for
versus training of system from parallel texts whose word alignment.
target part has been automatically translated from
pivot to target. In their work they showed that the 1: Moses Package http://www.statmt.org/moses/

We use the Pharaoh System suite to build the phrase table and decode (Koehn, 2004). Our language models for both systems were built using the SRILM toolkit Stolcke (2002). We use a maximum phrase length of 6 to account for the increase in length of the segmented Arabic. Our distortion limit is set to 6. Finally, we use the BLEU metric Papineni et al. (2001) to measure performance.

4. Pivot Strategy
We use the phrase-based SMT system described in the previous section to deploy our pivot methods. We inspect two pivot strategies: phrase translation and sentence translation. In both strategies we use English as the pivot language; Danish and Arabic represent the source and target languages. In the phrase translation strategy we directly construct a Danish-Arabic phrase translation table from a Danish-English and an English-Arabic phrase table. In the sentence translation strategy we first translate a Danish sentence into n English sentences and translate these n sentences into Arabic separately. We select the highest scoring sentence from the Arabic sentences.

4.1 Sentence Translation Experiment
The sentence translation strategy uses two independently trained SMT systems: a direct Danish-English system and a direct English-Arabic system. We translate every Danish sentence d into n English sentences e {e1, e2, ..., en} using the Danish-English SMT system. Then we translate each sentence e into Arabic sentences a {a1, a2, ..., an}. We estimate the sentence pair feature score according to formula (1) below.

S(s, t) = Σ_{n=1}^{8} (α_{sn} β_{sn} + α_{tn} β_{tn})    (1)

Here α_{sn} β_{sn} and α_{tn} β_{tn} are the feature functions for the source and target (s, t) sentences respectively. The feature functions represent: a trigram language model probability of the target language, two phrase translation probabilities (both directions), two lexical translation probabilities (both directions), a word penalty, a phrase penalty, and a linear reordering penalty. Further details on these feature functions are found in (Koehn, 2004; Koehn et al., 2005). We choose to limit the number of English translations of any Danish sentence to three due to performance issues. We pass the translation with the maximum feature score as input to the English-Arabic system. A schematic sketch of this selection step is given after Table 2 below.

4.2 Phrase Translation Experiment
In the phrase translation strategy we need to construct a phrase table to train the phrase-based SMT system. We need a Danish-English phrase table and an English-Arabic phrase table. From these tables, we construct a Danish-Arabic phrase table. We use a matching algorithm that identifies parallel sentence pairs among the tables. This process is explained in Munteanu and Marcu (2005). We identify candidate sentence pairs using a word-overlap filter tool 1. Finally, we use a classifier to decide if the sentences in each pair are a good translation of each other and update our Danish-Arabic phrase table with the selected pair.

4.3 Data
Data collection was a great challenge for this experiment. Our data resources are from two groups: Arabic-English and English-Danish. Table 1 shows a brief description of our data resources. The English-Arabic corpora domain intercrosses with the English-Danish corpora domain to some reasonable degree.

Name                    Direction        Domain               Size (words)
Acquis                  Danish-English   Legal issues / News  7.0 M
UN multilingual corpus  Arabic-English   Legal issues / News  3.2 M
Meedan                  Arabic-English   News                 0.5 M
LDC2004T17              Arabic-English   News                 0.5 M
Table 1: Corpus resources

Sample                       Lines    Words
Training    Small            30 K     1 M
            Medium           70 K     2 M
            Large            100 K    3 M
Test        Test (Parallel)  1 K      19 K
Table 2: Training and testing data sizes

For the Arabic-English group we selected three major resources. The first was the United Nations (UN) multilingual corpus 2, which is available at the UN web site.

1: JTextPro http://sourceforge.net/projects/jtextpro/
2: UN Corpus http://www.uncorpora.org/
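The selection step of the sentence translation strategy (Section 4.1, formula 1) can be sketched as follows. The functions da_to_en and en_to_ar are assumed wrappers around the two direct SMT systems, each returning an n-best list of candidate sentences with their eight feature values; these names and the data layout are illustrative and are not part of the Moses toolkit.

# Sketch of the sentence-translation pivot strategy of Section 4.1.
# da_to_en(sentence, n) and en_to_ar(sentence, n) are assumed wrappers around
# the Danish-English and English-Arabic systems; each returns a list of
# (candidate sentence, [eight feature values]) pairs. The weight lists play
# the role of the alpha coefficients in formula (1).

def pair_score(en_feats, ar_feats, en_weights, ar_weights):
    # S(s, t) = sum over the eight features of alpha_s*beta_s + alpha_t*beta_t
    return (sum(a * b for a, b in zip(en_weights, en_feats)) +
            sum(a * b for a, b in zip(ar_weights, ar_feats)))

def pivot_translate(danish_sentence, da_to_en, en_to_ar,
                    en_weights, ar_weights, n_best=3):
    """Danish -> n-best English -> Arabic; keep the highest-scoring Arabic."""
    best_arabic, best_score = None, float("-inf")
    for english, en_feats in da_to_en(danish_sentence, n_best):
        for arabic, ar_feats in en_to_ar(english, n_best):
            score = pair_score(en_feats, ar_feats, en_weights, ar_weights)
            if score > best_score:
                best_arabic, best_score = arabic, score
    return best_arabic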

It enjoys a good quality of translation and it contains about 3.2 M lines of data and about 7 M words. The second resource was the Meedan 1 corpus, which is a newly developed Arabic-English corpus mainly compiled from the internet and news agencies; it contains more than 0.5 M Arabic words. The third resource was provided by the LDC 2 (catalog no. LDC2004T17); it contains more than 0.5 M words and also covers the news domain. For the English-Danish category we selected the Acquis 3 corpus, which contains more than 8 K documents and more than 7 M words. Acquis contains many legal documents that cover many domains. English-Arabic resources were extracted and aligned using the Okapi 4 translation memory editor. With the Acquis corpus we used the tools available at the Acquis website for extracting and aligning Danish-English text. All data were tokenized and lowercased separately. In order to inspect the size factor on our SMT system, the data were compiled into three sets: Large, Medium and Small. Table 2 illustrates the training data size for each set. For testing data we collected a parallel Arabic-English-Danish text from the UN Climate Change conference 2009, which was held in Copenhagen 5. We extracted 1 K sentences for each language. The English-Arabic corpora domain intercrosses with the English-Danish corpora domain to some reasonable degree. We are aware that there might be some bias in the coverage of the data resources, but due to data availability our corpora can still serve our experiments' objectives. Given the expense involved in creating direct Arabic-Danish parallel text and given the large amounts of Arabic-English and English-Danish data, we think our approach to collecting data for our experiment is still valid and interesting.

5. Results and Evaluation
We measure our system performance using BLEU scores Papineni et al. (2002). We compare our system performance with the Google Translate web service. Comparison with Google provides us with a general performance indicator for our system. Table 3 presents our direct translation system results for the DA-EN and EN-AR baselines. As expected, BLEU scores increase when we increase the training data size. We use the same testing data described in section 4.3 with Google Translate; results are described in Table 4. Google outperforms our direct system results, especially for the EN-AR direct translation.

Training Data Size   DA-EN   EN-AR
Small                20.3    25.1
Medium               21.4    26.3
Large                23.1    27.1
Table 3: BLEU scores for the direct sentence-based SMT systems.

Our direct DA-EN system's BLEU score was 23, which is 64% of Google's BLEU score, while the EN-AR system's BLEU score was 27.1, which is 40% of Google's BLEU score.

               DA-EN   EN-AR   DA-AR   DA-EN-AR
Test sample    36.0    67.0    30.0    30.0
Table 4: BLEU scores for the Google Translate web service on our test sample.

In Table 5 we present the results of the sentence pivoting system and the phrase pivoting system. The sentence-based strategy outperforms the phrase-based strategy. For the large training data set the system achieved a score of 19.1 for the sentence-based system compared with 12.9 for the phrase-based strategy. This result differs from previous similar studies such as Utiyama and Isahara (2007) and Habash and Hu (2009), where the pivot strategy outperforms the sentence strategy. The pivot system was not better because of the quality and quantity of the DA-EN-AR phrase table entries received from the matching algorithm. The pivot system is dependent on the matching algorithm, and enhancing it will enhance system performance. Google's DA-EN and DA-EN-AR results were the same. This is a good indicator that Google uses a pivot approach between languages with limited resources, as is the case for Arabic and Danish. Figure 1 presents a sample of our best performing system results, compared with the Google Translate web service. The sample shows both the original text and its translation, and our system's translation results for the same text.

1: Meedan http://github.com/anastaw/Meedan-Memory
2: LDC http://www.ldc.upenn.edu/
3: Acquis http://langtech.jrc.it/JRC-Acquis.html
4: Okapi http://okapi.sourceforge.net/
5: Cop15 http://en.cop15.dk

Size     Sentence Based Pivot Strategy   Phrase Based Pivot Strategy
         (Da-En-Ar)                      (Da-En-Ar)
Small    15.0                            11.4
Medium   16.9                            12.3
Large    19.1                            12.9
Table 5: BLEU scores for the phrase-based and sentence-based pivot SMT systems.

6. Conclusion and Future work
Developing an SMT system between two languages that do not share many linguistic resources, such as the Danish-Arabic language pair, is a quite challenging task. We presented a comparison between two common pivot strategies: phrase translation and sentence translation. Our initial results show that the sentence pivot strategy outperforms the phrase strategy, especially when common parallel corpora are not available. We compared our system results with the Google Translate web service to estimate relative progress, and the results were promising. In the future we plan to enhance our pivoting techniques. The phrase pivot strategy is still a promising technique that we need to utilize with our baseline. The phrase pivot strategy performs better when more parallel data resources are available, so we plan to collect more parallel training data for our baseline. We also plan to apply state-of-the-art alignment techniques and to use word reordering tools on our system training data. This will enhance our SMT system's learning process. We also plan to train our SMT system to fit domain-specific areas like weather or climate domains. We target high quality pivot techniques that will help us outperform available commercial tools like Google Translate, especially for domain-specific SMT areas.

Jeg tror, at en af de store mangler ved Kyoto var, at den officielle delegation kom tilbage med en
DA
Reference

aftale, som de vidste aldrig ville blive vedtaget i senatet.


EN I think that a major shortcoming of Kyoto was that the official delegation came back with a treaty they
knew was never going to make it through the Senate
AR ‫َأعخقذ أن أحذ أَجً القصُر الرئٍسٍت فً كٍُحُ ٌُ أن الُفذ الرسمً عبد مع معبٌذة كبوُا على علم أوٍب له حمر خالل مجلس الشٍُخ‬
System ‫ كبن الُفذ الرسمً جبء ٌعُد مع أوٍم اعلم كبن له ٌخماعخمبدي‬Kyoto ً‫َأعخقذ أن أحذ مشبكل الرئٍسٍت ف‬

Google ‫اعخقذ ان احذ العٍُة الرئٍسٍت فً كٍُحُ ٌُ أن الُفذ الرسمً عبد الى احفبق مع أوٍم ٌعرفُن له ٌخم اعخمبدي فً مجلس الشٍُخ‬

DA Men selv om udledningen af drivhusgasser forventes at falde på grund af faldende aktivitet i


industrien, tror de Boer ikke, det vil mindske presset på landene om at handle og underskrive en ny
aftale.
EN But even though greenhouse gas emissions are expected to slow down as a result of shrinking
Reference

industrial activities ,de Boer does not believe it will lessen the pressure on countries to act and sign a
new treaty.
AR ‫ دي بٌُر ال ٌعخقذ أن‬، ‫َ على الرغم مه االوبعبثبث الغبزٌت لبٍج الذفٍئت مه المخُقع أن حىخفط وخٍجت الوخفبض األوشطت الصىبعٍت‬
‫رلك سُف ٌقلل مه الضغظ على الذَل للعمل َالخُقٍع على معبٌذة جذٌذة‬
System ‫ َسخحذد الضغُط على البالد لعمل على‬, ً‫اس الىشبط الخىبزل‬
‫حخى على الرغم مه اوبعبثبث غبزاث الحرارة مه المخُقع حىخفط على أس‬
. ‫االحفبقٍت جذٌذة‬
Figure 1: Selected samples of system translation result

References

A. de Gispert and J. B. Mario, “Catalan-english Kishore Papineni, Salim Roukos, Todd Ward, and
statistical machine translation without parallel WeiJing Zhu. 2002. BLEU: a method for
corpus: bridging through spanish,” in Proc. of 5th automatic evaluation of machine translation. In
International Conference on Language Resources ACL.
and Evaluation (LREC), Genoa, Italy, 2006.
Masao Utiyama and Hitoshi Isahara. 2007. A
Andreas Stolcke. 2002. SRILM - an extensible comparison of pivot methods for phrase-based
language modeling toolkit. In ICSLP. statistical machine translation. In Proceedings of
NAACL-HLT’07, Rochester, NY, USA
Bogdan Babych, Anthony Hartley, and Serge
Sharoff. 2007. Translating from underresourced Munteanu and Marcu (2005)Dragos Stefan
languages: comparing direct transfer against pivot Munteanu and Daniel Marcu. 2005. Improving
translation. In Proceedings of MT Summit XI, machine translation performance by exploiting
Copenhagen, Denmark. non-parallel corpora. Computational
Linguistics,31(4):477–504.
Callison-Burch et al. (2006) Chris Callison-Burch,
Philipp Koehn, Amittai Axelrod, Alexandra Birch Nicola Bertoldi, Madalina Barbaiani, Marcello
Mayne, Miles Osborne, and David Talbot. 2005. Federico, Roldano Cattoni, 2008 ,Phrase-Based
Edinburgh system description for the 2005 Statistical Machine Translation with Pivot
IWSLT speech translation evaluation. InIWSLT. Languages, Proceedings of IWLST , USA.

Callison-Burch et al (2006)Chris Callison-Burch, Nizar Habash and Jun Hu.2009. Improving Arabic-
Philipp Koehn, and Miles Osborne. 2006. Chinese Statistical Machine Translation using
Improved statistical machine translation using English as Pivot Language. Proceedings of the
paraphrases. In Proceedings of HLT-NAACL’06. Fourth Workshop on Statistical Machine
New York, NY, USA Translation , pages 173–181.

Franz Josef Och and Hermann Ney. 2003. A


systematic comparison of various statistical Philipp Koehn, Alexandra Birch and Ralf
alignment models. Computational Linguistics, Steinberger: 462 Machine Translation Systems for
29(1):19–51. Europe, in Proceedings of the 12th MT Summit,
(Ottawa, Canada, 26-30 August, 2009), p. 65-72.
Fatiha Sadat and Nizar Habash. 2006. Combination
of Arabic preprocessing schemes for statistical Philip Resnik and Noah A. Smith. 2003. The web as
machine translation. In Proceedings of Coling a parallel corpus. Computational
ACL’06. Sydney, Australia Linguistics,29(3):349–380.

Shankar Kumar, Franz Och, and Wolfgang


H. Wu and H. Wang, “Pivot language approach for Macherey.2007. Improving word alignment with
phrase-based statistical machine translation,” in bridge languages. In Proceedings of
Proc.of the 45th Annual Meeting of the EMNLPCoNLL’ 07, Prague, Czech Republic.
Association of Computational Linguistics. Prague,
Czech Republic,2007, pp. 856–863. Yaser Al-Onaizan and Kishore Papineni. 2006
Distortion models for statistical machine
Ibrahim Badr, Rabih Zbib, and James Glass 2008. translation.In Proceedings of Coling-ACL’06.
Syntactic Phrase Reordering for English-to- Sydney, Australia.
Arabic Statistical Machine Translation. In Proc. of
ACL/HLT.

Jakob Elming,2008, Syntactic Reordering Integrated


with Phrase-based SMT , ACL proceedings.

Using a Hybrid Word Alignment Approach for Automatic Construction and
Updating of Arabic to French Lexicons
Nasredine Semmar, Laib Meriama
CEA, LIST, Vision and Content Engineering Laboratory,
18 route du Panorama, Fontenay-aux-Roses, F-92265, France
nasredine.semmar@cea.fr, meriama.laib@cea.fr

Abstract
Translation lexicons are vital in machine translation and cross-language information retrieval. The high cost of lexicon development
and maintenance is a major entry barrier for adding new languages pairs. The integration of automatic building of bilingual lexicons
has the potential to improve not only cost-efficiency but also accuracy. Word alignment techniques are generally used to build bilingual
lexicons. We present in this paper a hybrid approach to align simple and complex words (compound words and idiomatic expressions)
from a parallel corpus. This approach combines linguistic and statistical methods in order to improve word alignment results. The
linguistic improvements taken into account refer to the use of an existing bilingual lexicon, named entities detection and the use of
grammatical tags and syntactic dependency relations between words. The word aligner has been evaluated on the MD corpus of the
ARCADE II project which is composed of the same subset of sentences in Arabic and French. Arabic sentences are aligned to their
French counterparts. Experimental results show that this approach achieves a significant improvement of the bilingual lexicon with
simple and complex words.

and morpho-syntactic analysis on source and target


1. Introduction sentences in order to obtain grammatical tags of
Translation lexicons are a vital component of several words and syntactic dependency relations (Debili &
Natural Language Processing applications such as Zribi, 1996; Bisson, 2001).
machine translation (MT) and cross-language information • A combination of the two previous approaches
retrieval (CLIR). The high cost of bilingual lexicon (Daille et al., 1994; Gaussier, 1995; Smadja et al.,
development and maintenance is a major entry barrier for 1996; Blank, 2000; Barbu, 2004; Ozdowska, 2004).
adding new languages pairs for these applications. The Gaussier (1995) approach is based on a statistical
integration of automatic building of bilingual lexicons model to establish the French and English word
improves not only cost-efficiency but also accuracy. Word associations. It uses the dependence properties
alignment approaches are generally used to construct between words and their translations. Ozdowska
bilingual lexicons (Melamed, 2001). (2004) approach consists in matching words regards
to the whole corpus, using the co-occurrence
In this paper, we present a hybrid approach to align simple frequencies in aligned sentences. These words are
and complex words (compound words and idiomatic used to create couples which are starting points for
expressions) from parallel text corpora. This approach the propagation of matching links by using
combines linguistic and statistical methods in order to dependency relations identified by syntactic analysis
improve word alignment results. in the source and target languages.

We present in section 2 the state of the art of aligning Machine translation systems based on IBM statistical
words from parallel text corpora. In section 3, the main models do not use any linguistic knowledge. They use
steps to prepare parallel corpora for word alignment are parallel corpora to extract translation models and they use
described; we will focus, in particular, on the linguistic target monolingual corpora to learn target language model.
processing of Arabic text. We present in section 4 single The translation model is built by using a word alignment
and multi-word alignment approaches. We discuss in tool applied on a sentence-to-sentence aligned corpus.
section 5 results obtained after aligning simple and This model can be represented as a matrix of probabilities
complex words of a part of the ARCADE II MD (Monde that relies target and source words. The Giza++ tool (Och,
Diplomatique) corpus. Section 6 concludes our study and 2003) implements this kind of approach but its
presents our future work. performance is proved only for aligning simple words.
Approaches and tools for complex words alignment are at
2. Related work experimental stage (DeNero & Klein, 2008).
There are mainly three approaches for word alignment
using parallel corpora:
3. Pre-processing the bilingual parallel
• Statistical approaches are generally based on IBM
corpus
models (Brown et al., 1993). A bilingual parallel corpus is an association of two texts in
• Linguistic approaches for simple words and two languages, which represent translations of each other.
compound words alignment use bilingual lexicons In order to use this corpus in word alignment, two

pre-processing tasks are involved on the two texts: sentences, one for each language. The sentence aligner
sentence alignment and linguistic analysis. uses a cross-language search to identify the link between
the sentence in the source language and the translated
3.1 Sentence alignment sentence in the target language (Figure 1).
Sentence alignment consists in mapping sentences of the
source language with their translations in the target
language. A number of sentence alignment approaches
have been proposed (Brown et al., 1991; Gale & Church, Arabic sentences to align
1991; Kay & Röscheisen, 1993).

Our approach to align the sentences of the bilingual


parallel corpus combines different information sources Cross-language interrogation in
(bilingual lexicon, sentence length and sentence position) French database
and is based on cross-language information retrieval
which consists in building a database of sentences of the
target text and considering each sentence of the source
text as a "query" to that database (Semmar & Fluhr, 2007). List of French
This approach uses a similarity value to evaluate whether sentences
the two sentences are translations of each other. This
similarity is computed by the comparator of the
cross-language search engine and consists in identifying
common words between source and target sentences. This Cross-language interrogation in
search engine is composed of a deep linguistic analysis, a Arabic database
statistical analysis to attribute a weight to each word of the
sentence, a comparator and a reformulator to translate the
words of the source sentence in the target language by
using a bilingual lexicon. List of Arabic
sentences
In order to refine the result of alignment, we used the following three criteria (a schematic version of this test is sketched in code right after the list):
• Number of common words between the source Check of alignment criteria
sentence and the target sentence (semantic similarity)
must be higher than 50% of number of words of the
target sentence.
• Position of the sentence to align must be in an French aligned sentences
interval of 10 compared to the position of the last
aligned sentence.
• Ratio of lengths of the target sentence and the source
Figure 1: Sentence alignment steps.
sentence (in characters) must be higher or equal than
1.1 (A French character needs 1.1 Arabic characters):
Longer sentences in Arabic tend to be translated into
3.2 Linguistic analysis
longer sentences in French, and shorter sentences The linguistic analysis produces a set of normalized
tend to be translated into shorter sentences. lemmas, a set of named entities and a set of compound
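Read as a predicate over a candidate sentence pair, the three refinement criteria above can be sketched as follows; the argument names are illustrative, and the thresholds are the ones stated in the criteria.

# Sketch of the three refinement criteria for accepting a candidate
# Arabic-French sentence alignment; the quantities are assumed to be
# computed elsewhere in the alignment pipeline.

def accept_alignment(common_words, target_word_count,
                     candidate_position, last_aligned_position,
                     target_char_length, source_char_length):
    # 1) Common words must exceed 50% of the target sentence's word count.
    semantic_ok = common_words > 0.5 * target_word_count
    # 2) The sentence to align must lie within 10 positions of the last
    #    aligned sentence.
    position_ok = abs(candidate_position - last_aligned_position) <= 10
    # 3) The target/source length ratio (in characters) must be >= 1.1.
    length_ok = target_char_length >= 1.1 * source_char_length
    return semantic_ok and position_ok and length_ok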
words with their grammatical tags. This analysis is built
using a traditional architecture involving separate
The alignment process has three steps: processing modules:
• Exact match 1-1 alignment: In this step, the similarity • A morphological analyzer which looks up each word
between the source sentence and the target sentence in a general full form dictionary. If these words are
is maximized by using the three criteria mentioned found, they are associated with their lemmas and all
above. their grammatical tags. For Arabic agglutinated
• 1-2 or 2-1 alignments: The goal of this step is to words which are not in the full form dictionary, a
attempt to merge the next unaligned sentence with the clitic stemmer was added to the morphological
previous one already aligned. To confirm 1-2 or 2-1 analyzer. The role of this stemmer is to split
alignments, we use only the first two criteria. agglutinated words into proclitics, simple forms and
• Fuzzy match 1-1 alignment: This step consists in enclitics.
aligning two sentences with a low level of similarity. • An idiomatic expressions recognizer which detects
This aligner does not use the three criteria. idiomatic expressions and considers them as single
words for the rest of the processing. Idiomatic
The parallel corpus is indexed into two databases. These expressions are phrases or compound nouns that are
two databases are composed of two sets of ordered listed in a specific dictionary. The detection of

in a general full form dictionary. If these words are found, they are associated with their lemmas and all their grammatical tags. For Arabic agglutinated words which are not in the full form dictionary, a clitic stemmer was added to the morphological analyzer. The role of this stemmer is to split agglutinated words into proclitics, simple forms and enclitics (a minimal sketch of this splitting step is given after this list).
• An idiomatic expressions recognizer which detects idiomatic expressions and considers them as single words for the rest of the processing. Idiomatic expressions are phrases or compound nouns that are listed in a specific dictionary. The detection of idiomatic expressions is performed by applying a set of rules that are triggered on specific words and tested on the left and right contexts of the trigger. These rules can recognize contiguous expressions such as "البيت الأبيض" (the White House).
• A Part-Of-Speech (POS) tagger which searches paths through all the possible tag paths using attested trigram and bigram sequences. The trigram and bigram sequences are extracted from a manually annotated (hand-tagged) training corpus. If no continuous trigram path is found, the POS tagger tries to use bigrams at the points where trigrams were not found in the sequence. If no bigrams allow completing the path, the word is left undisambiguated. The following example shows the result of the linguistic analysis after Part-Of-Speech tagging of the Arabic sentence “ا ادت اء ا اع ا !"# $% & &%'! ا*)اب ا +ن ز. /0+ 0 '0” (In Italy, the order of things has persuaded in an invisible manner a majority of voters that the time of traditional political parties was over). Each word is represented as a string (token) with its lemma and morpho-syntactic tag (token: lemma, morpho-syntactic tag):
(1) : ِ, Preposition
(2)  ‫ا‬: ِ َ ‫إ‬, Proper Noun
(3) ‫ادت‬: ‫أَد‬, Verb
(4) :  َ, Common Noun
(5) ‫ال‬: ‫ال‬, Definite Article
(6) ‫اء‬: ‫ءَ‬, Common Noun
(7) ‫ا‬: َ‫إ‬, Preposition
(8) ‫ا ع‬: ‫إَْع‬, Common Noun
(9) ": # َِ", Common Noun
(10) ‫ال‬: ‫ال‬, Definite Article
(11) $%&: 'ِ %َ&, Common Noun
(12) : ِ, Preposition
(13) ( ): َ ( )َ, Common Noun
(14) *": )َ", Adverb
(15) +),: - ِ+.),, Adjective
(16) ‫ب‬: ‫ب‬, Preposition
(17) ‫أن‬: ‫أَنّ‬, Conjunction
(18) 2,‫ز‬: 2,َ ‫ز‬, Common Noun
(19) ‫ال‬: ‫ال‬, Definite Article
(20) ‫اب‬45‫ا‬: ‫ِ4ْب‬5, Common Noun
(21) ‫ال‬: ‫ال‬, Definite Article
(22)  67(8: - ‫(َ8ِ7ْ6ِي‬, Adjective
(23) 6 : 6َ, Preposition
(24) :7;: َ:َ7;, Verb
(25) < =: >َ &, Common Noun
(26) ?: ?, Pronoun

• A syntactic analyzer which is used to split the graph of words into nominal and verbal chains and to recognize dependency relations by using a set of syntactic rules. We developed a set of dependency relations to link nouns to other nouns, a noun with a proper noun, a proper noun with a post-nominal adjective and a noun with a post-nominal adjective. These relations are restricted to the same nominal chain and are used to compute compound words. For example, in the nominal chain “12‫ ا‬3'#” (water transportation), the syntactic analyzer considers this nominal chain as a valid compound word “1+_3'#” composed of the words “3'#” (transportation) and “1+” (water).
• A named entity recognizer which uses name triggers to identify named entities. For example, the expression “الأول من مارس” (The first of March) is recognized as a date and the expression “قطر” (Qatar) is recognized as a location.
• A module to eliminate empty words, which consists in identifying words that should not be used as search criteria and removing them. These empty words are identified using only their Part-Of-Speech tags (such as prepositions, articles, punctuation marks and some adverbs). For example, the preposition “ل” (for) in the agglutinated word “3'” (for transportation) is considered as an empty word.
• A module to normalize words by their lemmas. When a word has several lemmas, only one of these lemmas is taken as the normalization. Each normalized word is associated with its morpho-syntactic tag. For example, the normalization of the word “أَنَابِيب” (pipelines), which is the plural of the word “أُنْبُوب” (pipeline), is represented by the couple (أُنْبُوب, Noun).
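To make the clitic stemming step mentioned above concrete, the sketch below splits an agglutinated token into a proclitic, a simple form and an enclitic by checking the remaining core against a full form dictionary. It is only an illustration: the toy dictionary, the (incomplete) clitic inventories and the Buckwalter-style transliterations are our assumptions, not the analyzer's actual resources.

```python
# Minimal sketch of clitic splitting for agglutinated Arabic words.
# Dictionary and clitic inventories are illustrative stand-ins (Buckwalter-style
# transliteration), not the analyzer's real resources.
FULL_FORM_DICT = {"nql", "byt", "ktAb"}
PROCLITICS = ["", "w", "f", "b", "k", "l", "Al", "wAl", "bAl"]
ENCLITICS = ["", "h", "hA", "hm", "k", "y", "nA"]

def split_clitics(token):
    """Return all (proclitic, simple form, enclitic) splits whose core is a known full form."""
    splits = []
    for pro in PROCLITICS:
        if not token.startswith(pro):
            continue
        rest = token[len(pro):]
        for enc in ENCLITICS:
            if enc and not rest.endswith(enc):
                continue
            core = rest[:len(rest) - len(enc)] if enc else rest
            if core and core in FULL_FORM_DICT:
                splits.append((pro, core, enc))
    return splits

print(split_clitics("lnql"))     # [('l', 'nql', '')]   e.g. "for transportation"
print(split_clitics("ktAbhA"))   # [('', 'ktAb', 'hA')] e.g. "her book"
```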
4. Word alignment
Our approach to align simple and complex words adapts and enriches the methods developed by:
• Debili & Zribi (1996) and Bisson (2000), which consist in using, on the one hand, a bilingual lexicon and the linguistic properties of named entities and cognates to align simple words and, on the other hand, syntactic dependency relations to align complex words.
• Giguet & Apidianaki (2005), which consists in using sequences of words repeated in the bilingual corpus and their occurrences to align compound words and idiomatic expressions.

4.1 Single-word alignment
Single-word alignment is composed of the following steps:
• Alignment using the existing bilingual lexicon.
• Alignment using the detection of named entities.
• Alignment using grammatical tags of words.
• Alignment using Giza++.

4.1.1. Bilingual lexicon look-up
Alignment using the existing bilingual lexicon consists in extracting, for each word of the source sentence, the appropriate translation from the bilingual lexicon. The result of this step is a list of lemmas of source words for which one or more translations were found in the bilingual lexicon. The Arabic to French lexicon used in this step contains 124,581 entries.

Table 1 shows the results of this step for the Arabic sentence “ا ادت اء ا اع ا !"# $% & &%'! ا*)اب ا +ن ز. /0+ 0 '0” and its French translation “En Italie, l'ordre des choses a persuadé de manière invisible une majorité d'électeurs que le temps des partis traditionnels était terminé”.

Lemmas of words of the source sentence | Translations found in the bilingual lexicon
ْء  | chose
9ِ َِ | majorité
;َِ# | électeur
َ'ِ0 | manière
َ+َ‫ | ز‬temps
*)ْب | parti
ّ&ِِي%ْ'Aَ | traditionnel

Table 1: Single-word alignment with the existing bilingual lexicon.
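As a rough illustration of this look-up step, the sketch below keeps, for each source lemma, the lexicon translations that actually occur in the target sentence. The toy lexicon and the transliterated lemma keys are invented for the example; the real resource is the 124,581-entry Arabic-French lexicon mentioned above.

```python
# Sketch of the bilingual-lexicon look-up step. The lexicon and the transliterated
# Arabic lemmas are illustrative stand-ins for the paper's 124,581-entry resource.
LEXICON = {
    "Hzb": ["parti"], "zmn": ["temps"], "nAxb": ["électeur"],
    "Tryqp": ["manière", "méthode"], "tqlydy": ["traditionnel"],
}

def lexicon_alignment(source_lemmas, target_lemmas):
    """Return, for each source lemma, the lexicon translations found in the target sentence."""
    target = set(target_lemmas)
    links = {}
    for lemma in source_lemmas:
        candidates = [t for t in LEXICON.get(lemma, []) if t in target]
        if candidates:
            links[lemma] = candidates
    return links

print(lexicon_alignment(["Hzb", "tqlydy", "zmn"], ["temps", "parti", "traditionnel"]))
# {'Hzb': ['parti'], 'tqlydy': ['traditionnel'], 'zmn': ['temps']}
```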
4.1.2. Named entities detection
For those words that are not found in the bilingual lexicon, the single-word aligner searches for named entities present in the source and target sentences. For example, for the previous Arabic sentence and its French translation, the single-word aligner detects that the Arabic word “ََِِ ‫( ”إ‬Italy) and the French word “Italie” are named entities of the type “Location”. However, this first step can produce alignment errors when the source and target sentences contain several named entities. To avoid these errors, we added a criterion related to the position of the named entity in the sentence.
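One possible reading of this step is sketched below: entities are paired only within the same type and, when a type occurs several times, the pair with the closest relative position in the two sentences is preferred. The data structures and the exact position test are assumptions made for the illustration, not the aligner's actual interface.

```python
# Sketch of named-entity alignment with the position criterion described above.
# Entities are (surface form, type, token position); structures are illustrative.
def align_named_entities(src_entities, tgt_entities, src_len, tgt_len):
    links, used = [], set()
    for s_form, s_type, s_pos in src_entities:
        candidates = [
            (abs(s_pos / src_len - t_pos / tgt_len), t_form)
            for t_form, t_type, t_pos in tgt_entities
            if t_type == s_type and t_form not in used
        ]
        if candidates:
            _, best = min(candidates)      # closest relative position wins
            links.append((s_form, best))
            used.add(best)
    return links

print(align_named_entities([("ITAlyA", "Location", 1)], [("Italie", "Location", 1)], 14, 20))
# [('ITAlyA', 'Italie')]
```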
4.1.3. Grammatical tags matching
If, for a given word, no translation is found in the bilingual lexicon and no named entities are present in the source and target sentences, the single-word aligner tries to use the grammatical tags of the source and target words. This is especially the case when the word to align is surrounded by words that are already aligned. For example, because the grammatical tags of the words “َِ َ” and “ordre” are the same (Noun) and “َِ َ” is surrounded by the words “اإ ََِِ” and “ْء ” which were already aligned in the two previous steps, the single-word aligner considers that the lemma “ordre” is the translation of the lemma “َِ َ”.
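The sketch below is one way to implement this heuristic: an unaligned source word is linked to an unaligned target word with the same part of speech when the source word's neighbours have already been aligned by the previous steps. The sentence representation (lists of (lemma, POS) pairs) and the neighbourhood test are simplifying assumptions.

```python
# Sketch of grammatical-tag matching: link an unaligned source word to an unaligned
# target word with the same POS tag when the source word's neighbours are already aligned.
def tag_matching(source, target, links):
    aligned_targets = set(links.values())
    for i, (s_lemma, s_pos) in enumerate(source):
        if s_lemma in links:
            continue
        neighbours = [source[j][0] for j in (i - 1, i + 1) if 0 <= j < len(source)]
        if not neighbours or not all(n in links for n in neighbours):
            continue                       # require already-aligned context around the word
        for t_lemma, t_pos in target:
            if t_pos == s_pos and t_lemma not in aligned_targets:
                links[s_lemma] = t_lemma
                aligned_targets.add(t_lemma)
                break
    return links

links = {"ITAlyA": "Italie", "$y'": "chose"}
source = [("ITAlyA", "ProperNoun"), ("trtyb", "Noun"), ("$y'", "Noun")]
target = [("Italie", "ProperNoun"), ("ordre", "Noun"), ("chose", "Noun")]
print(tag_matching(source, target, links))   # adds 'trtyb' -> 'ordre'
```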
4.1.4. Giza++ alignment
For those words that are not found in the bilingual lexicon and are not aligned by named entities detection or grammatical tags matching, the single-word aligner uses the results obtained with the Giza++ aligner on the bilingual parallel corpus. For example, Giza++ finds that the French word “persuasion” is a translation of the Arabic word “اع” even though “persuasion” does not belong to the French sentence “En Italie, l'ordre des choses a persuadé de manière invisible une majorité d'électeurs que le temps des partis traditionnels était terminé”. In addition, the Arabic word has no vowels because it is taken directly from the parallel corpus. Table 2 illustrates the results after running the four steps of single-word alignment.

Lemmas of words of the source sentence | Translations returned by single-word alignment
ََِِ ‫ | إ‬Italie
َِ َ | ordre
ْء  | chose
اع | persuasion
9ِ َِ | majorité
;َِ# | électeur
َ'ِ0 | manière
َ+َ‫ | ز‬temps
*)ْب | parti
ّ&ِِي%ْ'Aَ | traditionnel

Table 2: Single-word alignment results.
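Taken together, the four steps form a cascade in which each source word keeps the link proposed by the first (most reliable) knowledge source that covers it. The sketch below only shows this merging logic; the per-step link dictionaries are assumed to have been produced beforehand (the GIZA++ links, for instance, read from a precomputed alignment of the corpus).

```python
# Sketch of the single-word cascade: merge the links proposed by the four steps,
# giving priority to the earlier (more reliable) sources. Inputs are dicts from
# source lemma to target lemma, however each step produced them.
def merge_single_word_links(lexicon_links, ne_links, tag_links, giza_links):
    links = {}
    for step_links in (lexicon_links, ne_links, tag_links, giza_links):
        for src, tgt in step_links.items():
            links.setdefault(src, tgt)     # keep the first proposal for each source word
    return links

print(merge_single_word_links(
    {"zmn": "temps"}, {"ITAlyA": "Italie"}, {"trtyb": "ordre"}, {"IqnAE": "persuasion"}))
# {'zmn': 'temps', 'ITAlyA': 'Italie', 'trtyb': 'ordre', 'IqnAE': 'persuasion'}
```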
4.2 Multi-word alignment
The results obtained by current tools for aligning words from parallel corpora are limited either to the extraction of bilingual simple words from specialized texts or to the extraction of bilingual noun phrases from texts related to the general field. These limitations are due to the fact that the extraction of compound words is more difficult than the extraction of simple words. The following examples illustrate some difficulties encountered when aligning compound words:
• A compound word is not automatically translated by a compound word. For example, the Arabic compound word “F ِG ‫م‬DE‫ ”إ‬is translated as a single word in French, “informatique”.
• The translation of a compound word is not always obtained by translating its components separately. For example, the French translation of the Arabic compound word “ٍ&ِ&ْJ!َ ‫َْ ا‬HAَ” is not “sous le règlement” but “en cours de règlement”.
• The same compound word can have different forms due to morphological, syntactic and semantic changes. These changes must be taken into account in the alignment process. For example, the Arabic compound words “12‫=ارد ا‬+ ‫ ”إدارة‬and “/2‫=ارد ا‬2‫ ”إدارة ا‬have the same French translation “gestion des ressources en eau”.

Our multi-word alignment approach is composed of the following steps:
• Alignment of compound words that are translated literally from one language to the other.
• Alignment of idiomatic expressions and compound words that are not translated word for word.

4.2.1. Compound words alignment
Compound words alignment consists in establishing correspondences between the compound words of the source sentence and the compound words of the target sentence. First, a syntactic analysis is applied to the source and target sentences in order to extract dependency relations between words and to recognize compound word structures. Then, reformulation rules are applied to these structures to establish correspondences between the compound words of the source sentence and the compound words of the target sentence. For example, the rule Translation(A.B) = Translation(A).Translation(B) makes it possible to align the Arabic compound word “ّ&ِِي%ْ'Aَ *)ْب” with the French compound word “parti traditionnel” as follows:

Translation(ّ&ِِي%ْ'Aَ.*)ْب) = Translation(*)ْب).Translation(ّ&ِِي%ْ'Aَ) = parti.traditionnel

In the same manner, this step aligns the compound word “ْء_َِ َ” with the compound word “ordre_chose”, even if the word “ordre” is not proposed as a translation of the word “ََِ” in the bilingual lexicon.
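A compact way to read the reformulation rule is sketched below: a source compound is linked to a target compound when each of its components has a single-word translation that matches a component of the target compound, regardless of component order. The data structures and the order-insensitive comparison are assumptions made for the illustration.

```python
# Sketch of the literal rule Translation(A.B) = Translation(A).Translation(B):
# a source compound aligns with a target compound when the component-wise translations
# (taken from the single-word links) cover exactly the target components, in any order.
def align_compound(source_compound, target_compounds, word_links):
    translated = [word_links.get(w) for w in source_compound]
    if None in translated:
        return None                        # a component has no single-word translation yet
    for candidate in target_compounds:
        if sorted(candidate) == sorted(translated):
            return candidate
    return None

word_links = {"Hzb": "parti", "tqlydy": "traditionnel"}
print(align_compound(("Hzb", "tqlydy"), [("parti", "traditionnel")], word_links))
# ('parti', 'traditionnel')
```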
4.2.2. Idiomatic expressions alignment
In order to translate missed compound words and idiomatic expressions, we used a statistical approach which consists in:
• identifying the sequences of words which are candidates for alignment: for the two texts of the bilingual corpus, we compute the sequences of repeated words and their number of occurrences;
• representing these sequences with vectors: for each sequence, we record the numbers of the segments in which the sequence appears;
• aligning the sequences: for each sequence of the source text and each sequence of the target text, we estimate the value of the translation relation with the following formula:

This step results in a list of single words, compound words and idiomatic expressions of the source sentence and their translations. For example, for the previous Arabic sentence and its French translation, the multi-word aligner finds that the expression “manière invisible” is a translation of the Arabic expression “/0+ 0 '0”.
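The formula itself did not survive in this copy of the paper, so the sketch below uses a Dice-style association over the sets of segments in which two repeated sequences occur purely as a stand-in; it illustrates the vector-comparison idea, not the authors' exact measure, and the transliterated example sequences are hypothetical.

```python
# Illustrative stand-in for the translation-relation score: a Dice-style association
# between the sets of segment numbers in which a source and a target sequence occur.
def association(src_segments, tgt_segments):
    if not src_segments or not tgt_segments:
        return 0.0
    shared = len(src_segments & tgt_segments)
    return 2.0 * shared / (len(src_segments) + len(tgt_segments))

def align_repeated_sequences(src_seqs, tgt_seqs, threshold=0.8):
    """src_seqs/tgt_seqs: dicts mapping a repeated word sequence to the set of segments containing it."""
    pairs = []
    for s, s_segs in src_seqs.items():
        for t, t_segs in tgt_seqs.items():
            score = association(s_segs, t_segs)
            if score >= threshold:
                pairs.append((s, t, score))
    return pairs

print(align_repeated_sequences({"bSwrp gyr mr}yp": {3, 17, 42}},
                               {"manière invisible": {3, 17, 42, 58}}))
# [('bSwrp gyr mr}yp', 'manière invisible', 0.857...)]
```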
4.3 Cleaning the bilingual lexicon
The various approaches described in this paper to align simple and complex words use different tools for terminology extraction and dependency syntactic analysis. Each of these tools can be a source of noise because of errors that can be produced by the modules that compose them (POS tagging, lemmatization …). Therefore, these approaches inevitably produce incorrect matches between the words of the source text and the words of the target text. It thus becomes important to remove incorrect entries and retain only the correct words in the bilingual lexicons built or updated automatically by these methods.

We have established a score for each type of alignment to facilitate the cleaning process of the bilingual lexicon built or updated automatically from the parallel corpus:
• A link alignment between single words found in the bilingual corpus and validated in the bilingual dictionary has a score equal to 1.
• A link alignment between single words found by the detection of named entities (proper nouns and numerical expressions) has a score equal to 0.99.
• A link alignment between single words found by matching grammatical tags has a score equal to 0.98.
• A link alignment between single words produced by GIZA++ has a score equal to 0.97.
• A link alignment between compound words that are translated literally from one language to the other has a score equal to 0.96.
• A link alignment between compound words that are not translated word for word or idiomatic expressions has a score equal to 0.95.
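In an implementation, these confidence values can simply be kept in a small table consulted when a link is written to the lexicon; the labels chosen for the alignment types below are ours, while the values are the ones listed above.

```python
# Confidence scores attached to each type of alignment link before cleaning;
# the alignment-type labels are illustrative names, the values are the paper's.
ALIGNMENT_SCORES = {
    "bilingual_lexicon": 1.0,
    "named_entity": 0.99,
    "grammatical_tags": 0.98,
    "giza++": 0.97,
    "compound_literal": 0.96,
    "compound_non_literal_or_idiom": 0.95,
}

def keep_link(link_type, threshold):
    """Decide whether a link is reliable enough to enter the cleaned lexicon."""
    return ALIGNMENT_SCORES[link_type] >= threshold
```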
Table 3 presents the results after running all the steps of the word alignment process for simple and complex words.

Simple and complex words of the source sentence | Translations returned by word alignment | Score
ََِِ ‫ | إ‬Italie | 0.99
َِ َ | ordre | 0.98
ْء  | chose | 1
اع | persuasion | 0.97
9ِ َِ | majorité | 1
;َِ# | électeur | 1
َ'ِ0 | manière | 1
َ+َ‫ | ز‬temps | 1
*)ْب | parti | 1
ّ&ِِي%ْ'Aَ | traditionnel | 1
;َِ#_9ِ َِ | majorité_électeur | 0.96
ّ&ِِي%ْ'Aَ_*)ْب | parti_traditionnel | 0.96
ّ&ِِي%ْ'Aَ_*)ْب_َ+َ‫ | ز‬temps_parti_traditionnel | 0.96
ْء_َِ َ | ordre_chose | 0.96
/0+ 0 '0 | manière invisible | 0.95

Table 3: Single-word and multi-word alignment results.

5. Experimental results
The word aligner has been tested on the MD corpus of the ARCADE II project, which consists of news articles from the French newspaper "Le Monde Diplomatique" (Veronis et al., 2008). The corpus contains 5 Arabic texts (244 sentences) aligned at the sentence level with 5 French texts (283 sentences). The performance of the word aligner is presented in Table 4.

Precision | Recall | F-measure
0.85 | 0.80 | 0.82

Table 4: Word alignment performance.
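As a quick check, the reported F-measure is consistent with the harmonic mean of precision and recall: F = 2PR / (P + R) = (2 × 0.85 × 0.80) / (0.85 + 0.80) ≈ 0.82.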
Analysis of the alignment results for the previous sentence (Table 3) shows, on the one hand, that 10 of the 14 simple words, 4 compound words and 1 idiomatic expression are correctly aligned and, on the other hand, that 7 simple words are aligned with the bilingual lexicon, 1 simple word is aligned by named entities detection, 1 simple word is aligned by grammatical tag matching and 1 simple word is aligned with Giza++.

For the whole corpus, 53% of the words are aligned with the bilingual lexicon, 9% are aligned by named entities detection, 15% are aligned by using grammatical tags and 4% are aligned as compound words or idiomatic expressions. Consequently, 28% of the words of the source sentences and their translations are added to the bilingual lexicon.

6. Conclusion
In this paper, we have presented a hybrid approach to word alignment combining statistical and linguistic sources of information (a bilingual lexicon, named entities detection, grammatical tags, syntactic dependency relations, and the number of occurrences of word sequences). The results we obtained show that this approach improves word alignment precision and recall, and achieves a significant enrichment of the bilingual lexicon with simple and complex words. In future work, we plan to develop strategies and techniques, on the one hand, to filter the word alignment results in order to clean the bilingual lexicons built or updated automatically and, on the other hand, to improve the recall of the statistical approach by using the existing bilingual lexicon and the results of the morpho-syntactic analysis of the parallel corpus.

7. Acknowledgements
This research work is supported by the WEBCROSSLING (ANR - Programme Technologies Logicielles - 2007) and MEDAR (Support Action FP7-ICT-2007-1) projects.

8. References
Barbu, A.M. (2004). Simple linguistic methods for improving a word alignment. In Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data.
Bisson, F. (2000). Méthodes et outils pour l'appariement de textes bilingues. Thèse de Doctorat en Informatique. Université Paris VII.
Blank, I. (2000). Parallel Text Processing: Terminology extraction from parallel technical texts. Dordrecht: Kluwer.
Brown, P.F., Mercer, R.L. (1991). Aligning Sentences in Parallel Corpora. In Proceedings of ACL 1991.
Brown, P.F., Pietra, S.A.D., Pietra, V.J.D., Mercer, R.L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(3).
Daille, B., Gaussier, E., Lange, J.M. (1994). Towards automatic extraction of monolingual and bilingual terminology. In Proceedings of the 15th International Conference on Computational Linguistics (COLING'94).
Debili, F., Zribi, A. (1996). Les dépendances syntaxiques au service de l'appariement des mots. In Proceedings of the 10ème Congrès Reconnaissance des Formes et Intelligence Artificielle.
DeNero, J., Klein, D. (2008). The Complexity of Phrase Alignment Problems. In Proceedings of ACL 2008.
Gale, W.A., Church, K.W. (1991). A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics.
Gaussier, E., Lange, J.M. (1995). Modèles statistiques pour l'extraction de lexiques bilingues. Traitement Automatique des Langues 36.
Giguet, E., Apidianaki, M. (2005). Alignement d'unités textuelles de taille variable. In Proceedings of the 4èmes Journées de la Linguistique de Corpus.
Kay, M., Röscheisen, M. (1993). Text translation alignment. Computational Linguistics, Special issue on using large corpora, Volume 19, Issue 1.
Melamed, I.D. (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press.
Och, F.J. (2003). GIZA++: Training of statistical translation models. http://www.fjoch.com/GIZA++.htm.
Ozdowska, S. (2004). Appariement bilingue de mots par propagation syntaxique à partir de corpus français/anglais alignés. In Proceedings of the 11ème conférence TALN-RECITAL.
Smadja, F., McKeown, K., Hatzivassiloglou, V. (1996). Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics 22(1).
Semmar, N., Fluhr, C. (2007). Arabic to French Sentence Alignment: Exploration of a Cross-language Information Retrieval Approach. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources.
Veronis, J., Hamon, O., Ayache, C., Belmouhoub, R., Kraif, O., Laurent, D., Nguyen, T.M.H., Semmar, N., Stuck, F., Zaghouani, W. (2008). Arcade II : Action de recherche concertée sur l'alignement de documents et son évaluation. Chapitre 2, Editions Hermès.