13:50-15:10 Poster Session 2: Morphology & NLP Applications and NLP Tools
Techniques for Arabic Morphological Detokenization and Orthographic Denormalization
Ahmed El Kholy and Nizar Habash
Center for Computational Learning Systems, Columbia University, USA
Tagging Amazigh with AncoraPipe
Mohamed Outahajala (1), Lahbib Zenkouar (2), Paolo Rosso (3) and Antònia Martí (4)
(1) IRCAM,
(2) Mohammadia School of Engineers, Med V University Rabat, Morocco,
(3) Natural Language Engineering Lab. - ELiRF, Universidad Politécnica Valencia, Spain,
(4) CLiC - Centre de Llenguatge i Computació, Universitat de Barcelona, Barcelona, Spain
Verb Morphology of Hebrew and Maltese - Towards an Open Source Type Theoretical
Resource Grammar in GF
Dana Dannélls (1) and John J. Camilleri (2)
(1) Department of Swedish Language, University of Gothenburg, Sweden;
(2) Department of Intelligent Computer Systems, University of Malta, Malta
Syllable Based Transcription of English Words into Perso-Arabic Writing System
Jalal Maleki
Dept. of Computer and Information Science, Linköping University, Sweden
COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Al Tantawy and Yassine Benajiba
Center for Computational Learning Systems, Columbia University, USA
A Linguistic Search Tool for Semitic Languages
Alon Itai
Knowledge Center for Processing Hebrew, Computer Science Department, Technion, Haifa,
Israel
Editors & Workshop Chairs
Workshop general chair:
Khalid Choukri, ELRA/ELDA, Paris, France
Workshop co-chairs:
Owen Rambow, Columbia University, New York, USA
Bente Maegaard, University of Copenhagen, Denmark
Ibrahim A. Al-Kharashi, Computer and Electronics Research Institute, King Abdulaziz City for
Science and Technology, Saudi Arabia
Table of Contents
Structures and Procedures in Arabic Language
André Jaccarini, Christian Gaubert, Claude Audebert 1
Developing and Evaluating an Arabic Statistical Parser
Ibrahim Zaghloul and Ahmed Rafea 7
A Dependency Grammar for Amharic
Michael Gasser 12
A syllable-based approach to Semitic verbal morphology
Lynne Cahill 19
Using the Yago ontology as a resource for the enrichment of Named Entities in Arabic WordNet
Lahsen Abouenour, Karim Bouzoubaa and Paolo Rosso 27
Light Morphology Processing for Amazighe Language
Fadoua Ataa Allah and Siham Boulaknadel 32
Using Mechanical Turk to Create a Corpus of Arabic Summaries
Mahmoud EL-Haj, Udo Kruschwitz and Chris Fox 36
DefArabicQA: Arabic Definition Question Answering System
Omar Trigui, Lamia Hadrich Belguith and Paolo Rosso 40
Techniques for Arabic Morphological Detokenization and Orthographic Denormalization
Ahmed El Kholy and Nizar Habash 45
Tagging Amazigh with AncoraPipe
Mohamed Outahajala, Lahbib Zenkouar, Paolo Rosso and Antònia Martí 52
Verb Morphology of Hebrew and Maltese - Towards an Open Source Type Theoretical Resource Grammar in GF
Dana Dannélls and John J. Camilleri 57
Syllable Based Transcription of English Words into Perso-Arabic Writing System
Jalal Maleki 62
COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Al Tantawy and Yassine Benajiba 66
A Linguistic Search Tool for Semitic Languages
Alon Itai 75
Algerian Arabic Speech database Project (ALGASD): Description and Research Applications
Ghania Droua-Hamdani, Sid Ahmed Selouani and Malika Boudraa 79
Integrating Annotated Spoken Maltese Data into Corpora of Written Maltese
Alexandra Vella, Flavia Chetcuti, Sarah Grech and Michael Spagnol 83
A Web Application for Dialectal Arabic Text Annotation
Yassine Benajiba and Mona Diab 91
Towards a Psycholinguistic Database for Modern Standard Arabic
Sami Boudelaa and William David Marslen-Wilson 99
Creating Arabic-English Parallel Word-Aligned Treebank Corpora
Stephen Grimes, Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma and Stephanie Strassel 102
Using English as a Pivot Language to Enhance Danish-Arabic Statistical Machine Translation
Mossab Al-Hunaity, Bente Maegaard and Dorte Hansen 108
Using a Hybrid Word Alignment Approach for Automatic Construction and Updating of Arabic to French Lexicons
Nasredine Semmar 114
Author Index
Lahsen Abouenour 27
Mohamed Al Tantawy 66
Mossab Al-Hunaity 108
Fadoua Ataa Allah 32
Claude Audebert 1
Yassine Benajiba 66,91
Ann Bies 102
Sami Boudelaa 99
Malika Boudraa 79
Siham Boulaknadel 32
Karim Bouzoubaa 27
Lynne Cahill 19
John J. Camilleri 57
Flavia Chetcuti 83
Dana Dannélls 57
Mona Diab 66,91
Ghania Droua-Hamdani 79
Ahmed El Kholy 45
Mahmoud EL-Haj 36
Chris Fox 36
Michael Gasser 12
Christian Gaubert 1
Sarah Grech 83
Stephen Grimes 102
Nizar Habash 45,66
Lamia Hadrich Belguith 40
Dorte Hansen 108
Alon Itai 75
André Jaccarini 1
Udo Kruschwitz 36
Seth Kulick 102
Xuansong Li 102
Xiaoyi Ma 102
Bente Maegaard 108
Jalal Maleki 62
William David Marslen-Wilson 99
Antònia Martí 52
Mohamed Outahajala 52
Ahmed Rafea 7
Owen Rambow 66
Paolo Rosso 27,40,52
Sid Ahmed Selouani 79
Nasredine Semmar 114
Michael Spagnol 83
Stephanie Strassel 102
Omar Trigui 40
Alexandra Vella 83
Ibrahim Zaghloul 7
Lahbib Zenkouar 52
Structures and Procedures in Arabic Language
André Jaccarini, Christian Gaubert*, Claude Audebert
Maison méditerranéenne des sciences de l'homme (MMSH)
5 rue du Château de l'Horloge, BP 647, 13094 Aix-en-Provence, France
* Institut français d'archéologie orientale du Caire (IFAO)
37 el Cheikh Aly Yousef Str., Cairo, Egypt
E-mail: jaccarini@mmsh.univ-aix.fr, cgaubert@ifao.egnet.net, claude.audebert@gmail.com
Abstract
In order to demonstrate the efficiency of the feedback method for constructing finite machines (automata and transducers) applied to Arabic, and to exhibit the algebraic characterization of the language in the tradition of the Schützenberger school, we have chosen applications linked to several domains: morphological analysis (with or without a lexicon), syntactic analysis, and the construction of operators for information retrieval and noise filtering.
A data bank of finite machines can only be efficient if it is integrated in a computational environment that allows these operators (which are fragments of operational grammars) to be extracted and combined in order to synthesize new operators as needed.
We have developed a software tool called Sarfiyya for the manipulation of Arabic automata, and with it we constructed an extractor of quotations and reported discourse. The evaluation of this automaton will be available online. Sarfiyya was entirely written in Java, which allowed the creation of a Web-based application called Kawâkib offering, among other functions, root extraction and tool-word detection. We are now heading towards content analysis and text characterization.
1. General presentation¹
One of the important ideas of our work is that Arabic, and Semitic languages in general, have a particularly high degree of surface "algorithmicity/grammaticalness". We should, however, clarify the relation "procedure/structure" (see §2 below). The structural characteristics of the Arabic language are thus also algorithmic. This duality is easily translatable within the framework of the algebraic theory of automata and offers extremely interesting and easily specifiable applicative prospects (the correspondence works in both directions; see §2 below).
A certain deficit of mathematical specification is, indeed, one of the characteristics of the current state of automatic processing of Arabic. The theoretical unification made possible by the algebraic theory of automata seems to us particularly well suited to anchoring Arabic studies firmly in the universe of knowledge which is ours today, namely that of Turing machines. We thus seek to place the automatic processing of Arabic in the algebraic tradition which, in computer science, was initiated above all by M. P. Schützenberger and his school.
In order to support our assumption that strong algorithmicity/grammaticalness is an important specificity of Arabic, we have shown in our preceding studies (Audebert, Jaccarini 1994; Jaccarini 1997; Gaubert 2001) that the construction of parsers requiring only minimal recourse to the lexicon, and in certain cases completely avoiding the lexicon, does not cause an explosion of ambiguities. It is noted indeed that this passage to the limit does not present a significant difference, with regard to the ambiguity coefficients of the forms considered, compared to parsers which systematically resort to a lexicon (see for instance the DIINAR project of the University of Lyon). The context, i.e. syntax, which is indeed much more interesting at this level, has led us to study in a formal way the passage of information between syntax and morphology.
The minimal recourse to the lexicon, and even the reduction of all lexemes to their bare patterns (the principle of the empty dictionary), is compensated by the central role which we confer on the tokens (tool words), which we regard as true operators defined by minimal finite machines (automata and transducers). These elements, which are the most constraining, coincide precisely with the fixed elements (the morphological atoms, which do not have roots: for example inna, limādha, etc.). This coincidence has a simple explanation: the cardinality of their syntactic congruence class² is very limited (often equal to one), contrary to that of the other lexemes, which can "commutate" with other elements belonging to the same "category" (or syntactic congruence class) without calling into question the "grammaticality" of the sentence, nor its type (for the relationship between the syntactic congruence and the congruence induced by the patterns, refer to "Algorithmic Approach of Arabic Grammar", first chapter "Morphological System and Syntactic Monoid", to be published; a summary of the chapter is available on the theoretical website associated with this article, henceforth automatesarabes). An algebraic characterization of the tokens is given there: they are the "lexical" invariants of the projection of the language onto its skeleton (in other words, invariants of the projection L → L/RAC).

¹ The authors thank Eva Saenz Diez for having read this text. Without her active participation, this work would not have been completed.
² A syntactic congruence class consists of all the words that can permute in a sentence without calling into question its grammaticality.
The term "token" was selected with reference to what the designers of programming languages designate by this term: strings of symbols which must be regarded as fixed (for example BEGIN, GOTO, LOOP, etc.) and which naturally induce particular "expectations". This approach was thus initially presented as a "grammar of expectations" ("grammaire des attentes"; see Audebert, Jaccarini, 1986); this name seems perfectly appropriate at the level of didactics and cognition, and we therefore use it in all contexts where it is not likely to produce any ambiguity.
The "tokens" can be considered, to some extent, as the "lexemes" of the quotient language. By choosing this term our first concern was simply to draw attention to the analogy with formal languages. The object L/RAC is a semi-formal language obtained by projection from a natural language. Its particularity is to have a very limited lexicon. It is indeed this fact that seems to us the most characteristic of the Semitic system (strong grammaticalness/algorithmicity), of which the Arabic language is currently the richest, best known and most widely spoken representative.
The assumption of the construction of a syntactic monitor, whose theoretical goal is the study and modeling of the interactions between syntax and morphology, and whose practical aim is the short-circuiting of superfluous operations in locating the structure of the sentence, remains the long-term objective which will guide the continuation of this research.
Grammars are not static; regarded as particular points of view on the language, they can be derived by transformation from a core which is itself not fixed. These points of view, i.e. these grammars, can be connected to each other. The question of their adequacy with respect to a given objective then arises: the grammar of an orthographical checker is not the same as that of a program for learning Arabic or of an information extractor.
The morpho-syntactic analysis of Arabic which we propose (see bibliography) constitutes a framework and a general tool able to serve a whole set of applications. The development of this approach requires a thorough study of the tokens, or tool words. This study results in the constitution of morphosyntactic grammars, conceived as operational building bricks to be combined in order to synthesize search procedures.
Such grammars, built from finite automata and transducers, can also be used to detect various types of sentences, for example conditional sentences, relations of causality or other discursive relations, or to extract reported speech. These topics are fundamental for information extraction: coupled with the search for collocations (non-fortuitous co-occurrences), and in interaction with the morpho-syntactic analyzer, they are the source of many applications. We presented at MEDAR 09 (Audebert, Gaubert, Jaccarini, 2009) examples of the construction of such operational grammars, at the level of morphological and syntactic analysis and even at the level of information extraction. As a simple illustrative example, we introduced an operator for the extraction of reported speech, built from finite machines (automata and transducers). This machine, which we made deterministic by a computation requiring several hours but carried out only once, its result then being stored in memory once and for all (the final automaton contains nearly 10,000 states!), makes it possible to extract, in a reasonable time (which will be improved later on), all the quotations from a text of almost 30,000 words with a success rate of almost 80%, if one disregards silences and noise mainly due to anomalies of punctuation and a lack of standardization which it will be more or less possible to mitigate in the future. The success rate and processing times show that the feasibility operation was a success (see the table reproduced on automatesarabes). An evaluation is thus provided, which confirms that our formal, indeed algebraic, considerations are necessary to make the theoretical framework coherent; without it one is likely to fall into a pragmatism which ends up becoming sterile. At the present time, we are very conscious of the disadvantages of ad hoc programming.
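As an illustration of the idea that tool words are defined by minimal finite machines, the following sketch builds a small trie-shaped deterministic recognizer for a handful of tokens and scans a transliterated text for them. The word list, the transliteration and the code itself are assumptions made for the example; they are not taken from Sarfiyya or Kawâkib.

# Illustrative sketch only: a tiny deterministic automaton (a trie walked
# character by character) that recognizes a few Arabic tool words (tokens)
# given here in a simple Latin transliteration. The word list and the
# transliteration scheme are assumptions made for this example.

TOOL_WORDS = {"inna", "limadha", "ala", "inda", "lam", "lan"}

def build_dfa(words):
    """Build a trie-shaped DFA: each state is a dict mapping a character to
    the next state; a state is accepting if it carries the '' key."""
    root = {}
    for w in words:
        state = root
        for ch in w:
            state = state.setdefault(ch, {})
        state[""] = True          # mark accepting state
    return root

def find_tokens(text, dfa):
    """Return (position, token) pairs for every tool word occurring in text,
    matching word by word (whitespace tokenization, again an assumption)."""
    hits = []
    for i, word in enumerate(text.split()):
        state = dfa
        for ch in word:
            if ch not in state:
                state = None
                break
            state = state[ch]
        if state is not None and state.get(""):
            hits.append((i, word))
    return hits

if __name__ == "__main__":
    dfa = build_dfa(TOOL_WORDS)
    sample = "qala inna al-kitab inda al-walad"
    print(find_tokens(sample, dfa))   # -> [(1, 'inna'), (3, 'inda')]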
2. Advantages of the representation of Arabic by finite machines
2.1 The transfer from structures to procedures (Arabic language <--> automata)
This transfer is carried out in both directions. It is indeed possible, on the theoretical level, to establish a formal link between the structural and algebraic properties by which we characterized the Arabic system, on the one hand, and the automata and transducers on which we based our algorithmic approach to Arabic, on the other hand. These automata and transducers constitute a class of machines equivalent to Turing machines, on which we nevertheless impose restrictions and conditions of minimality organized in a rigorous hierarchy.
The Arabic structural properties that we propose relate to the commutation of the operators of "categorization" and "projection". Categorization amounts to building a quotient set made up of the "syntactic congruence classes" over the free monoid on the symbols of the initial vocabulary (the formal words, in the case of syntax), i.e. the "syntactic categories". The congruence relation involved formalizes the distributional principle of linguists (Bloomfield): two words can be regarded as equivalent if and only if they can commutate without affecting the grammaticality of the sentence. Projection amounts to reducing each word to its pattern, which also induces a partition into congruence classes: the concatenations (and, reciprocally, the segmentations) remain invariant even when the roots are changed. This last congruence is compatible with the preceding one, in the sense that the syntactic congruence is the coarsest (least fine) relation that can be defined on the monoid built from Arabic graphemes and that saturates the set of Arabic graphic words. A corollary is that any pattern, considered here as a given class of words and not as an operator generating them (duality structure-procedure), is necessarily located inside a syntactic category within the set of Arabic graphic words. We place the modeling of Arabic morpho-syntax within the general framework of a principle of invariance derived from the previous Arabic morphographic property, which is obvious at the level of morphography (invariance under change of root), by generalizing it to syntax, namely:
1. establish syntactic classes which are then partitioned into patterns,
2. or, on the contrary, establish patterns which can then be gathered into syntactic classes.
The result is the same.
Let Π and SC respectively denote the canonical homomorphisms associated with the two congruences considered above: the one associated with the patterns and the syntactic congruence (projection and categorization). The principle of invariance can then be expressed in an even more concise way, as the commutation of these two "operators":
Π.SC = SC.Π
Following this principle, we categorize in such a way that the construction of the grammar is not affected by the operation which consists in reducing the language to its paradigms (patterns + tokens). The possibility of building a computer program functioning without a lexicon is only one consequence of this property, according to which it should be indifferent whether one first categorizes and then projects, or vice versa.
Furthermore, the definition of automata, like that of Turing machines, can appear somewhat contingent - yet it is quite astonishing that such a rough mechanism can represent all computations that can be carried out on a machine, and even that it can (by the universal enumeration theorem) simulate any computer or set of computers (including the Internet!). Automata with two memory stacks (we do not need a larger number of them, which is itself a remarkable property) are equivalent to these machines. These automata are founded on machines of lower complexity, without memory stacks: the finite-state machines, whose definition can cause the same feeling of "uneasiness" mentioned above about Turing machines, and at the same time amazement at the fact that such an elementary mechanism can generate such complex configurations.
Adopting a more abstract, algebraic viewpoint allows us at the same time
1. to avoid this uneasiness of contingency and
2. to give a meaning to the extension of the principle of invariance from the linguistic level to the computational level, thus unifying the theoretical framework while offering extremely interesting practical prospects.
Indeed, computing the transition monoid M(L) of the language L amounts to building the minimal deterministic automaton directly accepting this language. The development of an example of syntax with this type of calculation (taken from "Linguistic Modeling and Automata Theory") can be found on the automatesarabes website. This illustration offers a theoretical interest (reducing a possibly infinite set of sentences to a finite number of configurations) as well as a practical one (the "automatic" construction of the corresponding minimal deterministic automaton).
The automaton corresponding to the study of David Cohen (Cohen, 1970) will be rebuilt using this same method (which leads to an automaton of 13 states and 102 transitions) while following an "entirely automated", or rather "automatable", chain.
Any sequence of a language can indeed be regarded as a mapping of an initial segment of N into itself, and saying that a language is recognizable by a finite-state automaton is in fact equivalent to defining a congruence on this language whose set of classes is finite.
The theorems which explicitly establish the links between the concepts of syntactic monoid, congruence and the traditional concept of automaton, as we use them for our analysis of Arabic, also appear on automatesarabes.
In conclusion, the syntactic monoid, with which is associated a minimal deterministic automaton able to recognize the language, can be produced thanks to a transducer³. This transition (= syntactic) monoid can be obtained automatically.

³ We had programmed it in Lisp; the question of its restoration is posed today in terms of opportunity, utility, schedule, working time, etc. For the moment this task is not a priority. It is also possible to enhance this (minimal) transducer in order to determine the basic relations which, together with the generators, define the transition monoid (isomorphic to the syntactic monoid). It can indeed be interesting to be able to define an infinite language (determined or undetermined nominal group, conditionals, etc.) by a small set of equalities of limited length relating to this language, rather than by rewriting rules. For example, in the example given on the site (a small subset of the determined nominal group), in order to check that an arbitrary sequence belongs to this language, it is enough to examine its subsequences of length 3.
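For readers who prefer standard notation, the notions used in this section can be restated compactly as follows; the formulation is ours and only paraphrases the definitions given in the text (syntactic congruence, projection onto patterns, and the classical equivalence between recognizability and a finite-index congruence).

% Our restatement, in standard notation, of the notions used in Section 2.1.
\[
  u \sim_L v \;\iff\; \forall x,y \in \Sigma^{*}:\; xuy \in L \Leftrightarrow xvy \in L
  \qquad\text{(syntactic congruence)}
\]
\[
  \Pi \cdot \mathrm{SC} \;=\; \mathrm{SC} \cdot \Pi
  \qquad\text{(principle of invariance: projecting then categorizing equals categorizing then projecting)}
\]
\[
  L \text{ is recognizable by a finite automaton}
  \;\iff\; \sim_L \text{ has finitely many classes}
  \;\iff\; M(L)=\Sigma^{*}/\!\sim_L \text{ is finite.}
\]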
2.2 Automatic vocalisation and transduction
This second point deserves to be singled out, given its importance. The standard writing of Arabic is a kind of shorthand: the short vowels are not noted, which considerably increases ambiguity and the difficulty of reading. Moreover, cases are often marked by short vowels, when they are marked at all, and their "calculation" is not always easy⁴.
Our purpose is not to discuss the relevance of this written form but to note the phenomenon, to try to measure its consequences in terms of ambiguity, and to provide objective arguments to the supporters of the two camps: on one side the "Arabists", whose transliteration systems generally leave no room for error (short vowels having the same status as the full consonants or the three long vowels); on the other the usual system used by Arabs, which leaves a certain latitude⁵ and which seems to suggest - a fact corroborated by outside experiments, but which remains to be examined more deeply - that reading without diacritics makes it possible, in a first pass, to perceive the overall meaning of the sentence, a finer syntactic analysis involving backtracking then resolving the ambiguity in a second pass. Nevertheless these assumptions must be evaluated.
The system of transduction based on the underlying automata, with which we can associate the "semantic attributes" of Knuth's attribute grammars (see automatesarabes), or "syntax-directed schemes" (Aho, Sethi and Ullman, 1986), with their "synthesized" and "inherited" attributes, is particularly well adapted to this task (a dominant linear flow, "blocked", however, by backtracking, which can in the most dramatic cases present vicious circles (deadlocks), i.e. impossibilities of vocalization - irreducible ambiguities which are cases to be studied in their own right). The synthesized attributes are values propagated from the bottom to the top of the tree representing the structure of the sentence (one says that the tree is decorated, or that a "semantic" tree is associated with it), and the inherited attributes are those propagated from top to bottom. Transposed to the reading flow, this means that there are values (here we are interested in the short vowels) which are "synthesized" progressively as the reading head advances, whereas certain lexemes can only acquire their final value (vocalization, meaning, ...) by a retroactive effect, once the complete reading of the sentence has been accomplished. Knuth studied the cases of vicious circles and developed (1968) an algorithm to avoid them. In the case of impossibility, one finds oneself in the situation well known to the computer scientist as "deadlock", which occurs when two processes are each waiting for the other. It is an intrinsic ambiguity⁶.
In "Algorithmic Approach of the Arabic Grammar" (see automatesarabes), we presented an ambiguous morphological transducer functioning word by word (case vowels are not taken into account, since the connection with syntax was not implemented⁷). The ambiguity coefficients vary from 1 (in a significant number of cases⁸) up to 12. It is obvious that a connection with syntax is necessary, not only to lower the level of ambiguity but also to be able to vocalize the ends of words.
Such tools, which are to be reprogrammed, can already have extremely interesting applications. The writer of an Arabic text can be informed in real time of the level of ambiguity of the form just entered, and be offered a certain number of (total or partial) solutions to reduce the ambiguity, adapted to the level of the user (tutorial), simply by clicking. The fundamental technology already exists; all the rest is only a question of ergonomics and interface, which in this field is fundamental. It goes without saying that such a tool could be improved and made to evolve by the introduction of syntax, but also by training.
The conceptual tool (the interactive vocalisation transducer) would obviously be of greater interest for answering the question raised at the beginning of this paragraph, namely to try to measure, or rather to discuss scientifically, the relevance of the two viewpoints: the Arabists' conception and the Arab one, to put it concisely. It would have been difficult to discuss this question of relevance scientifically without recourse to transducers functioning letter by letter and interacting with the highest-level "director" automata: the syntactic automata.

2.3 Transparency of the analyzers
The transparency of the analyzers, which can be entirely specified mathematically, offers essential advantages that we will only mention here: proofs of programs, measurements of complexity and, last but not least, the possibility of establishing relevant similarities with the natural process of apprehension (cognitive-algorithmic parallelism).

⁴ The great grammarian Sībawayh quotes an irreducible example of ambiguity, rather like a joke, which also has the merit of drawing attention to the fact that in literary Arabic the place of words in the sentence is relatively free, this freedom being compensated by a richer morpho-casual "marking". The example is the following: Akala (ate) IssA (Jesus) MoussA (Moses); one cannot know whether it is Jesus who ate Moses or the reverse, given the phonological incompatibility of the mark of the direct case (short vowel u) with the last phoneme of IssA or MoussA. The ambiguity naturally remains the same under permutation of IssA and MoussA (the mark of the indirect case being short a).
⁵ This use of the standard writing by the Arabs for centuries, as well as the organization of their dictionaries, naturally suggests that they perceived (and continue to perceive) the consonants as the elements of a "skeleton" which is the principal "support" of the meaning (more specifically the radical consonants, the others being able to belong to a pattern, inevitably discontinuous if we take into account the short vowels which necessarily appear in it and which can never be radical; the pattern in its entirety, which is a non-concatenative form, being alone (with the root) likely to carry one or more semantic values). In this remark we remain within the framework of so-called sound morphology. We only state facts and voluntarily keep away from the problem of lexical fixation.
⁶ This question also arises for the "syntactic monitor" which is supposed to optimize the morphological analysis, where we must consider the extreme case in which both the morphological and the syntactic processors are waiting for each other (irreducible ambiguity).
⁷ Some results will be available on the automatesarabes site before the publication of the book.
⁸ However no statistics were drawn up; it was a feasibility study.
3. Coming back to tokens: from syntax to semantics
Study of the syntactic operators and grammar of syntactic expectations.
1. The pivot of this research is the study, made easier by an advanced version of the tool Sarfiyya, of the grammatical expectations of all the tokens (tool words of Arabic), whose detection is already solved. For example, the operator inna implies the presence of structures having precise grammatical functions (topic, predicate) that are recognizable by the machine. Prepositions (ʿalā, ʿinda), on the other hand, have a more reduced range but can combine with higher-level tokens: a hierarchy is established between families of operators. It is necessary to formalize the syntactic behaviors and their local and global implications.
This research was started on a corpus and remains to be carried through. It is essential for the study of the syntax of Arabic and, although outlined, it has to be taken up again. The number of tokens amounts to approximately 300; they pose detection problems that are dealt with according to a certain methodology and raise, by definition, questions of syntax which the modeling must take into account.
2. This study will be coupled with that of the linguistic markers of certain discursive relations. This work consists in creating a base of automata (or transducers) that are as elementary as possible, so that their combinations allow the synthesis of new information-search functionalities. A first demonstration of the effectiveness of this method was provided (MEDAR 09). The progressive refinement of the filters and the reduction of the noise were obtained according to a precise experimental method, consisting in feeding the result provided by the machine back into the initial grammar. This feedback method (a continual coming and going between theoretical modeling and implementation) naturally presupposes a work of evaluation of grammars.
However, there are several ways of assigning a value to a grammar, according to the selected norm, which varies with the required application. The norm makes it possible to assign a value to the grammar on the basis of fixed criteria. A criterion can be essential for a given application but not very relevant for another (for example, the non-ambiguous extraction of the root is of little interest if the objective is simple spellchecking). The norm thus makes it possible to privilege certain criteria over others, according to one's needs, and so induces a hierarchy.
Inheriting its code from Sarfiyya, with some enhancements for collaborative work, the web-based application Kawâkib and its latest version Kawâkib Pro (fig. 1) are the tools we currently use to collect linguistic data connected with tool words, to parse pieces of corpus with automata and to perform measurements in this regard. It also includes tools for root searches, frequency reports, etc.
A detailed evaluation of the extractor of quotations in journalistic texts (which is extremely encouraging) will be found on automatesarabes. This experiment primes the feedback loop announced at MEDAR 09.

4. References
Automatesarabes: http://automatesarabes.net
Aho, A., Sethi, R., Ullman, J. (1986). Compilateurs. Principes, techniques et outils. French edition 1991. InterEditions.
Audebert, C., Jaccarini, A. (1986). À la recherche du Ḫabar, outils en vue de l'établissement d'un programme d'enseignement assisté par ordinateur. Annales islamologiques 22, Institut français d'archéologie orientale du Caire.
Audebert, C., Jaccarini, A. (1988). De la reconnaissance des mots outils et des tokens. Annales islamologiques 24, Institut français d'archéologie orientale du Caire.
Audebert, C., Jaccarini, A. (1994). Méthode de variation de la grammaire et algorithme morphologique. Bulletin d'études orientales XLVI, Damascus.
Audebert, C., Gaubert, C., Jaccarini, A. (2009). Minimal Resources for Arabic Parsing: an Interactive Method for the Construction of Evolutive Automata. MEDAR 09. (http://www.elda.org/medar-conference/summaries/37.html)
Audebert, C. (2010). Quelques réflexions sur la fréquence et la distribution des mots outils ou tokens dans les textes arabes en vue de leur caractérisation dans le cadre de l'extraction d'information. Annales islamologiques 43, Institut français d'archéologie orientale du Caire.
Beesley, K. R. (1996). Arabic Finite-State Morphological Analysis and Generation. COLING.
Cohen, D. (1970). Essai d'une analyse automatique de l'arabe. In: David Cohen, Études de linguistique sémitique et arabe. Paris: Mouton, pp. 49-78.
Gaubert, C. (2001). Stratégies et règles pour un traitement automatique minimal de l'arabe. Thèse de doctorat, Département d'arabe, Université d'Aix-en-Provence.
Gaubert, C. (2010). Kawâkib, une application web pour le traitement automatique de textes arabes. Annales islamologiques 43, Institut français d'archéologie orientale du Caire.
Jaccarini, A. (1997). Grammaires modulaires de l'arabe. Thèse de doctorat, Université de Paris-Sorbonne.
Jaccarini, A. (2010). De l'intérêt de représenter la grammaire de l'arabe sous la forme d'une structure de machines finies. Annales islamologiques 43, Institut français d'archéologie orientale du Caire.
Koskenniemi, K. (1983). Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki.

fig. 1: The Kawâkib Pro web-based application
Developing and Evaluating an Arabic Statistical Parser
Ibrahim Zaghloul and Ahmed Rafea
Abstract
This paper describes the development of an Arabic statistical parser using an Arabic treebank and a statistical parsing engine, and the steps followed to develop and test it. We divided the LDC2005T20 Arabic Treebank into training and testing sets: 90% of the treebank was used to train the Bikel parser package, while 10% of it was randomly selected to test the developed parser. The annotations of the testing data set were removed to convert it into plain text to be given to the trained parser. The gold testing data set was prepared by mapping its tags to the tags produced by the trained parser; this mapping was necessary in order to evaluate the parser results with a standard evaluation tool. The metrics widely applied for parser evaluation were computed for the developed parser's output. The F-measure of the developed parser was 83.66%, which is comparable to the evaluation results of well-known English parsers.
2. Related work
A lot of work has been done in the statistical parsing area. Most of the work has concentrated on parsing English as the main language, paying little or no attention to other languages. The following subsections summarize statistical parsers developed for English.

2.1 Apple Pie Parser
Apple Pie (Sekine and Grishman, 1995) extracts a grammar from Penn Treebank (PTB) v.2. The rules extracted from the PTB have S or NP on the left-hand side and a flat structure on the right-hand side. The parser is a chart parser. The parser model is simple, but it cannot handle sentences over 40 words. This parser gave 43.71% Labeled Precision, 44.29% Labeled Recall, and 90.26% Tagging Accuracy when tested on section 23 of the Wall Street Journal (WSJ) Treebank.

2.2 Charniak's Parser
Charniak presents a parser based on probabilities gathered from the WSJ part of the PTB (Charniak, 1997). It extracts the grammar and probabilities and, with a standard context-free chart-parsing mechanism, generates a set of possible parses for each sentence, retaining the one with the highest probability. The probabilities of an entire tree are computed bottom-up.
In (Charniak, 2000), he proposed a generative model based on a Markov grammar. It uses a standard bottom-up, best-first probabilistic parser to first generate possible parses before ranking them with a probabilistic model. This parser gave 84.35% Labeled Precision, 88.28% Labeled Recall, and 92.58% Tagging Accuracy when tested on section 23 of the WSJ Treebank.

2.3 Collins's Parser
Collins's statistical parser (Collins, 1996; Collins, 1997) is based on the probabilities between head-words in parse trees. Collins defines a mapping from parse trees to sets of dependencies, on which he defines his statistical model. A set of rules defines a head-child for each node in the tree. The parser is a CYK-style dynamic programming chart parser. This parser gave 84.97% Labeled Precision, 87.3% Labeled Recall, and 93.24% Tagging Accuracy when tested on section 23 of the WSJ Treebank.

2.4 Bikel Parser
Bikel based his parser on Collins model 2 (Collins, 1999) with some additional improvements and features in the parsing engine, such as layers of abstraction and encapsulation for quickly extending the engine to different languages and/or treebank annotation styles, "plug-'n'-play" probability structures, a flexible constrained parsing facility, and multithreading for use in a multiprocessor and/or multihost environment.

2.5 Stanford Parser
The Stanford Parser is an unlexicalized parser (it does not use lexical information) which rivals state-of-the-art lexicalized ones (Klein and Manning, 2003). It uses a statistical context-free grammar with state splits; the parsing algorithm is simpler and the grammar smaller. It uses a CKY chart parser which exhaustively generates all possible parses for a sentence before selecting the highest-probability tree. This parser gave 84.41% Labeled Precision, 87% Labeled Recall, and 95.05% Tagging Accuracy when tested on section 23 of the WSJ Treebank (Hempelmann et al., 2005).

3. Arabic Statistical Parser Development
This section describes the steps for generating the Arabic probabilistic grammar from an Arabic treebank. The first subsection describes the treebank used, the second shows how we divide this treebank into training and testing parts, and the third describes the generation of the probabilistic grammar.

3.1 Arabic Treebank
The Arabic Treebank we used is LDC2005T20. The treebank contains 12653 parsed Arabic sentences distributed among 600 text files representing 600 stories from the An Nahar News Agency; this corpus is also referred to as ANNAHAR. The sentence length distribution in the treebank is shown in Table (1).

Length (words)    Number of sentences
1-20              4046
21-30             2541
31-40             2121
41-50             1481
51-60             942
61-100            1257
over 100          265

Table (1): Sentence length (in words) distribution in the Arabic Treebank.

3.2 Division of Treebank
The gold standard testing set size was selected to be 10% of the treebank, which is approximately 1200 sentences; the remaining sentences were left for training. The selection of the gold standard set proceeded as follows. We first grouped all the treebank files into one file containing all sentences. Then, to avoid bias in the selection of the test sentences, we picked one sentence out of every 10: we scan the treebank and pick a sentence after counting 9 sentences, so that the selected sample is distributed over the whole treebank. The selected sentences are put in a separate gold file and all unselected sentences are put in a separate training file. After completing this step we have two files: the gold data set file and the training data set file.
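The one-sentence-in-ten selection just described can be written down directly; the sketch below is ours, not the authors' tool, and the file names and one-tree-per-line layout are assumptions made for the illustration.

# Illustrative sketch of the every-10th-sentence split described in 3.2.
# Assumes one treebank sentence (one bracketed tree) per line in "all.txt";
# the file names and this layout are assumptions made for the example.

def split_treebank(all_path="all.txt",
                   gold_path="gold.txt",
                   train_path="train.txt",
                   step=10):
    with open(all_path, encoding="utf-8") as src, \
         open(gold_path, "w", encoding="utf-8") as gold, \
         open(train_path, "w", encoding="utf-8") as train:
        for i, sentence in enumerate(src, start=1):
            # every 10th sentence (the 10th, 20th, ...) goes to the gold set
            (gold if i % step == 0 else train).write(sentence)

if __name__ == "__main__":
    split_treebank()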
3.3 Parser Training
The training data set, which contains approximately 11400 sentences, is given to the Bikel parsing package to generate the Arabic probabilistic grammar. This grammar is used by the parser included in the parsing package to generate the parse tree for an input sentence.

4. Evaluation Methodology
The Arabic statistical parser is evaluated following these steps:
1. Select the evaluation tool.
2. Extract the test data set (remove annotations to obtain plain text) from the gold data.
3. Prepare the test data for parsing.
4. Run the parser on the extracted test data.
5. Pre-process the gold data set to meet the requirements of the evaluation tool.
The following subsections describe each of these steps in some detail.

4.1 Evaluation Tool and Output Description
The evaluation tool "Evalb"², written by Michael John Collins (University of Pennsylvania), was used to report the values of the evaluation metrics for the parsed data. Its outputs are as follows:
1) Number of sentences: the total number of sentences in the test set.
2) Number of valid sentences: the number of sentences that are successfully parsed.
3) Bracketing Recall: number of correct constituents / number of constituents in the gold file.
4) Bracketing Precision: number of correct constituents / number of constituents in the parsed file.
5) Bracketing F-measure: the harmonic mean of Precision and Recall, F-measure = 2 x (Precision x Recall) / (Precision + Recall).
6) Complete Match: percentage of sentences where recall and precision are both 100%.
7) Average Crossing: number of constituents crossing a gold-file constituent / number of sentences.
8) No Crossing: percentage of sentences which have zero crossing brackets.
9) 2 or Less Crossing: percentage of sentences which have two or fewer crossing brackets.
10) Tagging Accuracy: percentage of correct POS tags.

² http://nlp.cs.nyu.edu/evalb/
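As a concrete reading of definitions 3)-5), the following sketch computes bracketing precision, recall and F-measure from collections of labeled constituent spans. It only illustrates the formulas above; it is not the evalb implementation.

# Illustration of the bracketing metrics defined in 4.1 (not the evalb code).
# A constituent is a (label, start, end) tuple; counting is done on multisets,
# since a tree can contain the same labeled span more than once.
from collections import Counter

def bracketing_scores(gold_constituents, parsed_constituents):
    gold = Counter(gold_constituents)
    parsed = Counter(parsed_constituents)
    correct = sum((gold & parsed).values())        # matched constituents
    recall = correct / sum(gold.values())
    precision = correct / sum(parsed.values())
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

if __name__ == "__main__":
    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
    parsed = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
    print(bracketing_scores(gold, parsed))         # (0.75, 0.75, 0.75)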
4.2 Extracting the Test Data
A tool was developed to extract the test sentences from the gold standard set. It takes the gold data file and extracts the words only, so the output is a file containing the sentence words without any annotations, which is then given to the parser. Each sentence is processed separately by reading the tokens and extracting the word from each token, ignoring any additional annotations or characters.

4.3 Preparing the Test Data for Parsing
The test sentences have to be put in a form suitable for parsing. The Bikel parser accepts an input sentence in one of two formats:
1. (word1 word2 word3 ... wordn)
2. (word1(pos1) word2(pos2) word3(pos3) ... wordn(posn))
We put all the test sentences in the format that lets the parser do its own part-of-speech tagging, which is the first format.

4.4 Running the Parser
The parser was run over the 1200 test sentences using the training outputs and the parameter file for Arabic.
The parameter "pruneFactor", described below, was set to 2 instead of the default value 4 in order to increase the parsing speed. This change was made because the default value did not work well for Arabic, leading to effectively unbounded parsing times for long sentences. The total parsing time for the test set was about 25 minutes on a machine with a 3 GHz processor and 8 GB RAM.
pruneFactor is a property in the parameter file by which the parser prunes away chart entries which have low probability; the smaller the pruneFactor value, the faster the parsing.

4.5 Processing the Gold Data
The gold standard set is processed to be in the format used by the evaluation tool. The reason for this processing is that the Arabic Treebank annotation style was found to be different from the parser annotation style: in the treebank we had, the part-of-speech tags used are the morphological Arabic tags, whereas in the Bikel parser output the tags are from the original Penn Treebank tag set.
The following example shows the sentence:
"fy AlsyAsp , AlAHtmAlAt kvyrp w AlHqA}q mEqdp."
(In politics, there are many possibilities and the facts are complex.)
as represented in the LDC Treebank:
(S (S (PP (PREP fy) (NP (DET+NOUN+NSUFF_FEM_SG+CASE_DEF_GEN AlsyAsp))) (PUNC ,) (NP-SBJ (DET+NOUN+NSUFF_FEM_PL+CASE_DEF_NOM AlAHtmAlAt)) (ADJP-PRD (ADJ+NSUFF_FEM_SG+CASE_INDEF_NOM kvyrp))) (CONJ w) (S (NP-SBJ (DET+NOUN+CASE_DEF_NOM AlHqA}q)) (ADJP-PRD (ADJ+NSUFF_FEM_SG+CASE_INDEF_NOM mEqdp))) (PUNC .))
When this sentence is parsed using the Bikel parser, the following annotated sentence is produced:
(S (S (PP (IN fy) (NP (NN AlsyAsp))) (, ,) (NP (NNS AlAHtmAlAt)) (ADJP (JJ kvyrp))) (CC w) (S (NP (NN AlHqA}q)) (ADJP (JJ mEqdp))) (PUNC .))
In the Bikel parser training phase, the LDC tags are converted into the Bikel tags using the "training-metadata.lisp" file. Unfortunately this conversion is part of the grammar generation code in the Bikel package. Consequently we had to develop a separate program that converts LDC tags into Bikel tags in order to test the parser. The output of this process is the gold file against which the output of the Bikel parser, run on the test data, can be evaluated.
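A separate conversion program of this kind essentially applies a tag-to-tag mapping over the pre-terminals of the gold trees. The sketch below is a hypothetical reconstruction: the two mapping entries are read off the example trees in this section, and the rest of the real table (encoded in Bikel's training-metadata.lisp) is not reproduced here.

# Hypothetical sketch of the LDC-to-PTB tag conversion described above.
# Only two mappings, taken from the example trees in this section, are shown;
# the full table used by the authors (via training-metadata.lisp) is larger.
import re

LDC_TO_PTB = {
    "DET+NOUN+NSUFF_FEM_SG+CASE_DEF_GEN": "NN",
    "DET+NOUN+NSUFF_FEM_PL+CASE_DEF_NOM": "NNS",
    # ... remaining morphological tags would be listed here ...
}

def convert_tags(bracketed_tree):
    """Replace each pre-terminal LDC tag with its PTB equivalent,
    leaving unknown tags untouched."""
    def repl(match):
        tag = match.group(1)
        return "(" + LDC_TO_PTB.get(tag, tag) + " "
    # a pre-terminal looks like "(TAG word)"; TAG contains no spaces or parens
    return re.sub(r"\(([^\s()]+) (?=[^\s()]+\))", repl, bracketed_tree)

if __name__ == "__main__":
    gold = "(NP (DET+NOUN+NSUFF_FEM_PL+CASE_DEF_NOM AlAHtmAlAt))"
    print(convert_tags(gold))   # (NP (NNS AlAHtmAlAt))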
5. Results and Analysis
We applied the evaluation tool to the whole test set with no length restriction to test the overall quality, and then repeated the evaluation to see how the metric values change for different sentence lengths, trying to analyze the reasons for the drop in accuracy of some metrics at some lengths. The results for maximum sentence lengths of 100, 60, 40 and 10 words are shown in Table (3).

Metric                  <=100   <=60    <=40    <=10
Number of sentences     1180    1089    888     197
Bracketing Recall       83.24   83.49   83.80   81.29
Bracketing Precision    85.07   85.15   85.44   77.31
Bracketing F-measure    84.14   84.31   84.61   79.25
Complete match          19.24   20.75   25.00   45.69
Average crossing        2.63    2.20    1.63    0.20
No crossing             45.42   48.58   55.29   87.82
Tagging accuracy        99.10   99.02   98.81   97.25

Table (3): Evalb outputs for the different sentence lengths.

5.1 Analysis of the Results
The best accuracy of the parser appears for sentences in the "at most forty words" category, which has the highest F-measure value.
Some metric values drop in the "at most ten words" category, namely Recall, Precision, F-measure and Tagging Accuracy, while the Complete Match and No Crossing metrics go up for this category. The first values go down because short sentences are more sensitive to any single error: a 5-word sentence with one wrong bracket or tag scores 80%, whereas a 40-word sentence with 5 wrong brackets or tags still scores 87.5%. On the other hand, the chance of a complete match increases for shorter sentences because they contain a smaller number of brackets.
7. References
Bikel, Daniel M. 2002. Design of a multi-lingual, parallel-processing statistical parsing engine. In Proceedings of HLT 2002, San Diego, CA.
Bikel, Daniel M. 2004. On the parameter space of generative lexicalized statistical parsing models. Ph.D. thesis, University of Pennsylvania.
Charniak, Eugene. 1996. Tree-bank grammars. Technical Report CS-96-02, Department of Computer Science, Brown University.
Charniak, Eugene. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence, pages 598-603, Providence, RI.
Charniak, Eugene. 1997. Statistical techniques for natural language parsing. AI Magazine 18(4), 33-43.
Charniak, Eugene. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st NAACL, pages 132-139, Seattle, Washington, April 29 to May 4.
Collins, Michael John. 1997. Three generative lexicalized models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL, pages 16-23.
Collins, Michael John. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Hempelmann, Christian F., Rus, Vasile, Graesser, Arthur C. and McNamara, Danielle S. 2005. Evaluating state-of-the-art treebank-style parsers for Coh-Metrix and other learning technology environments. In Proceedings of the Second ACL Workshop on Building Educational Applications Using NLP, Ann Arbor, Michigan, pages 69-76.
Klein, D. and Manning, C. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pages 423-430.
Roark, Brian and Sproat, Richard. 2006. Computational Approaches to Morphology and Syntax. Oxford University Press.
Sekine, S. and Grishman, R. 1995. A corpus-based probabilistic grammar with only two non-terminals. In Proceedings of the International Workshop on Parsing Technologies.
Venable, Peter. 2003. Modeling Syntax for Parsing and Translation. Ph.D. thesis, Carnegie Mellon University.
A Dependency Grammar for Amharic
Michael Gasser
School of Informatics and Computing
Indiana University, Bloomington, Indiana USA
gasser@cs.indiana.edu
Abstract
There has been little work on computational grammars for Amharic or other Ethio-Semitic languages and their use for parsing and
generation. This paper introduces a grammar for a fragment of Amharic within the Extensible Dependency Grammar (XDG) framework
of Debusmann. A language such as Amharic presents special challenges for the design of a dependency grammar because of the complex
morphology and agreement constraints. The paper describes how a morphological analyzer for the language can be integrated into the
grammar, introduces empty nodes as a solution to the problem of null subjects and objects, and extends the agreement principle of XDG
in several ways to handle verb agreement with objects as well as subjects and the constraints governing relative clause verbs. It is shown
that XDG’s multiple dimensions lend themselves to a new approach to relative clauses in the language. The introduced extensions to
XDG are also applicable to other Ethio-Semitic languages.
3. Relevant Amharic Morphosyntax
3.1. Verb morphology
As in other Semitic languages, Amharic verbs are very complex (see Leslau (1995) for an overview), consisting of a stem and up to four prefixes and four suffixes. The stem in turn is composed of a root, representing the purely lexical component of the verb, and a template, consisting of slots for the root segments and for the vowels (and sometimes consonants) that are inserted around and between these segments. The template represents tense, aspect, mood, and one of a small set of derivational categories: passive-reflexive, transitive, causative, iterative, reciprocal, and causative reciprocal. For the purposes of this paper, we will consider the combination of root and derivational category to constitute the verb lexeme. Each lexeme can appear in four different tense-aspect-mood (TAM) categories, conventionally referred to as perfect(ive), imperfect(ive), jussive/imperative, and [...]

hakimun
doctor-DEF-ACC
'the doctor (as object of a verb)' (3)

lehakimu
to-doctor-DEF
'to the doctor' (4)

However, when a noun is modified by one or more adjectives or relative clauses, it is the first modifier that takes these affixes (Kramer, 2009).

¹ We use YAML syntax (http://www.yaml.org/) for lexical entries.
² Amharic is written using the Ge'ez script. While there is no single agreed-on standard for romanizing the language, the SERA transcription system, which represents Ge'ez graphemes using ASCII characters (Firdyiwek and Yaqob, 1997), is common in computational work on Amharic and is used in this paper. This transcription system represents the orthography directly, failing to indicate phonological features that the orthography does not encode, in particular consonant gemination and the presence of the epenthetic vowel that breaks up consonant clusters.
³ In the interest of simplification, indirect objects will be mostly ignored in this paper. Most of what will be said about direct objects also applies to indirect objects.
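To make the root-and-template composition of Section 3.1 concrete, here is a toy sketch of stem formation. The root, the template notation and the resulting forms are schematic assumptions for the illustration; they are not drawn from the paper or from HornMorpho.

# Toy illustration of stem formation as root + template, as described in 3.1.
# The root, template notation and output below are schematic assumptions for
# the example; they do not reproduce the paper's analyses or HornMorpho.

def apply_template(root, template):
    """Interdigitate the consonants of a (triliteral) root into a template in
    which the digits 1, 2, 3 stand for the root consonants."""
    assert len(root) == 3, "sketch handles triliteral roots only"
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch
                   for ch in template)

if __name__ == "__main__":
    root = ("s", "b", "r")            # schematic triliteral root
    perfective = "1e22e3e"            # schematic perfective template
    imperfective = "1e23"             # schematic imperfective stem template
    print(apply_template(root, perfective))    # sebbere
    print(apply_template(root, imperfective))  # sebr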
these affixes (Kramer, 2009). If a noun takes a determiner, • The head noun of a noun phrase with an adjective or
the noun phrase needs no other indication of definiteness, relative clause modifier is optional.
but it is the determiner that takes the accusative suffix or
prepositional prefix. tlqun ’merTalehu
big-DEF-ACC I-choose
senefu hakim
lazy-DEF doctor ‘I choose the big one.’ (13)
‘the lazy doctor’ (5)
yemiwedat alderesem
lesenefu hakim REL-he-likes-her he-didn’t-arrive
to-lazy-DEF doctor ‘(He) who likes her didn’t arrive. (14)
‘to the lazy doctor’ (6)
Headless relative clauses are found in many lan-
yann senef hakim
that-ACC lazy doctor
'that lazy doctor (as object of a verb)' (7)

3.3. Relative clauses
Relative clauses in Amharic consist of a relative verb and zero or more arguments and modifiers of the verb, as in any clause. A relative verb is a verb in either the imperfective or perfective TAM with a prefix indicating relativization. As with a main clause verb, a relative verb must agree with its subject and may agree with its direct object if it has one. Both subjects and objects can be relativized.

yemiwedat sEt
REL-he-likes-her woman
'the woman that he likes' (8)

yemiwedat wend
REL-he-likes-her man
'the man who likes her' (9)

As noted above, when a noun is modified by a relative clause and has no preceding determiner, it is the relative clause that takes suffixes indicating definiteness or accusative case or prepositional prefixes.

yetemereqew lj wendmE new
REL-he-graduated-DEF boy my-brother is
'The boy who graduated is my brother.' (10)

yetemereqewn lj alawqm
REL-he-graduated-DEF-ACC boy I-don't-know
'I don't know the boy who graduated.' (11)

When a sequence of modifiers precedes a noun, it is the first one that takes the suffixes or prefixes.4

yetemereqew gWebez lj
REL-he-graduated-DEF clever boy
'the clever boy who graduated' (12)

Because the first modifier of a noun determines the syntactic role of the noun phrase in the clause as well as its definiteness, we will treat this modifier, rather than the noun, as the syntactic head of the noun phrase. There are at least two other reasons for doing this.

• …languages, for example, in the English translation of sentence (14). What makes Amharic somewhat unusual is that headless relative clauses and adjectives functioning as noun phrases can be formed by simply dropping the noun.

• Relative verbs agree with the main clause verbs that contain them. For example, in example (14) above, the third person singular masculine subject in the main clause verb agrees with the third person singular masculine subject of the relative clause verb.

Therefore we interpret relative clause modifiers as syntactic heads of Amharic nouns. Because XDG offers the possibility of one or more dimensions for semantics as well as syntax, it is straightforward to make the noun the semantic head, much as auxiliary verbs function as syntactic heads while the main verbs they accompany function as semantic heads in Debusmann's XDG grammar of English. This is discussed further below.

4. XDG for Amharic
In its current incomplete version, our Amharic grammar has a single layer for syntax and a single layer for semantics. The Syntax dimension handles word order, agreement, and syntactic valency.5 The Semantics dimension handles semantic valency.

Because the grammar still does not cover some relatively common structures such as cleft sentences and complement clauses, the parser has not yet been evaluated on corpus data.

4.1. Incorporating morphology
For a language like Amharic, it is impractical to list all wordforms in the lexicon; a verb lexeme can appear in more than 100,000 wordforms. Instead we treat the lexeme/lemma as the basic unit; for nouns this is their stem.6

4 With two adjectives, both may optionally take the affixes (Kramer, 2009). We consider this to fall within the realm of coordination, which is not handled in the current version of the grammar described in this paper.
5 Amharic word order is considerably simpler than that of a language such as English or German, and there are none of the problems of long-distance dependencies in questions and relative clauses that we find in those languages. The only non-projective structures are those in cleft sentences and sentences with right dislocation, neither of which is handled in the current version of our grammar. In a later version, we will separate a projective linear precedence layer from a non-projective immediate dominance layer, as Debusmann does for English and German (2007).
6 Unlike in most other Semitic languages, most Amharic nouns do not lend themselves to an analysis as template+root.
For verbs, as noted above, this is the root plus any derivational morphemes.

In parsing a sentence, we first run a morphological parser over each of the input words. We use the HornMorpho Amharic parser available at http://www.cs.indiana.edu/~gasser/Research/software.html and described in Gasser (2009). Given an Amharic word, this parser returns the root (for verbs only), the lemma, and a grammatical analysis in the form of a feature structure description (Carpenter, 1992; Copestake, 2002) for each possible analysis. For example, for the verb ywedatal 'he likes her', it returns the following (excluding features that are not relevant for this discussion):

'wedede', {'tam': 'impf',
           'rel': False,
           'sb': [-p1,-p2,-plr,-fem],
           'ob': [-p1,-p2,-plr,+fem]}

Figure 4 shows the parser's analysis of sentence (17).

…both. Figure 4 shows the analysis returned by our parser for the following sentence.7

yoHans ywedatal
Yohannis he-likes-her
'Yohannis likes her.' (15)

[Figure: Syntax graph (arcs root, sbj, obj) and Semantics graph (arcs root, arg1, arg2) for ዮሐንስ ይወዳታል yoHans ywedatal.]

Figure 5: Syntactic analysis of a sentence with a relative clause.

4.4. Relative clauses
As argued above, relative verbs are best treated as the heads of their noun phrases. When a relative verb has a head noun, the verb's subject, object, or indirect object feature must agree with that noun, depending on the role it plays in the verb's argument structure. In our grammar, we join the relative verb to its head noun in the Syntax dimension by an arc with a label specifying this role, that is, sbj, obj, or iobj. Since verbs are already constrained to agree with their arguments, the agreement between the relative verb …

We model the semantics of a sentence with a relative clause as a directed acyclic graph in which the shared noun has multiple verb heads. The relative clause predicate is distinguished from the main clause predicate by a rel rather than a root arc into it from the sentence root. Figure 6 shows the analysis of sentence (18) on the Semantics dimension.

[Figure 6: Semantics-dimension graph for አስቴር የምትጠላው ወንድልጅ ታመመ astEr yemtTelaw wendlj tameme, with arcs root, rel, arg1 and arg2.]

Figure 6: Semantic analysis of a sentence with a relative clause.
Relative clauses without nouns have no overt form corresponding to the shared semantic argument, so we introduce this argument as an empty node. Sentence (19) is sentence (18) with the noun wendlj 'boy' dropped. The analysis of this sentence is shown in Figure 7.

astEr yemtTelaw tameme
Aster REL-she-hates-him he-got-sick
'The one that Aster hates got sick.' (19)

[Figure 7: Syntax graph (root, sbj and obj arcs with agreement annotations sbj:sbj=^ and sbj:sbj=obj, and an empty node for the dropped noun) and Semantics graph (root, rel, arg1, arg2) for አስቴር የምትጠላው ታመመ astEr yemtTelaw tameme.]

Figure 7: Analysis of a relative clause with no modified noun.

Without further constraints, however, the grammar assigns multiple analyses to some sentences and parses some ungrammatical sentences with relative clauses. Consider the following ungrammatical sentence.

*astEr yemtTelaw wendlj tamemec
Aster REL-she-hates-him boy she-got-sick
'The boy that Aster hates (she) got sick.' (20)

This satisfies the constraint that the subject of the main verb tamemec agree with some feature of the relative verb (its subject) and the constraint that some feature of the relative verb (its object) agree with the modified noun wendlj. To exclude sentences like this, we need a further XDG principle, which we call the Cross-Agreement Principle. This specifies a fundamental fact about relative clauses in all languages, that the same noun functions as an argument of two different verbs, the main clause verb and the relative verb. The Cross-Agreement Principle forces the same feature of the relative verb to agree with the main clause verb and the modified noun. By this principle our parser finds no analysis for sentence (20) because the feature of the relative verb yemtTelaw that agrees with the modified noun (its object) differs from the feature that agrees with the main verb (its subject). This is illustrated in Figure 8. The grammar fails to parse this sentence because the features marked with red boxes do not agree.

[Figure 8: Syntax graph for the ungrammatical አስቴር የምትጠላው ወንድልጅ ታመመች astEr yemtTelaw wendlj tamemec, with agreement annotations sbj:sbj=sbj, sbj:sbj=^ and obj:obj=^.]

Figure 8: Violation of the Cross-Agreement Principle. The features in red boxes should match.

5. Conclusions
This paper has described an implementation of Extensible Dependency Grammar for the Semitic language Amharic. Amharic is interesting because it suffers from a serious lack of computational resources and because its extreme morphological complexity and elaborate interactions of morphology with syntax present challenges for computational grammatical theories. Besides the strongly lexical character that it shares with other dependency grammar frameworks, XDG is attractive because of the modularity offered by separate dimensions. We have seen how this modularity permits us to handle the agreement constraints on a relative verb by treating such verbs as the heads of noun phrases on the Syntax, but not the Semantics dimension. We have also seen that XDG requires some augmentation to deal with null subjects and objects and the intricacies of verb agreement. These complexities of Amharic are not unique. Much of what has been said in this paper also applies to other Ethio-Semitic languages such as Tigrinya. In addition to expanding the coverage of Amharic, further work on this project will be directed at developing synchronous XDG grammars to support translation between the different Semitic languages spoken in Ethiopia and Eritrea.

6. References
Bob Carpenter. 1992. The Logic of Typed Feature Structures. Cambridge University Press, Cambridge.
Ann Copestake. 2002. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, CA, USA.
Ralph Debusmann, Denys Duchier, and Geert-Jan M. Kruijff. 2004. Extensible dependency grammar: A new methodology. In Proceedings of the COLING 2004 Workshop on Recent Advances in Dependency Grammar, Geneva/SUI.
Ralph Debusmann. 2007. Extensible Dependency Grammar: A Modular Grammar Formalism Based On Multigraph Description. Ph.D. thesis, Universität des Saarlandes.
Kais Dukes, Eric Atwell, and Abdul-Baquee M. Sharaf. 2010. Syntactic annotation guidelines for the Quranic Arabic treebank. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valletta, Malta.
Yitna Firdyiwek and Daniel Yaqob. 1997. The system for Ethiopic representation in ASCII. URL: citeseer.ist.psu.edu/56365.html.
Michael Gasser. 2009. Semitic morphological analysis and generation using finite state transducers with feature structures. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 309–317, Athens, Greece.
Ruth Kramer. 2009. Definite Markers, Phi Features, and Agreement: a Morphosyntactic Investigation of the Amharic DP. Ph.D. thesis, University of California, Santa Cruz.
Wolf Leslau. 1995. Reference Grammar of Amharic. Harrassowitz, Wiesbaden, Germany.
Marwan Odeh. 2004. Topologische Dependenzgrammatik fürs Arabische. Technical report, Saarland University. Forschungspraktikum.
A Syllable-based approach to verbal morphology in Arabic
Lynne Cahill
University of Brighton
NLTG, Watts Building, Lewes Rd, Brighton BN2 4GJ, UK
E-mail: L.Cahill@brighton.ac.uk
Abstract
The syllable-based approach to morphological representation (Cahill, 2007) involves defining fully inflected morphological forms according to their syllabic structure. This permits the definition, for example, of distinct vowel constituents for inflected forms where an ablaut process operates. Cahill (2007) demonstrated that this framework was capable of defining standard Arabic templatic morphology, without the need for different techniques. In this paper we describe a further development of this lexicon, which includes a larger number of verbs, a complete account of the agreement inflections, and an account of one of the oft-cited problems for Arabic morphology, the weak forms. Further, we explain how the use of this particular lexical framework permits the development of lexicons for the Semitic languages that are easily maintainable and extendable and that can represent dialectal variation.
(see e.g. Kiraz (2000)).

The stem formation has already been shown (Cahill, 2007) to be elegantly definable using an approach which was developed mainly for defining European languages such as English and German. We will describe this technique in the next section. However, Semitic morphology, and specifically the morphology of MSA, involves other word formation and inflection processes. One of the areas that has attracted a good deal of attention is the issue of what happens when the verb root, traditionally assumed to consist of three consonants, does not fit this pattern. The three principal situations where this happens are the case of biliteral or quadriliteral roots, where there are either two or four consonants instead of the expected three, and the weak roots, where one of the consonants is a "weak" glide, i.e. either /w/ or /j/.

Where a root has only two consonants, one or other of those consonants is used as the third (middle) consonant, which one depending on the stem shape. Where a root has four consonants, the possible forms are restricted to forms where there are at least four consonant "slots". Early accounts of these types of root include a range of means of "spreading", where post-lexical processes have to be invoked to copy one or other of the consonants (see, e.g., Yip (1988)).

The issue of bi- and quadriliteral roots is relatively simply handled within the syllable-based framework, as described in section 4 below. The weak roots are slightly more complex, but nevertheless amenable to definition in a similar way to the kind of phonological conditioning seen, for example, in German final consonant devoicing, where the realisation of the final consonant of a stem depends on whether it is followed by a suffix beginning with a vowel or not. The Syllable-based Morphology framework has been developed to allow the realisation of fully inflected forms to be determined in part by phonological characteristics of the root or stem in question. This means that, while Arabic weak roots are often cited as behaving differently morphologically, we argue that they behave entirely regularly morphologically, but their behaviour is determined by their phonology.

3. Syllable-based morphology
The theory of syllable-based morphology (SBM) can trace its roots back to the early work of Cahill (1990). The initial aim was to develop an approach to describing morphological alternation that could be used for all languages and all types of morphology. Cahill's doctoral work included a very small indicative example of how the proposed approach could describe Arabic verbal stem formation. The basic idea behind syllable-based morphology is simply that one can use syllable structure to define all types of stem alternation, including simple vowel alternations such as ablauts. All stems are defined by default as consisting of a string of tree-structured syllables. Each syllable consists of an onset and a rhyme, and each rhyme of a peak and a coda2. The simplest situation is where all wordform stems of a particular lexeme are the same. In this case, we can simply specify the onsets, peaks and codas for all of the syllables. For example, the English word "pit" has the root /pIt/ and this is also its stem for all forms (singular, plural and possessive). The phonological structure of this word in an SBM lexicon would therefore be defined as follows3:

<phn syl1 onset> == p
<phn syl1 peak> == I
<phn syl1 coda> == t

This example is monosyllabic, but polysyllabic roots involve identifying individual syllables by counting from either the left or right of a root. For suffixing languages, the root's syllables are counted from the right, while for prefixing languages, they are counted from the left. For Arabic, although both pre- and suffixing processes occur, the decision has been made to count from the right, as there is more suffixation. However, as the roots in Arabic, to all intents and purposes, always have the same number of syllables, it is not important whether we choose to call the initial syllable syl1 or syl2.

In the case of simple stem alternations such as ablaut, the peak of a specified syllable is defined as distinct for the different wordforms. That is, the realisation of the peak is determined by the morphosyntactic features of the form. To use a simple example, for an English word man, which has the plural men, we can specify in its lexical entry:

<phn syl1 peak sing> == a
<phn syl1 peak plur> == E.

As the individual consonants and vowels are defined separately for any stem, the situation for Arabic is actually quite straightforward. For each verb form, inflected or derived, the consonants and vowels are defined, not in terms of their position in a string or template, but in terms of their position in the syllable trees. Thus, Cahill (2007) describes how the three consonants can be positioned as the onset or coda of different syllables. The vowels are defined in terms of tense/aspect.

Figure 1 shows how the (underspecified) root structure for the root katab looks. This is defined in DATR as follows4:

<phn syl2 onset> == Qpath:<c1>

2 The term "peak" is used to refer to the vowel portion of the syllable, rather than the sometimes used "nucleus". The syllable structure is relatively uncontroversial, having been first proposed by Pike and Pike (1947).
3 We use the lexical representation language DATR (Evans and Gazdar, 1996) to represent the inheritance network and use SAMPA (Wells, 1989) to represent phonemic representations.
4 This is specified at the node for verbs, which defines all of the information that is shared, by default, by all verbs in Arabic.
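The onset/peak/coda organisation just outlined can be mimicked outside DATR; the following Python fragment is only an editorial sketch of the idea (the data structures and feature names are invented for illustration), showing an invariant stem and an ablauting one:

# Sketch of a syllable-structured entry in the spirit of SBM; the encoding
# below is illustrative, not the paper's DATR lexicon.
def realise(syllables, feats):
    # Concatenate onset/peak/coda of each syllable; a peak may be a dict
    # keyed by a morphosyntactic feature such as number (ablaut).
    out = []
    for syl in syllables:
        peak = syl['peak']
        if isinstance(peak, dict):          # ablauting peak
            peak = peak[feats['num']]
        out.append(syl['onset'] + peak + syl['coda'])
    return ''.join(out)

pit = [{'onset': 'p', 'peak': 'I', 'coda': 't'}]          # /pIt/, all forms
man = [{'onset': 'm', 'peak': {'sing': 'a', 'plur': 'E'}, 'coda': 'n'}]

print(realise(pit, {'num': 'sing'}))   # pIt
print(realise(man, {'num': 'plur'}))   # mEn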
Affixation is handled as simple concatenation, such that (syllable-structured) affixes concatenate with (syllable-structured) stems to make longer strings of structured syllables. For a simple case such as English noun plural suffixation, for example, we need to specify that a noun consists of a stem and a suffix. We then need to state that, by default, the suffix is null, and that in the case of the plural form, a suffix is added.

<mor word form> ==
    "<phn root>" "<mor suffix>"
<mor suffix> == Null
<mor suffix plur> ==
    Suffix_S:<phn form>

As we are dealing with phonological forms, we also need to specify how the suffix is realised, which is defined at the separate "Suffix_S" node5.

The issue of how many binyanim to define is an interesting one, and one we will come back to in the discussion of extending coverage to dialects of Arabic. Classical Arabic has a total of fifteen possible binyanim, while MSA makes use of ten of these standardly and two more in a handful of cases.

4.2 Agreement inflections
The next extension to the existing lexicon was to add the agreement inflections. These include prefixes and suffixes and mark the person, number and gender of the form. As noted above, the affixal inflections do not pose any particular difficulties for the syllable-based framework.

5 For more detail of this type of SBM definition for German, English and Dutch, see Cahill and Gazdar (1999a, 1999b).
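The default-plus-override pattern of the <mor suffix> equations above can also be read procedurally; a minimal sketch (illustrative Python, not the DATR node structure; the small suffix table stands in for the Suffix_S node) is:

# Word form = stem + suffix, where the suffix defaults to null and is
# overridden for the plural, mirroring the <mor suffix> equations above.
SUFFIXES = {'plur': 's'}          # stand-in for the Suffix_S node

def word_form(stem, feats):
    suffix = SUFFIXES.get(feats.get('num'), '')   # default: null suffix
    return stem + suffix

print(word_form('pIt', {'num': 'sing'}))   # pIt
print(word_form('pIt', {'num': 'plur'}))   # pIts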
The "slots" for the affixes were already defined in the original account, so it was simply a case of specifying the realisations. The exact equations required for this will not be covered in detail here, but we note that the affixes display typical levels of syncretism and default behaviour, so that, for example, we can specify that the default present tense prefix is t- as this is the most frequent realisation, but the third person present tense prefix is y-, while the third person feminine prefix is t-. This kind of situation occurs often in defining default inheritance accounts of inflection and is handled by means of the following three equations6:

<agr prefix pres> == t
<agr prefix pres third> == y
<agr prefix pres third femn> == t
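Read procedurally, these three equations amount to a most-specific-match lookup over feature paths; the following is an editorial Python rendering (not the DATR machinery itself):

# Default-inheritance flavour of prefix selection: the most specific
# matching feature path wins, mirroring the three <agr prefix pres ...>
# equations above.
PREFIX_EQUATIONS = {
    ('pres',): 't',                      # default present tense prefix
    ('pres', 'third'): 'y',              # overridden for third person
    ('pres', 'third', 'femn'): 't',      # overridden again for feminine
}

def agr_prefix(*features):
    # Return the prefix defined for the longest matching feature path.
    for i in range(len(features), 0, -1):
        if features[:i] in PREFIX_EQUATIONS:
            return PREFIX_EQUATIONS[features[:i]]
    return ''

print(agr_prefix('pres', 'third', 'masc'))   # 'y'
print(agr_prefix('pres', 'third', 'femn'))   # 't'
print(agr_prefix('pres', 'secnd'))           # 't' (the default)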
4.3 Non-standard verb roots
The final extension which we report in this paper is the adaptation of the framework for stem structure to take account of the different types of verb root, as discussed in section 2.

Dealing with biliteral roots involves specifying, for each consonant (i.e. onset or coda) defined in the stem structure, whether it should take the first or third consonant value if the second consonant is unspecified. Thus, biliteral roots have their second consonants defined thus:

<c2> == Undef

Then, an example of defining the correct consonant involves a simple conditional statement:

<phn syl2 onset> ==
    IF:<EQUAL:<"<c2>" Undef>
    THEN "<c1>"
    ELSE "<c2>"

This simply states that, if the second consonant is unspecified, then the first consonant takes this particular position, but if not, then the second consonant will take its normal place. In positions where the absent second consonant is represented by the third consonant, we simply require the third line above to give c3 rather than c1 as the value.

In order to handle quadriliteral roots, we need a separate node for these verbs which defines which of the consonants occupies each consonant slot in the syllable trees. In many cases these are inherited from the Verb node; for example, the first consonant behaves the same in these roots. Typically, where a triliteral root uses c1, a quadriliteral root will use c1; where a triliteral root uses c3, the quadriliteral root will use c4 in most cases, but c3 in others; where a triliteral root uses c2, the quadriliteral root will either use c2 or c3, so these equations have to be specified.

The weak roots have a glide in one of the consonant positions. This leads to phonologically conditioned variation from the standard stem forms. For example, the hollow verb zawar ("visit") has a glide in second consonant position. This leads to stem forms with no middle consonant, and a u in place of the two as. In order to allow for this variation, we need to check whether the second consonant is a glide, and this will determine the realisations. This check must be done for each onset, peak and coda that is defined as having possible variation, and involves a simple check whether the second consonant is a /w/ or a /j/. In each case the behaviour is the same for the consonant itself, i.e. it is omitted, but different for the vowel. With /w/, the vowel is /u/, but with /j/ it is /i/, in the second vowel position.

There are two possible approaches we could take to defining the different behaviour of weak verbs. The first is to specify a finite state transducer to run over the default forms. For example, we could state that if a verb root has the sequence /awa/ then this becomes /u:/. The second approach is to define the elements of the syllable structure according to the phonological context in which they occur. We opt for the second of these approaches for a number of reasons. The first is that we wish to minimise the different technologies used in our lexicon. Although FSTs are very simple to implement, we want to resist using them if possible, in order to make use only of the default inheritance mechanisms available to us. The second is that we are not yet at a stage in the project where we have enough varied data for all of the different verb and noun forms to be certain that any transducer we devise will not overapply, whereas we can be more confident of the specific generation behaviour of the inheritance mechanisms we are employing in the lexicon structure as a whole.

One disadvantage of the approach we have chosen to take is that it does result in somewhat more complex definitions in our lexical hierarchy. For example, if we only define strong triliteral verb roots, then our lexical hierarchy can include statements like:

<phn syl1 onset> == Qpath:<c1>

which are very simple. If we include all of the variation in this hierarchy then we need more statements (to distinguish between, for example, past and present tense behaviour) and those statements are more complex. This is because, even for the standard strong triliteral roots, we need to check for each consonant whether or not it is weak and, for each vowel, we need to check whether it is adjacent to a weak consonant. For this reason, we do not include the DATR code which defines the weak verb forms, but rather describe the checks needed.

The approach we take involves two levels of specification. At the first level, each equation defining a consonant or a vowel calls on a simple checking function to determine whether the realisation is the default one or something different. These calls to checking functions may take different arguments. Thus, the simplest type just needs to be passed the root consonant in question and will determine whether it is realised (if it is strong) or not (if it is weak). In more complex situations, e.g. where a weak root has /u:/ where it would by default have /awa/, we need to pass both the consonant and at least one of the vowels.

6 We have specified the present tense prefixes without the /a/, as this is present in all forms. We therefore consider that this segment is part of the present tense stem.
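A rough procedural rendering of these checks, combining the biliteral fallback of the IF/THEN/ELSE equation with the glide test described for weak roots (an editorial sketch only; the real lexicon states this declaratively in DATR), might look like:

# Choosing the consonant for the onset of the second syllable: if the
# second radical is undefined (a biliteral root), fall back on the first
# radical; a weak radical (/w/ or /j/) is not realised at all.
GLIDES = {'w', 'j'}

def syl2_onset(root):
    # root is a dict with keys c1, c2, c3; c2 is None for biliteral roots.
    c2 = root.get('c2')
    if c2 is None:            # biliteral root: the first radical fills the slot
        return root['c1']
    if c2 in GLIDES:          # weak (hollow) root: the glide is omitted
        return ''
    return c2

print(syl2_onset({'c1': 'k', 'c2': 't', 'c3': 'b'}))   # 't'  (strong katab)
print(syl2_onset({'c1': 'z', 'c2': 'w', 'c3': 'r'}))   # ''   (hollow zawar)
print(syl2_onset({'c1': 'm', 'c2': None, 'c3': 'd'}))  # 'm'  (biliteral root)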
These checks are very similar to the checks we can see in the syllable-based accounts of, for example, German (Cahill and Gazdar, 1999a). The realisation of the final consonant in any stem in German is dependent on whether or not there is a suffix which begins with a vowel. Therefore, the equation specifying that consonant checks for the beginning of the suffix (if there is one) and, for underlyingly voiced consonants, returns the voiced variant only if there is a vowel following, and the voiceless variant otherwise.

To clarify the entire process involved in generating a verb form from our lexicon, we shall now describe the derivation of the present tense active third person plural masculine form of the verb "throw" (they(m) throw). This is a weak (defective) verb, with a root of r-m-j. The first thing we do is look for the agreement prefix. Our Verb node tells us that this is /j/. Next we need to determine the stem for this form. The stem is defined as having /a/ as the peak of the first syllable (the default value for all present tense forms) and the first consonant of the root, i.e. /r/, as the coda of the first syllable. We determine this by checking whether it is weak or not. Once we have determined that it is a strong consonant, it takes its place in the syllable structure. The onset of the second syllable is the second consonant, in this case /m/, just as it is in most stems. Once again, we check that this is not a weak consonant before placing it in its position. At this point we start to find different behaviour. If the final consonant was strong then we would get a /u/ as the peak of the second syllable. However, as the final consonant, /j/, is weak, the peak is null. Similarly, the final consonant is not realised, because it is weak. So, our stem is fully realised as /arm/. Finally, we look for the agreement suffix, which is defined as /u:na/. So, our fully inflected form is /jarmu:na/. The syllable structure of the stem is shown in Figure 3.

7 The discussion here has been simplified for the sake of brevity. The first and second person stem forms are the same, and are defined here, but the third person stems are different. This is not a problem for our account, as the framework is specifically designed to allow both morphosyntactic and phonological information to be used in determining the correct form.
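The walkthrough above can be replayed as a small script; this is an editorial sketch (Python rather than DATR, with the present-tense values from the text hard-wired) covering only the present active third person plural masculine path:

# Generating present active 3rd plural masculine of a (possibly weak) root,
# following the walkthrough above: prefix, then the stem syllable by syllable
# with weak-consonant checks, then the agreement suffix /u:na/.
GLIDES = {'w', 'j'}

def pres_act_3pl_masc(c1, c2, c3):
    prefix = 'j'                                  # 3rd person present prefix
    stem = 'a' + (c1 if c1 not in GLIDES else '') # syl1: peak /a/, coda c1
    stem += c2 if c2 not in GLIDES else ''        # syl2 onset if strong
    stem += 'u' if c3 not in GLIDES else ''       # peak dropped before weak c3
    stem += c3 if c3 not in GLIDES else ''        # weak final consonant omitted
    suffix = 'u:na'                               # 3rd plural masculine
    return prefix + stem + suffix

print(pres_act_3pl_masc('r', 'm', 'j'))   # jarmu:na   (weak/defective r-m-j)
print(pres_act_3pl_masc('k', 't', 'b'))   # jaktubu:na (strong k-t-b)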
5. Future directions
The extensions we report on here are only the start of a program of research which will add nouns and other non-regular morphological forms (e.g. the broken plural). The project is also going to add orthographic forms, derived from the phonological and morphological information, and supplement these with information about the relative frequency of the ambiguous forms in unvocalised script.

5.1 Extension of the lexicon
The DATR implementation of the lexicon is based on the lexicon structure of PolyLex (Cahill and Gazdar, 1999b). This gives the lexicon two big advantages over other lexicon structures. The first is the default inheritance machinery, which allows very easy extension. It is extremely easy to add large numbers of new lexemes automatically, as long as the hierarchy defines all of the variation. The task is simply to add basic root information (the consonants and the meaning and any irregularities peculiar to that lexeme – although there should not be many irregularities in new additions, as the most frequent words will have been added, and it is usually the more frequent words which are irregular) and choose the node in the hierarchy from which it should inherit. The PolyLex project developed tools to allow the generation of large numbers of additional lexical entries from a simple database format which includes the important information.

Crucially, the use of default inheritance means that, even if we do not have all of the information available to determine the exact morphological behaviour of a particular lexeme, we can assign sensible default values. For example, if we wanted to add a new English noun to our lexicon, and we have not seen an example of that noun
in its plural form, we can add it as a regular noun and generate a plural form which adds the -s suffix. This may not be correct, but it is a reasonable guess, and the kind of behaviour we would expect from a human learning a language. This is useful if the data we use to extend our lexicons comes, for example, from corpora – often a necessity for languages which do not have large established resources.

In terms of the Arabic lexicon we describe here, the forms of verbs, even those with weak roots, do not need any further specification, as the lexical hierarchy defines the alternation in terms of the phonological structure of the root. Therefore, if a newly added root has a weak consonant, the correct behaviour will automatically be instigated by the recognition of that weak consonant.

This process has already been tested with a random selection of 50 additional strong verbs, two weak verbs for each of the consonant positions (i.e. two with weak initial consonants, two with weak medial consonants and two with weak final consonants) and one with two weak consonants. The resulting forms for some of these verbs are included in Appendix 2.

5.2 Adding more dialects
Another issue which causes much concern in the representation and processing of Arabic is the question of the different varieties or dialects. Buckwalter (2007) says "... the issue of how to integrate morphological analysis of the dialects into the existing morphological analysis of Modern Standard Arabic is identified as the primary challenge of the next decade." (p. 23). Until relatively recently, the issue of dialects in Arabic was only relevant for phonological processing, as dialects did not tend to be written. However, the rapid expansion of the Internet, amongst other developments, means that written versions of the various dialects are increasingly used, and processing of these is becoming more important.

The PolyLex architecture was developed as a multilingual representation framework, particularly aimed at representing closely related languages (the PolyLex lexicons themselves include English, German and Dutch). The framework involves making use of extended default inheritance to specify information which is shared, by default, by more than one language, with overrides being used to specify differences between languages as well as variant behaviour within a single language (such as irregular or sub-regular inflectional forms). In the case of English, German and Dutch, for example, it is possible to state that, by default, nouns form their plural by adding an -s suffix. This is true of all regular nouns in English and of one class of nouns in both Dutch and German. Importantly, those classes in Dutch and German are the classes that new nouns tend to belong to, so assuming that class to be the default works well.

One of the great advantages of such a framework is that, being designed to work for closely related languages, it is also appropriate for dialects of a single language. We can map the situation for MSA8 and the dialects onto this directly, with MSA taking the place of the multilingual hierarchy and the dialects taking the place of the separate languages. The assumption is that, by default, the dialects inherit information (about morphology, phonology, orthography, syntax and semantics) from the MSA hierarchy, but any part of that information can be overridden lower down for individual dialects. There is nothing to prevent a more complex inheritance system, for example to allow two dialects to share information below the level of the MSA hierarchy but also to specify some distinct bits of information.
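The MSA-above-dialects arrangement sketched here is essentially a chain of default lookups; an illustrative Python fragment (the node contents below are invented, not taken from PolyLex or from any actual dialect lexicon) could be:

# Default inheritance: a dialect lexicon answers a query itself if it can,
# and otherwise defers to the MSA (or Classical Arabic) hierarchy above it.
class Node:
    def __init__(self, parent=None, **facts):
        self.parent, self.facts = parent, facts
    def lookup(self, key):
        if key in self.facts:
            return self.facts[key]
        if self.parent is not None:
            return self.parent.lookup(key)
        raise KeyError(key)

msa = Node(binyanim=10, plural_suffix='u:na')     # shared defaults (illustrative)
dialect = Node(parent=msa, plural_suffix='u')     # hypothetical dialect override

print(dialect.lookup('plural_suffix'))   # 'u'  (overridden locally)
print(dialect.lookup('binyanim'))        # 10   (inherited from MSA)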
6. Conclusions
The approach to Arabic morphology presented here is still in the early stages of development. It does, nevertheless, demonstrate a number of crucial points. First, it backs Cahill (2007) in showing that the SBM approach appears to be adequate to define those aspects of Arabic morphology that have frequently been cited as problematic. It is important to establish proof of concept in employing a new approach to specifying the morphology of any language, and the (admittedly small) lexicon does demonstrate the possibility of handling bi- and quadriliteral roots as well as weak verb roots within the SBM framework. Although not all of the details for all of the verbal morphology have yet been implemented, nothing has been shown to cause any significant difficulties that cannot be overcome in the framework.

Secondly, having established that the approach appears to be feasible for the complexities of Arabic morphology, it follows that the implementation of the morphology in the form of a PolyLex-style lexicon will permit the definition of dialectal variation, thus allowing the development of a full lexicon structure defining MSA and Classical Arabic as well as regional variants in an efficient and practically maintainable way. Although the details remain to be worked out, the assumed structure would involve a core lexicon which defines, for example, all fifteen of the Classical Arabic binyanim, with each of the lexicons for a "real" language specifying which of those are employed within that language or dialect.

The PolyLex lexicon structure allows the definition of defaults, which can be overridden at any of a number of levels. It is possible to override some pieces of information for an entire language or dialect, for a word-class such as nouns, for a sub-class of nouns or verbs, or for an individual lexeme. This makes it very efficient at representing lexical information, which tends to be very largely regular.

8 It may prove more accurate and useful to have Classical Arabic in the multilingual position, as this probably includes more of the range of forms that the different dialects would need to inherit.
It also makes it very easy to add new lexemes, even if it has not been wholly established what all of the correct forms of that lexeme are. To use an analogy from child language acquisition, a child hearing an English noun will assume that its plural is -s unless and until they hear an irregular plural form for it. Similarly, a child learning Arabic will assume that a new verb it hears follows the default, regular patterns unless and until they hear non-regular forms. That is the kind of behaviour that our default inheritance lexicon models when adding new lexemes.

Appendix: Sample output
The DATR-implemented lexicon can be compiled and queried. In this appendix, we include the full lexical dumps for three lexemes: the fully regular strong triliteral, k-t-b, "write"; the weak (defective) verb r-m-y, "throw"; and the "doubly" weak verb T-w-y, "fold". The dumps give the present and past active forms for the first binyan.
Throw:<bin1 mor word past act secnd plur femn> = r a m a j t u n n a.
Throw:<bin1 mor word past act third sing masc> = r a m a a.
Throw:<bin1 mor word past act third sing femn> = r a m a t.
Throw:<bin1 mor word past act third plur masc> = r a m a w.
Throw:<bin1 mor word past act third plur femn> = r a m a j n a.
Throw:<bin1 mor word pres act first sing> = a r m i:.
Throw:<bin1 mor word pres act first plur> = n a r m i:.
Throw:<bin1 mor word pres act secnd sing masc> = t a r m i:.
Throw:<bin1 mor word pres act secnd sing femn> = t a r m i: n a.
Throw:<bin1 mor word pres act secnd plur masc> = t a r m u: n a.
Throw:<bin1 mor word pres act secnd plur femn> = t a r m i: n a.
Throw:<bin1 mor word pres act third sing masc> = j a r m i:.
Throw:<bin1 mor word pres act third sing femn> = t a r m i:.
Throw:<bin1 mor word pres act third plur masc> = j a r m u: n a.
Throw:<bin1 mor word pres act third plur femn> = j a r m i: n a.
Fold:<bin1 mor word pres act third sing femn> = t a T w i:.
Fold:<bin1 mor word pres act third plur masc> = j a T u: n a.
Fold:<bin1 mor word pres act third plur femn> = j a T w i: n a.
Using the Yago ontology as a resource for the enrichment of
Named Entities in Arabic WordNet
Lahsen Abouenour1, Karim Bouzoubaa1, Paolo Rosso2
1 Mohammadia School of Engineers, Med V University
Rabat, Morocco
2 Natural Language Engineering Lab. - ELiRF, Universidad Politécnica
Valencia, Spain
E-mail: abouenour@yahoo.fr, karim.bouzoubaa@emi.ac.ma, prosso@dsic.upv.es
Abstract
The development of sophisticated applications in the field of Arabic Natural Language Processing (ANLP) depends on the availability of resources. In previous work on Arabic Question/Answering (Q/A) systems, a semantic Query Expansion approach using Arabic WordNet (AWN) has been evaluated. The results showed that, although AWN (one of the rare available resources) has a low coverage of the Arabic language, it helps to improve performance. The evaluation process integrates a Passage Retrieval (PR) system which ranks the returned passages according to their structural similarity with the question. In this paper, we investigate the usefulness of enriching AWN by means of the Yago ontology. Preliminary experiments show that this technique helps to extend and improve the processed questions.
…the questions which contain keywords that cannot be found in AWN (non-extensible questions) and those for which the system could not reach the expected answer (non-answered questions). For the two types of questions, they investigated both the keywords forming the questions and the type of the expected answer.

The analysis showed that for a high percentage of the considered questions, both the question keywords and the answers are NEs. Hence, the enrichment of the NE content in the AWN ontology could help us to reach higher performances.

In this paper, we present an attempt to perform an automatic AWN enrichment for the NE synsets. Indeed, the use of a NER system (if such a system is available and accurate in the context of the Arabic language) only allows identifying NEs and information related to them, whereas adding NEs to AWN also helps to identify synsets which are semantically related to them (synonyms, subtypes, supertypes, etc.). Moreover, such enrichment could also be useful in the context of other ANLP and cross-language tasks.

The current work is based on the Yago5 ontology, which contains 2 million entities (such as persons, organizations, cities, etc.). This ontology (Suchanek et al., 2007) contains 20 million facts about these entities. The main reasons behind using this resource are:
• its large coverage of NEs can help to improve performances in the context of Arabic Q/A systems;
• its connection to the PWN and the SUMO ontology (Gerard et al., 2008) can help us to transfer the large coverage of Yago to the AWN ontology.

The rest of the paper is structured as follows: Section 2 describes works using AWN; Section 3 presents the technique proposed for the AWN enrichment; Section 4 is devoted to the presentation of the preliminary experiments that we have conducted on the basis of the Yago content; in Section 5 we draw the main conclusions of this work and we discuss future work.

2. Arabic WordNet in previous works
There are many works that have integrated AWN as a lexical or a semantic resource. To our knowledge, most of these works belong to the Arabic IR and Q/A fields. Indeed, in (El Amine, 2009), AWN has been used as a lexical resource for a QE process in the context of the IR task.

In the context of Q/A systems, the authors in (Brini et al., 2009) have proposed an Arabic Q/A system called QASAL. They have reported that it will be necessary in future works to consider the synonymy relations between AWN synsets at the question analysis stage of the proposed system. In (Benajiba et al., 2009), the authors have reported that the use of AWN would allow exploring the impact of semantic features for the Arabic Named Entity Recognition (NER) task, which is generally included in the first question analysis step of a Q/A process (generally composed of three steps: question analysis, passage retrieval and answer extraction).

In (Abouenour et al., 2008; Abouenour et al., 2009a), the authors have shown how it is possible to build an ontology for QE and semantic reasoning in the context of the Arabic Q/A task. In addition, the usefulness of AWN as a semantic resource for QE has been proved in the recent work of (Abouenour et al., 2009a), where the authors have considered not only the lexical side of AWN, but also its semantic and knowledge parts. Moreover, the QE process based on AWN has been used together with a structure-based technique for Passage Retrieval (PR). Indeed, the first step of our approach is retrieving a large number of passages which could contain the answer to the entered question. Generally, the answer is expected to appear in those passages nearer to the other keywords of the question or to the terms which are semantically related to those keywords. Therefore, new queries were generated from the question by replacing a keyword by its related terms in AWN, with regard to the four semantic relations mentioned previously.
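The keyword-replacement step just described can be pictured with a small sketch (illustrative Python only; get_related is a stub standing in for the lookup of related AWN terms under the four semantic relations):

# Sketch of the query-expansion step: for each question keyword, emit one
# new query per related AWN term; the relation table below is a toy stub.
def expand_queries(keywords, get_related):
    queries = [list(keywords)]                 # the original query
    for i, kw in enumerate(keywords):
        for term in get_related(kw):
            alt = list(keywords)
            alt[i] = term                      # replace one keyword
            queries.append(alt)
    return queries

related = {'born': ['birth']}                  # hypothetical related terms
print(expand_queries(['Johnson', 'born'], lambda w: related.get(w, [])))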
In the second step of the described approach, the returned passages have to be ranked according to the structural similarity between the passages and the question. Thus, this step allows decreasing the number of passages to be processed at the answer extraction stage.

The conducted experiments showed an improvement of performances thanks to our two-step approach based on the AWN ontology for QE and the Java Information Retrieval System6 (JIRS) (Gomez et al., 2007) for structure-based PR. The analysis of the obtained results showed that:
• a high percentage (46.2%) of the TREC and CLEF questions involve NEs;
• the enrichment of the NE content in AWN will allow extending 69% of the non-extensible questions;
• for a high percentage of the considered questions (50%), we can reach a similarity (between the question and the passages) equal to or higher than 0.9, and an average of 0.95 (the maximum is 1), by using AWN together with JIRS.

5 Yet Another Great Ontology, available at http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html
6 http://jirs.dsic.upv.es
Thus, according to this analysis, the priority in terms of AWN enrichment is clear: in order to evaluate the QE and structure-based approach, we have to enlarge and refine the coverage, hierarchy and relations related to the NE synsets in AWN.

In the next section, we describe how resources belonging to other languages could be used for the enrichment of the NE content in AWN.

3. Enrichment of Arabic WordNet using Yago
Given the great number of words of the Modern Standard Arabic (MSA) language, the current release of AWN, which has been built manually, still has to be enlarged. Automatic enrichment is a promising way for AWN to reach a large coverage of MSA. In this context, the authors in (Al Khalifa and Rodriguez, 2009) have proposed a new approach for automatically extending the NE coverage of AWN. This approach relies on Wikipedia7. The evaluation done in that work shows that 93.3% of the NE synsets which were automatically recovered are correct. However, due to the small size of the Arabic Wikipedia, only 3,854 Arabic NEs have been recovered.

Our approach proposes using a freely available ontology with a large coverage of NEs instead of the Arabic Wikipedia. In addition to Yago, the field of open source ontologies provides interesting resources and attempts which belong either to the specific or to the open domain category: OpenCyc (Matuszek et al., 2006), KnowItAll (Etzioni et al., 2004), HowNet8, SNOMED9, GeneOntology10, etc.

For the purpose of the current work, we have been interested in using Yago for the following reasons (Suchanek et al., 2007):
• it covers a great number of individuals (2 million NEs);
• it has a near-human accuracy of around 95%;
• it is built from WordNet and Wikipedia;
• it is connected with the SUMO ontology;
• it exists in many formats (XML, SQL, RDF, Notation 3, etc.) and is available with tools11 which facilitate exporting and querying it.

The Yago ontology contains two types of information: entities and facts. The former are NE instances (from Wikipedia) and concepts (from WordNet), whereas the latter are facts which set a relation between these entities. To our knowledge, Yago has been used as a semantic resource in the context of IR systems (Pound et al., 2009).

As we are interested in enriching the NE content of AWN, a translation stage has to be considered in our process. In (Al Khalifa and Rodriguez, 2009), the authors used the Arabic counterpart of the English Wikipedia pages as a translation technique. In the current work, we consider instead the Google Translation API12 (GTA), because its coverage of NEs written in Arabic is higher than that of the Arabic Wikipedia. In addition, translating a word using the GTA is faster. Indeed, the result of a translation using the Arabic Wikipedia needs to be disambiguated, as many possible words are returned. This is not the case for the GTA.

The enrichment concerns both adding new individuals (NEs) and adding their supertypes. These supertypes are very important and useful in our QE process combined with the structure-based PR system (JIRS). In order to show this usefulness, let us consider the example of the TREC question "When was Lindon Johnson born?". When we query a search engine using this question, the two following passages could be returned:
Passage 1: The year 1908 which is the year of birth of Lindon Johnson ...
Passage 2: The American president Lindon Johnson was born in 27 August 1908 ...

According to the two passages above, the JIRS system will consider the first passage as being the most relevant. Indeed, since the two passages contain the keywords of the question, the similarity of the structure of each passage to the structure of the question is the criterion used to compare them. The second passage contains a structure similar to the question with two additional terms (which are not among the question keywords), whereas in the first passage only one additional term appears (fyh). Therefore, the first passage is considered more similar to the question than the second one. After enriching AWN with the NE Lindon Johnson and its supertypes such as r}ys >mryky (US President), we can consider, in the query processed by JIRS, the extended form of the question, where the NE is preceded by its supertype. In this case, the two terms of the supertype are considered as being among the question keywords. Hence, the structure of the second passage would then be considered by JIRS as the most similar to the structure of the question.

7 www.wikipedia.org/
8 www.keenage.com/html/e_index.html
9 www.snomed.org
10 www.geneontology.org
11 http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html
12 http://code.google.com/p/google-api-translate-java/
The second passage is the one containing the expected answer in a structure which can be easy to process by the answer extraction module. In order to enrich the NE content in AWN, we have adopted an approach composed of seven steps. Figure 1 below illustrates these steps.

After performing steps 3 and 4, 374 distinct NEs (79%) have been identified within the Yago ontology. A total of 59,747 facts concern the identified Yago entities, with an average of 160 facts per entity. The average confidence of these facts is around 0.97 (the maximum is 1). The Yago ontology contains 96 relations. We have identified 43 relations in the facts corresponding to the NEs extracted from the considered questions. The TYPE relation is the first one to be considered in our approach for the enrichment of NEs in AWN. For the purpose of the current work, we have considered only the facts containing a TYPE relation between a Yago entity and a WordNet concept. From the 374 NEs identified in Yago, 204 of them (around 55%) have a TYPE relation with a WordNet concept.
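The mapping step itself can be sketched as follows (illustrative Python only; translate_to_english, yago_type_facts and the awn object are placeholders for the Google Translation API, the Yago fact base and the AWN update mechanism actually used in this work):

# Sketch of the enrichment step: translate an Arabic NE, find it in Yago,
# keep only TYPE facts pointing at WordNet-derived concepts, and record
# those concepts as supertypes for the new AWN entry.  All helpers are
# stubs passed in by the caller; the 'wordnet_' prefix test assumes the
# naming convention of the public YAGO dumps.
def enrich_awn_with_ne(arabic_ne, translate_to_english, yago_type_facts, awn):
    english_ne = translate_to_english(arabic_ne)          # e.g. via the GTA
    supertypes = [concept for concept in yago_type_facts(english_ne)
                  if concept.startswith('wordnet_')]      # keep WordNet targets
    if supertypes:
        awn.add_named_entity(arabic_ne, supertypes=supertypes)
    return supertypes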
…WordNet on the one hand and between WordNet and AWN on the other hand. In the preliminary experiments that we have conducted, we have considered the previous semantic QE approach, which now relies on the new content of AWN. These experiments show an improvement in terms of accuracy, MRR and the number of answered questions.

In the current work, we have considered only the relations of Yago which allow a direct mapping between its entities and the AWN synsets. Therefore, considering the other relations and the whole content of Yago is among the intended future works.

Acknowledgement
This research work is the result of the collaboration in the framework of the bilateral Spain-Morocco AECID-PCI C/026728/09 research project. The third author also thanks the TIN2009-13391-C04-03 research project.

References
Abouenour L., Bouzoubaa K., Rosso P. 2008. Improving Q/A Using Arabic WordNet. In Proceedings of the 2008 International Arab Conference on Information Technology (ACIT'2008), Tunisia, December.
Abouenour L., Bouzoubaa K., Rosso P. 2009. Three-level approach for Passage Retrieval in Arabic Question/Answering Systems. In Proceedings of the 3rd International Conference on Arabic Language Processing CITALA2009, Rabat, Morocco, May 2009.
Abouenour L., Bouzoubaa K., Rosso P. 2009. Structure-based evaluation of an Arabic semantic Query Expansion using the JIRS Passage Retrieval system. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, EACL-2009, Athens, Greece.
Al Khalifa M. and Rodríguez H. 2009. Automatically Extending NE coverage of Arabic WordNet using Wikipedia. In Proceedings of the 3rd International Conference on Arabic Language Processing CITALA2009, Rabat, Morocco, May 2009.
Benajiba Y., Diab M., Rosso P. 2009. Using Language Independent and Language Specific Features to Enhance Arabic Named Entity Recognition. IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Processing Morphologically Rich Languages, Vol. 17, No. 5, July 2009.
Brini W., Ellouze M., Hadrich Belguith L. 2009. QASAL: Un système de question-réponse dédié pour les questions factuelles en langue Arabe. In 9ème Journées Scientifiques des Jeunes Chercheurs en Génie Electrique et Informatique, Tunisia. (In French.)
El Amine M. A. 2009. Vers une interface pour l'enrichissement des requêtes en arabe dans un système de recherche d'information. In Proceedings of the 2nd Conférence Internationale sur l'informatique et ses Applications (CIIA'09), Saida, Algeria, May 3-4, 2009.
Elkateb S., Black W., Vossen P., Farwell D., Rodríguez H., Pease A., Alkhalifa M. 2006. Arabic WordNet and the Challenges of Arabic. In Proceedings of the Arabic NLP/MT Conference, London, U.K.
Etzioni O., Cafarella M. J., Downey D., Kok S., Popescu A.-M., Shaked T., Soderland S., Weld D. S., and Yates A. 2004. Web-scale information extraction in KnowItAll. In WWW, 2004.
Fellbaum C. 2000. WordNet: An Electronic Lexical Database. MIT Press, cogsci.princeton.edu/~wn, September 7.
Gerard D. M., Suchanek F. M., Pease A. 2008. Integrating YAGO into the Suggested Upper Merged Ontology. In 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2008), Dayton, Ohio, USA.
Gómez J. M., Rosso P., Sanchis E. 2007. Re-ranking of Yahoo snippets with the JIRS Passage Retrieval system. In Proceedings of the Workshop on Cross Lingual Information Access, CLIA-2007, 20th Int. Joint Conf. on Artificial Intelligence, IJCAI-07, Hyderabad, India, January 6-12.
Matuszek C., Cabral J., Witbrock M., and De Oliveira J. 2006. An introduction to the syntax and content of Cyc. In AAAI Spring Symposium, 2006.
Niles I., Pease A. 2003. Linking Lexicons and Ontologies: Mapping WordNet to the Suggested Upper Merged Ontology. In Proceedings of the 2003 International Conference on Information and Knowledge Engineering, Las Vegas, Nevada.
Pound J., Ilyas I. F., and Weddell G. 2009. QUICK: Queries Using Inferred Concepts from Keywords. Technical Report CS-2009-18, Waterloo, Canada.
Rodríguez H., Farwell D., Farreres J., Bertran M., Alkhalifa M., Antonia Martí M., Black W., Elkateb S., Kirk J., Pease A., Vossen P., and Fellbaum C. 2008. Arabic WordNet: Current State and Future Extensions. In Proceedings of the Fourth International Global WordNet Conference - GWC 2008, Szeged, Hungary, January 22-25, 2008.
Suchanek F. M., Kasneci G., Weikum G. 2007. YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of the 16th WWW, pp. 697-706.
Light Morphology Processing for Amazighe Language
Fadoua Ataa Allah, Siham Boulaknadel
CEISIC, IRCAM
Avenue Allal El Fassi, Madinat Al Irfane, Rabat, Morocco
E-mail: {ataaallah, boulaknadel}@ircam.ma
Abstract
With the aim of enabling automatic processing of the Amazighe language and its integration in the field of Information and Communication Technology, the Royal Institute of Amazighe Culture (IRCAM) has opted for an innovative approach of progressive realizations. Thus, since 2003, researchers in the Computer Sciences Studies, Information Systems and Communications Center (CEISIC) have paved the way for elaborating linguistic resources, basic natural language processing tools, and other advanced scientific research, starting with encoding the Tifinaghe script and developing typefaces.
In this context, this paper presents a computational stemming process that reduces words to their stems. The process consists in splitting Amazighe words into a constituent stem part and affix parts without performing complete morphological analysis. This light stemming approach conflates word variants into a common stem so that they can be used in natural language applications such as indexing, information retrieval systems, and classification.
The remainder of the paper is organized as follows: in Section 2, we give a brief description of the Moroccan standard Amazighe language. Then, in Section 3, we give an overview of the Amazighe language characteristics. In Section 4, we present our light stemming algorithm. Finally, Section 5 gives general conclusions.

Furthermore, the IRCAM has recommended the use of the international symbols for punctuation markers: " " (space), ".", ",", ";", ":", "?", "!", "…"; the standard numerals used in Morocco (0, 1, 2, 3, 4, 5, 6, 7, 8, 9); and the horizontal direction from left to right for Tifinaghe writing (Ameur et al., 2004).
3. Amazighe Language 3.3 Particles
Characteristics In Amazighe language, particle is a function word that is
The purpose of this section is to give an overview of the not assignable to noun neither to verb. It contains
morphological properties of the main syntactic amazighe pronouns; conjunctions; prepositions; aspectual,
categories, which are the noun, the verb, and the particles orientation and negative particles; adverbs; and
(Boukhris et al., 2008; Ameur et al., 2004). subordinates. Generally, particles are uninflected words.
However in Amazighe, some of these particles are
3.1 Noun flectional, such as the possessive and demonstrative
In Amazighe language, noun is a lexical unit, formed pronouns (ta “ta” this (fem.) → tina “tina” these (fem.)).
from a root and a pattern. It could occur in a simple form
(argaz “argaz” the man), compound form (buhyyuf 4. Light Stemming Algorithm
“buhyyuf” the famine), or derived one (amsawaä The light stemming refers to a process of stripping off a
"amsawad" the communication). This unit varies in gender, number and case.
- Gender: Nouns are categorised by grammatical gender: masculine or feminine. Generally, the masculine begins with an initial vowel a "a", i "i", or u "u", while the feminine, which is also used to form diminutives and singulatives, is marked with the circumfix t…t "t…t" ("amḥḍar" masc., "tamḥḍart" fem., the student).
- Number: There are two types: singular and plural, and the plural has three forms. The external plural consists in changing the initial vowel and adding the suffix n or one of its variants in "in", an "an", yn "yn", wn "wn", awn "awn", iwn "iwn", tn "tn" ("imḥḍarn" masc., "timḥḍarin" fem., the students). The broken plural involves a change in the vowels of the noun (adrar "adrar" mountain → idurar "idurar" mountains, tivmst "tiγmst" tooth → tivmas "tiγmas" teeth). The mixed plural is formed by a combination of a vowel change and, sometimes, the suffixation of n (izi "izi" fly → izan "izan" flies, amggaru "amggaru" last → imggura "imggura" lasts).
- Case: Two cases are distinguished. The free case is unmarked, while the construct one involves a variation of the initial vowel (argaz "argaz" man → urgaz "urgaz", tamvart "tamγart" woman → tmvart "tmγart").

3.2 Verb
The verb, in Amazighe, has two forms: basic and derived. The basic form is composed of a root and a radical (ffv "ffγ" leave), while the derived one is based on the combination of a basic form and one of the following prefix morphemes: s/ss "s/ss", tt "tt" and m/mm "m/mm" (ssufv "ssufγ" bring out). Whether basic or derived, the verb is conjugated in four aspects: aorist, imperfective, perfect, and negative perfect. Moreover, it is constructed using the same personal markers for each mood, as represented in Table 1.

small set of prefixes and/or suffixes, without trying to deal with infixes, or recognizing patterns and finding roots (Larkey, 2002). As a first edition of such work at IRCAM, and given the lack of a large digital corpus, our method is based only on the composition of words, which in standard Moroccan Amazighe is usually a sequence of prefix, core and suffix. We assume no stem dictionary and no exception list. Our algorithm is merely based on an explicit list of prefixes and suffixes that need to be stripped in a certain order. This list is derived from the common inflectional morphemes of gender, number and case for nouns; personal markers, aspect and mood for verbs; and affix pronouns for kinship nouns and prepositions. Derivational morphemes are not included, in order to preserve the semantic meaning of words: it is reasonable to conflate the noun tarbat "tarbat" girl with its masculine form arba "arba" boy, while it seems unreasonable, for an application like information retrieval, to conflate the derived verb ssufv "ssufγ" bring out with the simple form ffv "ffγ" leave.
The set of prefixes and suffixes that we have identified is classified into five groups, ranging from one character to five characters.

4.1 Prefix Set
- One-character: a, i, n, u, t.
- Two-character: na, ni, nu, ta, ti, tu, tt, wa, wu, ya, yi, yu.
- Three-character: itt, ntt, tta, tti.
- Four-character: itta, itti, ntta, ntti, tett.
- Five-character: tetta, tetti.

4.2 Suffix Set
- One-character: a, d, i, k, m, n, v, s, t.
- Two-character: an, at, id, im, in, iv, mt, nv, nt, un, sn, tn, wm, wn, yn.
- Three-character: amt, ant, awn, imt, int, iwn, nin, unt, tin, tnv, tun, tsn, snt, wmt.
- Four-character: tunt, tsnt.
Table 1: Personal markers for the indicative, imperative and participial moods.
Based on this list of affixes and on theoretical analysis, we notice that the proposed Amazighe light stemmer can make two kinds of errors:
- Understemming errors, in which words referring to the same concept are not reduced to the same stem. This is the case of the verb ffv "ffγ" leave, which ends with the character v "γ", coinciding with the 1st person singular marker. The stem ffv "ffγ" obtained for the verb conjugated in the perfect aspect for the 1st person singular, ffvv "ffγγ" I left, will therefore not be conflated with the stem ff "ff" obtained for the 3rd person masculine singular, iffv "iffγ" he left.
- Overstemming errors, in which words are reduced to the same stem even though they refer to distinct concepts, as with the verb g "g" do and the noun aga "aga" bucket. The stem g "g" obtained for the verb conjugated in the perfect aspect for the 3rd person masculine singular, iga "iga" he did, will be conflated with the stem g "g" of the noun aga "aga".
In general, light stemmers avoid overstemming errors, especially for Indo-European languages; this is not the case for Amazighe. This shows that the Amazighe language constitutes a significant challenge for natural language processing.

5. Conclusion
Stemming is an important technique for a highly inflected language such as Amazighe. In this work, we have investigated the characteristics of the Amazighe language and presented a light stemming approach for it. We should note that the proposed stemming algorithm primarily handles inflections; it does not handle derivational suffixes, for which one would need a proper morphological analyzer.
In an attempt to improve the Amazighe light stemmer, we plan to build a stem dictionary, to elaborate a set of linguistic rules, and to set up a list of exceptions to further extend the stemmer.

6. Appendix

Tifinaghe   Latin        Tifinaghe   Latin
ⴰ           a            ⵍ           l
ⴱ           b            ⵎ           m
ⴳ           g            ⵏ           n
ⴳⵯ          gw           ⵓ           u
ⴷ           d            ⵔ           r
ⴹ           ḍ            ⵕ           ṛ
ⴻ           e            ⵖ           γ
ⴼ           f            ⵙ           s
ⴽ           k            ⵚ           ṣ
ⴽⵯ          kw           ⵛ           c
ⵀ           h            ⵜ           t
ⵃ           ḥ            ⵟ           ṭ
ⵄ           ε            ⵡ           w
ⵅ           x            ⵢ           y
ⵇ           q            ⵣ           z
ⵉ           i            ⵥ           ẓ
ⵊ           j
Table 2: Tifinaghe-IRCAM alphabet (Tifinaghe–Latin correspondence)

7. References
Al-Shammari, E. T., Lin, J. (2008). Towards an error-free Arabic stemming. In Proceedings of the 2nd ACM Workshop on Improving Non-English Web Searching, pp. 9--16.
Ameur, M., Bouhjar, A., Boukhris, F., Boukouss, A., Boumalk, A., Elmedlaoui, M., Iazzi, E. M., Souifi, H. (2004). Initiation à la langue amazighe. Rabat: IRCAM.
Boukhris, F., Boumalk, A., Elmoujahid, E., Souifi, H. (2008). La nouvelle grammaire de l'amazighe. Rabat: IRCAM.
Larkey, L. S., Ballesteros, L., Connell, M. (2002). Improving Stemming for Arabic Information Retrieval: Light Stemming and Cooccurrence Analysis. In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval. Tampere, Finland, pp. 275--282.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1), pp. 22--31.
Paternostre, M., Francq, P., Lamoral, J., Wartel, D., Saerens, M. (2002). Carry, un algorithme de désuffixation pour le français. Rapport technique du projet Galilei.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), pp. 130--137.
Savoy, J. (1993). Stemming of French words based on grammatical categories. Journal of the American Society for Information Science, 44(1), pp. 1--9.
Taghva, K., Elkhoury, R., Coombs, J. (2005). Arabic stemming without a root dictionary. In Proceedings of Information Technology: Coding and Computing. Las Vegas, pp. 152--157.
Using Mechanical Turk to Create a Corpus of Arabic Summaries
Mahmoud El-Haj, Udo Kruschwitz, Chris Fox
School of Computer Science and Electronic Engineering
University of Essex
Colchester, CO4 3SQ
United Kingdom
{melhaj,udo,foxcj}@essex.ac.uk
Abstract
This paper describes the creation of a human-generated corpus of extractive Arabic summaries of a selection of Wikipedia and Arabic
newspaper articles using Mechanical Turk—an online workforce. The purpose of this exercise was two-fold. First, it addresses a shortage
of relevant data for Arabic natural language processing. Second, it demonstrates the application of Mechanical Turk to the problem of
creating natural language resources. The paper also reports on a number of evaluations we have performed to compare the collected
summaries against results obtained from a variety of automatic summarisation systems.
1 http://www.ircs.upenn.edu/arabic/
2 http://ufal.mff.cuni.cz/padt/PADT 1.0/
3 http://www.mturk.com
4 http://www.wikipedia.com
5 http://www.alrai.com
6 http://www.alwatan.com.sa
1. They contain real text as would be written and used by native speakers of Arabic.
2. They are written by many authors from different backgrounds.
3. They cover a range of topics from different subject areas (such as politics, economics, and sports), each with a credible amount of data.

The Wikipedia documents were selected by asking a group of students to search the Wikipedia website for arbitrary topics of their choice within given subject areas. The subject areas were: art and music; the environment; politics; sports; health; finance and insurance; science and technology; tourism; religion; and education. To obtain a more uniform distribution of articles across topics, the collection was then supplemented with newspaper articles that were retrieved from a bespoke information retrieval system using the same queries as were used for selecting the Wikipedia articles. Each document contains on average 380 words.

4. The Human-Generated Summaries
The corpus of extractive document summaries was generated using Mechanical Turk. The documents were published as "Human Intelligence Tasks" (HITs). The assessors (workers) were asked to read and summarise a given article (one article per task) by selecting what they considered to be the most significant sentences that should make up the extractive summary. They were required to select no more than half of the sentences in the article. Using this method, five summaries were created for each article in the collection. Each of the summaries for a given article was generated by a different worker.
In order to verify that the workers were properly engaged with the articles, and to provide a measure of quality assurance, each worker was asked to provide up to three keywords as an indicator that they had read the article and did not select random sentences. In cases where a worker appeared to select random sentences, the summary is still considered part of the corpus, to avoid the risk of subjective bias.
The primary output of this project is the corpus of 765 human-generated summaries that we obtained, which is now available to the community.7 To set the results in context, and to illustrate its use, we also conducted a number of evaluations.

5. Evaluations
To illustrate the use of the human-generated summaries from Mechanical Turk in the evaluation of automatic summarisation, we created extractive summaries of the same set of documents using a number of systems, namely:

Sakhr: an online Arabic summariser.8
AQBTSS: a query-based document summariser based on the vector space model that takes an Arabic document and a query (in this case the document's title) and returns an extractive summary (El-Haj and Hammo, 2008; El-Haj et al., 2009).
Gen-Summ: similar to AQBTSS except that the query is replaced by the document's first sentence.
LSA-Summ: similar to Gen-Summ, but where the vector space is transformed and reduced by applying Latent Semantic Analysis (LSA) to both document and query (Dumais et al., 1988).
Baseline-1: the first sentence of a document.

The justification for selecting the first sentence in Baseline-1 is the belief that in Wikipedia and news articles the first sentence tends to contain information about the content of the entire article, and is often included in extractive summaries generated by more sophisticated approaches (Baxendale, 1958; Yeh et al., 2008; Fattah and Ren, 2008; Katragadda et al., 2009).
When using Mechanical Turk on other NLP tasks, it has been shown that aggregation of multiple independent annotations from non-experts can approximate expert judgement (Snow et al., 2008; Callison-Burch, 2009; Albakour et al., 2010, for example). For this reason, we evaluated the results of the systems not against the raw results of Mechanical Turk, but against derived gold standard summaries, generated by further processing and analysis of the human-generated summaries.
The aggregation of the summaries can be done in a number of ways. To obtain a better understanding of the impact of the aggregation method on the results of the evaluation, we constructed three different gold standard summaries for each document. First of all, we selected all those sentences identified by at least three of the five annotators (we call this the Level 3 summary). We also created a similar summary which includes all sentences that have been identified by at least two annotators (called Level 2). Finally, each document has a third summary that contains all sentences identified by any of the annotators for this document (called All). This last kind of summary will typically contain outlier sentences. For this reason, only the first two kinds of aggregated summaries (Level 2 and Level 3) should really be viewed as providing genuine gold standards. The third one (All) is considered here just for the purposes of providing a comparison.
A variety of evaluation methods have been developed for summarisation systems. As we are concerned with extractive summaries, we will concentrate on results obtained from applying Dice's coefficient (Manning and Schütze, 1999), although we will briefly discuss results from the N-gram and substring-based methods ROUGE (Lin, 2004) and AutoSummENG (Giannakopoulos et al., 2008).

7 http://privatewww.essex.ac.uk/~melhaj/easc.htm
8 http://www.sakhr.com
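As a concrete illustration of the aggregation scheme just described and of the sentence-level comparison used in Section 5.1 below, the sketch derives the All, Level 2 and Level 3 gold standards from five workers' sentence selections and scores a system's selection with Dice's coefficient; representing selections as sets of sentence indices is an assumption made for the example, not a detail from the paper.

# Sketch: derive gold-standard extractive summaries from five workers'
# sentence selections and score a system summary with Dice's coefficient.
# Sentence selections are represented as sets of sentence indices (assumption).
from collections import Counter


def gold_standards(worker_selections):
    """Return the All, Level 2 and Level 3 aggregated summaries."""
    votes = Counter(i for sel in worker_selections for i in sel)
    return {
        "All": {i for i, c in votes.items() if c >= 1},
        "Level 2": {i for i, c in votes.items() if c >= 2},
        "Level 3": {i for i, c in votes.items() if c >= 3},
    }


def dice(a, b):
    """Dice's coefficient between two sets of selected sentences."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    workers = [{0, 2, 3}, {0, 1}, {0, 3, 5}, {2, 3}, {0, 3}]
    gold = gold_standards(workers)
    system = {0, 3}   # e.g. a system that selects the first sentence plus one more
    for level, sentences in gold.items():
        print(level, sorted(sentences), round(dice(system, sentences), 2))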
           Sakhr     AQBTSS    Gen-Summ   LSA-Summ   Baseline-1
All        39.07%    32.80%    39.51%     39.23%     25.34%
Level 2    48.49%    39.90%    48.95%     50.09%     26.84%
Level 3    43.40%    38.86%    43.39%     42.67%     40.86%
Table 1: Dice's coefficient between the aggregated gold-standard summaries (All, Level 2, Level 3) and each system's summaries.
5.1. Dice's Coefficient
We used Dice's coefficient to judge the similarity of the sentence selections in the gold-standard extractive summaries — derived from the human-generated Mechanical Turk summaries — with those generated by Sakhr, AQBTSS, Gen-Summ, LSA-Summ and Baseline-1 (Table 1). Statistically significant differences can be observed in a number of cases, but we will concentrate on some more general observations.
We observe that the commercial system Sakhr, as well as the systems that build a summary around the first sentence, most closely approximate the gold standards, i.e. Level 2 and Level 3. This is perhaps not surprising as the overlap with the document's first sentence has been shown to be a significant feature in many summarisers (Yeh et al., 2008; Fattah and Ren, 2008).
It is interesting to note that summaries consisting of a single sentence only (i.e. Baseline-1) do not score particularly well. That suggests that the first sentence is important but not sufficient for a good summary. When comparing Baseline-1 with the Level 2 and Level 3 summaries, respectively, we also note how the "wisdom of the crowd" seems to converge on the first sentence as a core part of the summary.
Finally, the system that most closely approximates our Level 2 gold standard uses LSA, a method shown to work effectively in various NLP and IR tasks including summarisation, e.g. (Steinberger and Ježek, 2004; Gong and Liu, 2001).
We also compared the baseline systems with each other (Table 2). This is to get an idea of how closely the summaries each of these systems produces correlate with each other. The results suggest that the system that extracts the first sentence only does not correlate well with any of the other systems. At the same time, we observe that Gen-Summ and LSA-Summ generate summaries that are highly correlated. This explains the close similarity when comparing each of these systems against the gold standards (see Table 1). It also demonstrates (not surprisingly) that the difference between a standard vector space approach and LSA is not great for the relatively short documents in a collection of limited size.

5.2. Other Evaluation Methods
In addition to using Dice's coefficient, we also applied the ROUGE (Lin, 2004) and AutoSummENG (Giannakopoulos et al., 2008) evaluation methods.
In our experiments with AutoSummENG we obtained values for "CharGraphValue" in the range 0.516–0.586. This indicates how much the graph representation of a model summary overlaps with a given peer summary, taking into account how many times two N-grams are found to be neighbours. Gen-Summ and LSA-Summ gave the highest values, indicating that they produce results more similar to our gold standard summaries than what Sakhr and AQBTSS produced.
When applying ROUGE we considered the results of ROUGE-2, ROUGE-L, ROUGE-W, and ROUGE-S, which have been shown to work well in single-document summarisation tasks (Lin, 2004). In line with the results discussed above, LSA-Summ and Gen-Summ performed better on average than the other systems in terms of recall, precision and F-measure (when using Level 2 and Level 3 summaries as our gold standards). Regarding the other systems, they all performed better than Baseline-1.
These results should only be taken to be indicative. Dice's coefficient appears to be a better method for extractive summaries as we are comparing summaries at the sentence level. It is however worth noting that the main results obtained from Dice's coefficient are in line with results from ROUGE and AutoSummENG.

6. Conclusions and Future Work
We have demonstrated how gold-standard summaries can be extracted using the "wisdom of the crowd".
Using Mechanical Turk has allowed us to produce a resource for evaluating Arabic extractive summarisation techniques at relatively low cost. This resource is now available to the community. It will provide a useful benchmark for those developing Arabic summarisation tools. The aim of the work described here was to create a relatively small but usable resource. We provided some comparison with alternative summarisation systems for Arabic. We have deliberately made no attempt at judging the individual quality of each system. How this resource will be used and how effectively it can be applied remains the task of the users of this corpus.

7. References
M-D. Albakour, U. Kruschwitz, and S. Lucas. 2010. Sentence-level attachment prediction. In Proceedings of the 1st Information Retrieval Facility Conference, Lecture Notes in Computer Science 6107, Vienna. Springer.
M. Alghamdi, M. Chafic, and M. Mohamed. 2009. Arabic language resources and tools for speech and natural language: Kacst and Balamand. In 2nd International Conference on Arabic Language Resources & Tools, Cairo, Egypt.
P. B. Baxendale. 1958. Machine-made index for technical literature—an experiment. IBM Journal of Research and Development, 2.
C. Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 286–295. Association for Computational Linguistics.
M. Diab, K. Hacioglu, and D. Jurafsky. 2007. Automatic Processing of Modern Standard Arabic Text. In A. Soudi, A. van den Bosch, and G. Neumann, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, Text, Speech and Language Technology, pages 159–179. Springer Netherlands.
S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. 1988. Using latent semantic analysis to improve access to textual information. In CHI '88: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 281–285. ACM.
M. El-Haj and B. Hammo. 2008. Evaluation of query-based Arabic text summarization system. In Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE'08, pages 1–7, Beijing, China. IEEE Computer Society.
M. El-Haj, U. Kruschwitz, and C. Fox. 2009. Experimenting with Automatic Text Summarization for Arabic. In Proceedings of the 4th Language and Technology Conference (LTC'09), pages 365–369, Poznań, Poland.
M. A. Fattah and Fuji Ren. 2008. Automatic text summarization. In Proceedings of World Academy of Science, volume 27, pages 192–195. World Academy of Science.
M. Fiszman, D. Demner-Fushman, H. Kilicoglu, and T. C. Rindflesch. 2009. Automatic summarization of medline citations for evidence-based medical treatment: A topic-oriented evaluation. Journal of Biomedical Informatics, 42(5):801–813.
G. Giannakopoulos, V. Karkaletsis, G. Vouros, and P. Stamatopoulos. 2008. Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP), 5(3):1–39.
Y. Gong and X. Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–25. ACM.
S. P. Hobson, B. J. Dorr, C. Monz, and R. Schwartz. 2007. Task-based evaluation of text summarization using relevance prediction. Information Processing & Management, 43(6):1482–1499.
R. Katragadda, P. Pingali, and V. Varma. 2009. Sentence position revisited: a robust light-weight update summarization 'baseline' algorithm. In CLIAWS3 '09: Proceedings of the Third International Workshop on Cross Lingual Information Access, pages 46–52, Morristown, NJ, USA. ACL.
C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pages 25–26.
F. Liu and Y. Liu. 2008. Correlation between ROUGE and human evaluation of extractive meeting summaries. In HLT '08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 201–204. ACL.
H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research Development, 2(2):159–165.
B. Maegaard, M. Atiyya, K. Choukri, S. Krauwer, C. Mokbel, and M. Yaseen. 2008. MEDAR: Collaboration between European and Mediterranean Arabic partners to support the development of language technology for Arabic. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco.
C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL'02). Association for Computational Linguistics.
R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. 2008. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.
J. Steinberger and K. Ježek. 2004. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings of the 5th International Conference on Information Systems Implementation and Modelling (ISIM), pages 93–100.
J.-Y. Yeh, H.-R. Ke, and W.-P. Yang. 2008. iSpreadRank: Ranking sentences for extraction-based summarization using feature weight propagation in the sentence similarity network. Expert Systems with Applications, 35(3):1451–1462.
DefArabicQA: Arabic Definition Question Answering System
Omar Trigui1, Lamia Hadrich Belguith1 , Paolo Rosso2
1
ANLP Research Group- MIRACL Laboratory, University of Sfax, Tunisia
2
Natural Language Engineering Lab. - ELiRF, Universidad Politécnica de Valencia, Spain
{omar.trigui,l.belguith}@fsegs.rnu.tn, prosso@dsic.upv.es
Abstract
Today the Web is the largest resource of knowledge, and this sometimes makes it difficult to find precise information. Current search engines can only return ranked snippets containing the answers to a user query; they cannot return the exact answers. Question Answering systems offer a way to obtain effective and exact answers to a user question asked in natural language rather than as a keyword query. Unfortunately, the Question Answering task for the Arabic language has not been investigated enough in the last decade, compared to other languages. In this paper, we tackle the definition Question Answering task for the Arabic language. We propose an Arabic definitional Question Answering system based on a pattern approach to identify exact and accurate definitions about organizations using Web resources. We experimented with this system using 2000 snippets returned by the Google search engine and the Arabic version of Wikipedia, and a set of 50 organization definition questions. The obtained results are very encouraging: 90% of the questions used have complete (vital) definitions in the top-five answers and 64% of them have complete definitions in the top-one answer. The MRR was 0.81.
al., 2002) system's. Their evaluation was based on a set of 25 documents from the Web and 12 questions. (Benajiba et al., 2007a) developed 'ArabiQA', a factual QA system. They employed Arabic-JIRS4 (Benajiba et al., 2007b), a passage retrieval system, to search for the relevant passages. They also used the named entity system ANERsys (Benajiba et al., 2007c) to identify and classify named entities within the retrieved passages. The test set consists of 200 questions and 11,000 documents from the Arabic version of Wikipedia. They reached a precision of 83.3% (Benajiba et al., 2007a). (Brini et al., 2009) developed a prototype to build an Arabic factual Question Answering system using the Nooj platform5 to identify answers from a set of education books. Most of the research cited above has not made its test bed publicly available, which makes it impossible to compare the evaluation results.
As we have already said, there is no research focused on definitional QA systems for the Arabic language. Therefore, we have considered that an effort needs to be made in this direction. We built an Arabic QA system, which we named DefArabicQA, that identifies and extracts the answers (i.e., exact definitions) from Web resources. Our approach is inspired by research that has obtained good results in TREC experiments. Among this research we cite the work of (Grunfeld & Kwok, 2006), which is based on techniques from IR, pattern matching and metakeyword detection, with little linguistic analysis and no natural language understanding.

3. The DefArabicQA system
The architecture of the DefArabicQA system is illustrated in Figure 1. From a general viewpoint, the system is composed of the following components: i) question analysis, ii) passage retrieval, iii) definition extraction and iv) ranking of candidate definitions.
This system does not use any sophisticated syntactic or semantic techniques, as those used for factual QA systems do (Hammo et al., 2002; Benajiba et al., 2007).

3.1 Question analysis
This module is a vital component of DefArabicQA. The result of this module is the identification of the question topic (i.e., a named entity) and the deduction of the expected answer type. The question topic is identified by using two lexical question patterns (Table 1), and the expected answer type is deduced from the interrogative pronoun of the question.

Question pattern                                        Expected answer type
Who+be+<topic>?    من هو | من هي <الموضوع>؟             Person
What+be+<topic>?   ما هو | ما هي <الموضوع>؟             Organization
Table 1. Question patterns and their expected answer types used by the DefArabicQA system

3.2 Passage retrieval
The passage retrieval module collects the top-n snippets retrieved by the Web search engine. The specific query is constituted of the question topic identified by the question analysis module. After collecting the top-n snippets, only those snippets containing the complete question topic are kept, on the basis of some heuristics (e.g. the length of a snippet must be more than 13 characters).
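A minimal sketch of the question analysis and snippet filtering steps just described is given below; the two patterns follow Table 1, while the regular expressions, the function names and the exact handling of the 13-character heuristic are illustrative assumptions rather than the system's actual implementation.

# Sketch: identify the question topic and expected answer type with the two
# lexical patterns of Table 1, then keep only snippets that contain the
# complete topic and are longer than 13 characters (heuristic from Sec. 3.2).
import re

# "Who is <topic>?" -> Person ; "What is <topic>?" -> Organization
QUESTION_PATTERNS = [
    (re.compile(r"^من (?:هو|هي)\s+(.+?)\s*؟?$"), "Person"),
    (re.compile(r"^ما (?:هو|هي)\s+(.+?)\s*؟?$"), "Organization"),
]


def analyse_question(question):
    """Return (topic, expected answer type) or (None, None) if no pattern fits."""
    for pattern, answer_type in QUESTION_PATTERNS:
        match = pattern.match(question.strip())
        if match:
            return match.group(1), answer_type
    return None, None


def filter_snippets(snippets, topic):
    """Keep snippets that contain the complete topic and exceed 13 characters."""
    return [s for s in snippets if topic in s and len(s) > 13]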
… topic in a snippet is recognized by a specific pattern.
… centroid vector and in the candidate definition CD_i, 1 ≤ k ≤ n, and f_ik is the frequency of word_k.
definition, 2% by the fifth candidate definition, as shown in Table 2. The system missed 18% of the questions, as shown in Table 3. The MRR was equal to 0.70, as shown in Table 4.

4.2 Results of the second experiment
The main goal of the second experiment is to measure the value added by the Web resource Wikipedia to the results obtained in the first experiment with the Google search engine. In this experiment, we used the same set of questions as in the first experiment, with the Google search engine and Wikipedia as Web resources. Out of the 50 questions in the test collection, 45 questions (90%) were answered correctly by complete definitions in the top-five candidate definitions. 64% of the questions were answered by the first returned candidate definition, 16% by the second candidate definition, 4% by the third candidate definition, 2% by the fourth candidate definition and 4% by the fifth candidate definition, as shown in Table 2. The system missed 10% of the questions, as shown in Table 3. The obtained value of MRR is 0.81 (see Table 4).

             Experiment I    Experiment II
Rank 1st     27 (54%)        32 (64%)
Rank 2nd     7 (14%)         8 (16%)
Rank 3rd     3 (6%)          2 (4%)
Rank 4th     3 (6%)          1 (2%)
Rank 5th     1 (2%)          2 (4%)
Top-five     41 (82%)        45 (90%)
Table 2. Rate of the answered questions for each rank (the top-5 positions)

             Experiment I    Experiment II
Top-5        9 (18%)         5 (10%)
Table 3. Rate of non-answered questions (in the top-5 positions)

             Experiment I    Experiment II
MRR          0.70            0.81
Table 4. MRR values for both experiments

5. Discussion
The two experiments cited above showed that our approach, as applied in the DefArabicQA system, returned reasonably good results. The Web resource Wikipedia improved the results of DefArabicQA when it was coupled with Google in the second experiment. The MRR increased from 0.70 (in the first experiment) to 0.81 (in the second experiment), and the rate of non-answered questions in the top-5 positions decreased from 18% (in the first experiment) to 10% (in the second experiment). Also, the rate of questions answered by the first returned candidate definition increased from 54% (in the first experiment) to 64% (in the second experiment).

6. Conclusion and future work
In this paper we proposed a definitional Question Answering system called DefArabicQA. This system provides effective and exact answers to definition questions expressed in the Arabic language from Web resources. It is based on an approach which employs little linguistic analysis and no language understanding capability. DefArabicQA identifies candidate definitions by using a set of lexical patterns, filters these candidate definitions by using heuristic rules, and ranks them by using a statistical approach.
Two evaluation experiments have been carried out on DefArabicQA. The first experiment was based on Google as a Web resource and obtained an MRR equal to 0.70 and a rate of questions answered by the first answer equal to 54%, while the second experiment was based on Google coupled with Wikipedia as Web resources. In this experiment, we obtained an MRR equal to 0.81 and a rate of questions answered by the first answer equal to 64%. 50 definition questions were used for both experiments.
As future work, we plan to improve the quality of the definitions when they are truncated. Indeed, in some cases, a few words are missing at the end of the definition answer. This is due to the fact that the snippet itself is truncated. As a solution, we will download the original Web page and segment the useful snippet correctly using a tokenizer. We also plan to conduct an empirical study to determine different weights for the three criteria used for ranking the candidate definitions. These weights will reflect the importance of each criterion.

Acknowledgments
This research work started thanks to the bilateral Spain-Tunisia research project on "Answer Extraction for Definition Questions in Arabic" (AECID-PCI B/017961/08). The work of the third author was carried out in the framework of the AECID-PCI C/026728/09 and TIN2009-13391-C04-03 research projects.

References
Benajiba, Y., Rosso, P., and Lyhyaoui, A. (2007a). Implementation of the ArabiQA Question Answering System's Components. In Proceedings of Workshop on Arabic Natural Language Processing, 2nd Information Communication Technologies Int. Symposium.
Benajiba, Y., Rosso, P., and J. M. Gomez. (2007b). Adapting the JIRS Passage Retrieval System to the Arabic Language. In Proceedings of the CICLing conference, Springer-Verlag, 2007, pages 530--541.
Techniques for Arabic Morphological Detokenization
and Orthographic Denormalization
Ahmed El Kholy and Nizar Habash
Center for Computational Learning Systems, Columbia University
475 Riverside Drive New York, NY 10115
{akholy,habash}@ccls.columbia.edu
Abstract
The common wisdom in the field of Natural Language Processing (NLP) is that orthographic normalization and morphological tok-
enization help in many NLP applications for morphologically rich languages like Arabic. However, when Arabic is the target output,
it should be properly detokenized and orthographically correct. We examine a set of six detokenization techniques over various tok-
enization schemes. We also compare two techniques for orthographic denormalization. We discuss the effect of detokenization and
denormalization on statistical machine translation as a case study. We report on results which surpass previously published efforts.
[cnj+ [prt+ [art+ BASE +pro]]]

At the deepest level, the BASE can have either the definite article (Al+ 'the') or a member of the class of pronominal enclitics, +pro (e.g., +hm 'their/them'). Next comes the class of particle proclitics (prt+), e.g., l+ 'to/for'. At the shallowest level of attachment we find the conjunction proclitic (cnj+), e.g., w+ 'and'. The attachment of clitics to word forms is not a simple concatenation process. There are several orthographic and morphological adjustment rules that are applied to the word. An almost complete list of the rules relevant to this paper is presented and exemplified in Table 1.
It is important to make the distinction here between simple word segmentation, which splits off word substrings with no orthographic/morphological adjustments, and tokenization, which does. Although segmentation by itself can have important advantages, it leads to the creation of inconsistent or ambiguous word forms: consider the words mktb~ 'library' and mktbthm 'their library'. A simple segmentation of the second word creates the non-word string mktbt; however, applying adjustment rules as part of the tokenization generates the same form of the basic word in the two cases. For more details, see (Habash, 2007). In this paper, we do not explore morphological tokenization beyond decliticization.

4. Approach
We would like to study the value of a variety of detokenization techniques over different tokenization schemes and orthographic normalization. We report results on naturally

• Rule-based (R): use deterministic rules to handle all of the cases described in Table 1. We pick the most frequent decision for ambiguous cases.
• Table-based (T): use a lookup table mapping tokenized forms to detokenized forms. The table is based on pairs of tokenized and detokenized words from our language model data which had been processed by MADA. We pick the most frequent decision for ambiguous cases. Words not in the table are handled with the (S) technique. This technique essentially selects the detokenized form with the highest conditional probability P(detokenized|tokenized).
• Table+Rule (T+R): same as (T) except that we back off to (R), not (S).
The above four techniques are the same as those used by Badr et al. (2008). We introduce two new techniques that use a 5-gram untokenized-form language model and the disambig utility in the SRILM toolkit (Stolcke, 2002) to decide among different alternatives:
• T+LM: we use all the forms in the (T) approach. Alternatives are given different conditional probabilities, P(detokenized|tokenized), derived from the tables. Backoff is the (S) technique. This technique essentially selects the detokenized form with the highest P(detokenized|tokenized) × P_LM(detokenized).
• T+R+LM: same as (T+LM), but with (R) as backoff.
Table 1: Orthographic and morphological adjustment rules with transliterated examples.
Rule name           Condition       Result          Example
Definite Article    l+Al+...        ll+...          l+Al+mktb → llmktb 'for the office'; l+Al+ljn~ → lljn~ 'for the committee'
Ta-Marbuta          -~ +pron        -t +pron        mktb~+hm → mktbthm 'their library'
Alif-Maqsura        -ý +pron        -A +pron        rwY+h → rwAh 'he watered it'
  exceptionally                     -y +pron        ςlY+h → ςlyh 'on him'
Waw-of-Plurality    -wA +pron       -w +pron        ktbwA+h → ktbwh 'they wrote it'
                    -tm +pron       -tmw +pron      ktbtmw+h → ktbtmwh 'you [pl.] wrote it'
Hamza               -' +pron        -ŷ +pron        bhA'+h → bhAŷh 'his glory [gen.]'
  less frequently                   -ŵ +pron        bhA'+h → bhAŵh 'his glory [nom.]'
  less frequently                   -' +pron        bhA'+h → bhA'h 'his glory [acc.]'
Y-Shadda            -y +y           -y              qADy+y → qADy 'my judge'
N-Assimilation      mn +m/n         m +m/n          mn+mA → mmA 'from which'
                    ςn +m/n         ς +m/n          ςn+mn → ςmn 'about whom'
                    Ân +lA          ÂlA             Ân+lA → ÂlA 'that ... not'
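Two of the adjustment rules in Table 1 (Definite Article and Ta-Marbuta) can be expressed as simple string rewrites over the transliteration when gluing clitics back onto the base word, as in the sketch below; this is an illustrative fragment of a rule-based (R) detokenizer, not the authors' implementation.

# Sketch: two of the adjustment rules of Table 1, applied when gluing clitics
# back onto the base word. Tokens use a Buckwalter-style transliteration with
# '+' marking clitic boundaries.

def join_clitics(tokens):
    """Concatenate proclitics, base and enclitics with two adjustment rules."""
    word = ""
    for tok in tokens:
        if tok.endswith("+"):                  # proclitic, e.g. l+, Al+
            word += tok[:-1]
        elif tok.startswith("+"):              # enclitic, e.g. +hm, +h
            if word.endswith("~"):             # Ta-Marbuta rule: ~ -> t
                word = word[:-1] + "t"
            word += tok[1:]
        else:                                  # the base word
            word += tok
    # Definite Article rule: l+Al+... -> ll...
    if word.startswith("lAl"):
        word = "ll" + word[3:]
    return word


if __name__ == "__main__":
    print(join_clitics(["l+", "Al+", "mktb"]))   # -> llmktb 'for the office'
    print(join_clitics(["mktb~", "+hm"]))        # -> mktbthm 'their library'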
Table 2: A comparison of the different tokenization schemes studied in this paper in terms of their definition, the relative change from no-tokenization (D0) in tokens (Token#) and enriched and reduced word types (ENR Type# and RED Type#), MADA's error rate in producing the enriched tokens, the reduced tokens and just segmentation (SEG); the out-of-vocabulary (OOV) rate; and finally the perplexity value associated with different tokenizations. OOV rates and perplexity values are measured against the NIST MT04 test set, while prediction error rates are measured against a Penn Arabic Treebank devset.
Detokenization   ENR      RED      MADA-ENR   Joint-DETOK-ENR
S                38.36    35.53    39.73
R                1.41     3.03     10.59
T                1.37     1.54     8.92       9.46
T+R              0.79     0.95     8.68       9.22
T+LM             1.20     1.29     9.34       6.23
T+R+LM           0.62     0.71     7.39       5.89
Table 4: Detokenization and enrichment results for the D3 tokenization scheme in terms of sentence-level detokenization error rate.
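Reading Table 4's metric as the fraction of sentences whose automatic detokenization does not exactly match the reference detokenization, the computation can be sketched as follows (this reading of the caption is an assumption).

# Sketch: sentence-level detokenization error rate, i.e. the percentage of
# sentences where the detokenized hypothesis does not exactly match the
# reference sentence.

def sentence_error_rate(hypotheses, references):
    """Percentage of sentences with at least one detokenization mismatch."""
    assert len(hypotheses) == len(references)
    errors = sum(1 for hyp, ref in zip(hypotheses, references) if hyp != ref)
    return 100.0 * errors / len(references)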
tuning is based on the tokenized Arabic without detokenization. We use a maximum phrase length of 8 for all experiments. We report results on the 2005 NIST MT evaluation set (MT05). These test sets were created for Arabic-English MT and have 4 English references. We use only one Arabic reference in the reverse direction for both tuning and testing. We evaluate using BLEU-4 (Papineni et al., 2002), although we are aware of its caveats (Callison-Burch et al., 2006).

5.3.2. Tokenization Experiments

System         ENR                 RED
Evaluation     ENR       RED       ENR       RED
D0             24.63     24.67     24.66     24.71
D1             25.92     25.99     26.06     26.12
D2             26.41     26.49     26.06     26.15
TB             26.46     26.51     26.73     26.80
S2             25.71     25.76     26.11     26.19
D3             25.68     25.75     25.03     25.10
Table 5: Comparing different tokenization schemes for statistical MT in BLEU scores over detokenized Arabic (using the T+R+LM technique)

We compare the performance of the different tokenization schemes and normalization conditions. The results are presented in Table 5 using the T+R+LM detokenization technique. The best performer across all conditions is the TB scheme. The previously reported best performer was S2 (Badr et al., 2008), which was only compared against the D0 and D3 tokenizations. Our results are consistent with Badr et al. (2008)'s results regarding D0 and D3. However, our TB result outperforms S2. The differences between TB and all other conditions are statistically significant above the 95% level. Statistical significance is computed using paired bootstrap resampling (Koehn, 2004). Training over RED Arabic and then enriching its output sometimes yields better results than training on ENR directly, which is the case with the TB tokenization scheme. However, sometimes the opposite is true, as demonstrated in the D3 results. This is due to the tradeoff between the quality of translation and the quality of detokenization, which is discussed in the next section.

5.3.3. Detokenization Experiments
We measure the performance of the different detokenization techniques discussed in Section 4 against the SMT output for the TB tokenization scheme. We report results in terms of BLEU scores in Table 6. The results for basic ENR and RED detokenization are in columns two and three. Column four presents the results for the Joint-DETOK-ENR approach to joint enriching and detokenization of tokenized reduced output discussed in Section 4.
When comparing Table 6 (in BLEU scores) with the corresponding cells in Table 4 (in sentence-level detokenization error rate), we observe that the wide range of performance in Table 4 is not reflected in the BLEU scores in Table 6. This is expected given the different natures of the tasks and metrics used. Although the various detokenization techniques do not preserve their relative order completely, the S technique remains the worst performer and T+R+LM remains the best in both tables. However, the R and T+LM techniques perform relatively much better with MT output than they do with naturally occurring text. The most interesting observation is perhaps that, under the best performing T+R+LM technique, joint detokenization and enrichment (Joint-DETOK-ENR) outperforms ENR detokenization despite the fact that Joint-DETOK-ENR has over nine times the error rate in Table 4. This shows that improved MT quality using RED training data outweighs the lower quality of automatic enrichment.
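The significance test used above is paired bootstrap resampling (Koehn, 2004); the sketch below shows its general shape for per-sentence scores, with the corpus-level BLEU computation simplified to a sum of sentence-level scores and the resample count chosen arbitrarily.

# Sketch of paired bootstrap resampling (Koehn, 2004): repeatedly resample test
# sentences with replacement and count how often system A outscores system B.
# The per-sentence scores stand in for a corpus-level metric such as BLEU.
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Return the fraction of resamples in which system A outscores system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples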
Detokenization   ENR      RED      Joint-DETOK-ENR
S                25.57    26.04    N/A
R                26.45    26.78    N/A
T                26.40    26.78    22.44
T+R              26.40    26.78    22.44
T+LM             26.46    26.80    26.73
T+R+LM           26.46    26.80    26.73
Table 6: BLEU scores for SMT outputs with different detokenization techniques over the TB tokenization scheme

5.3.4. SMT Detokenization Error Analysis
Since we do not have a gold detokenization reference for our MT output, we automatically identify detokenization errors resulting in non-words (i.e., invalid words). We analyze the SMT output for the D3 tokenization scheme and the T+R+LM detokenization technique using the morphological analyzer component in the MADA toolkit,4 which provides all possible morphological analyses for a given word and identifies words with no analysis. We find 94 cases of words with no analysis out of 27,151 words (0.34%), appearing in 84 sentences out of 1,056 (7.9%). Most of the errors come from producing incompatible sequences of clitics, such as having a definite article with a pronominal clitic. For instance, the tokenized word Al+ςlAq~+nA 'the+relation+our' is detokenized to AlςlAqtnA, which is grammatically incorrect. This is not a detokenization problem per se but rather an MT error. Such errors could still be addressed with specific detokenization extensions, such as removing either the definite article or the pronominal clitic.

6. Conclusions and Future Work
We presented experiments studying six detokenization techniques to produce orthographically correct and enriched Arabic text. We presented results on naturally occurring Arabic text and MT output against different tokenization schemes. The best technique under all conditions is T+R+LM, for both naturally occurring Arabic text and MT output. Regarding enrichment, joint enrichment with detokenization gives better results than performing the two tasks in two separate steps. Moreover, the best setup for MT is training on RED text and then enriching and detokenizing the output using the joint technique.
In the future, we plan to investigate the creation of mappers trained on seen examples in our tables to produce ranked detokenized alternatives for unseen tokenized word forms. In addition, we plan to examine language modeling approaches that target Arabic's complex morphology, such as factored LMs (Bilmes and Kirchhoff, 2003). We also plan to explore ways to make detokenization robust to MT errors.

4 This component uses the databases of the Buckwalter Arabic Morphological Analyzer (Buckwalter, 2004).

7. Acknowledgement
The work presented here was funded by a Google research award. We would like to thank Ioannis Tsochantaridis, Marine Carpuat, Alon Lavie, Hassan Al-Haj and Ibrahim Badr for helpful discussions.

8. References
Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for English-to-Arabic statistical machine translation. In Proceedings of ACL-08: HLT, Short Papers, pages 153–156, Columbus, Ohio, June. Association for Computational Linguistics.
Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the Human Language Technology Conference/North American Chapter of Association for Computational Linguistics (HLT/NAACL-03), pages 4–6, Edmonton, Canada.
T. Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, University of Pennsylvania, 2002. LDC Catalog No.: LDC2004L02, ISBN 1-58563-324-0.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL'06), pages 249–256, Trento, Italy.
F. Diehl, M. J. F. Gales, M. Tomalin, and P. C. Woodland. 2009. Morphological Analysis and Decomposition for Arabic Speech-to-Text Systems. In Proceedings of InterSpeech.
Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan, June. Association for Computational Linguistics.
Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation. In Proceedings of the 7th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL06), pages 49–52, New York, NY.
Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.
Nizar Habash. 2007. Arabic Morphological Representations for Machine Translation. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.
Ilana Heintz. 2008. Arabic language modeling with finite state transducers. In Proceedings of the ACL-08: HLT Student Research Workshop, pages 37–42, Columbus, Ohio, June. Association for Computational Linguistics.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Fed-
erico, N. Bertoldi, B. Cowan, W. Shen, C. Moran,
R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst.
2007. Moses: open source toolkit for statistical machine
translation. In Proceedings of the 45th Annual Meeting
of the Association for Computational Linguistics Com-
panion Volume Proceedings of the Demo and Poster Ses-
sions, pages 177–180, Prague, Czech Republic, June.
Philipp Koehn. 2004. Statistical significance tests for ma-
chine translation evaluation. In Proceedings of the Em-
pirical Methods in Natural Language Processing Con-
ference (EMNLP’04), Barcelona, Spain.
Young-Suk Lee. 2004. Morphological analysis for statisti-
cal machine translation. In Proceedings of the 5th Meet-
ing of the North American Chapter of the Association for
Computational Linguistics/Human Language Technolo-
gies Conference (HLT-NAACL04), pages 57–60, Boston,
MA.
Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wig-
dan Mekki. 2004. The Penn Arabic Treebank : Building
a Large-Scale Annotated Arabic Corpus. In NEMLAR
Conference on Arabic Language Resources and Tools,
pages 102–109, Cairo, Egypt.
Franz Josef Och and Hermann Ney. 2003. A Systematic
Comparison of Various Statistical Alignment Models.
Computational Linguistics, 29(1):19–52.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. 2002. BLEU: a Method for Automatic Evaluation
of Machine Translation. In Proceedings of the 40th An-
nual Meeting of the Association for Computational Lin-
guistics, pages 311–318, Philadelphia, PA.
Ruhi Sarikaya and Yonggang Deng. 2007. Joint
morphological-lexical language modeling for machine
translation. In Human Language Technologies 2007:
The Conference of the North American Chapter of the
Association for Computational Linguistics; Companion
Volume, Short Papers, pages 145–148, Rochester, New
York, April. Association for Computational Linguistics.
Andreas Stolcke. 2002. SRILM - an Extensible Language
Modeling Toolkit. In Proceedings of the International
Conference on Spoken Language Processing (ICSLP),
volume 2, pages 901–904, Denver, CO.
Andreas Zollmann, Ashish Venugopal, and Stephan Vogel.
2006. Bridging the inflection morphology gap for ara-
bic statistical machine translation. In Proceedings of the
Human Language Technology Conference of the NAACL,
Companion Volume: Short Papers, pages 201–204, New
York City, USA. Association for Computational Linguis-
tics.
Tagging Amazigh with AnCoraPipe
Abstract
Over the last few years, Moroccan society has seen a lot of debate about the Amazigh language and culture. The creation of a new governmental institution, namely IRCAM, has made it possible for the Amazigh language and culture to reclaim their rightful place in many domains. Taking into consideration the situation of the Amazigh language, which needs more tools and scientific work to achieve its automatic processing, the aim of this paper is to present the Amazigh language features for a morphology annotation purpose. Put another way, the paper is meant to address the issue of tagging Amazigh with the multilevel annotation tool AnCoraPipe. This tool is adapted to use a specific tagset to annotate Amazigh corpora with a newly defined writing system. This step may well be viewed as the first step towards the automatic processing of the Amazigh language, the main aim at the very beginning being to build a part-of-speech tagger.
ordering and comparison method for comparing character strings and description of the common template tailorable ordering;
- Part 1: general principles governing keyboard layouts of the standard ISO/IEC 9995 related to keyboard layouts for text and office systems.
Most Amazigh words may be conceived of as having consonantal roots. They can have one, two, three or four consonants, and may sometimes extend to five. Words are made out of these roots by following a pattern (Chafiq 1991). For example, the word 'aslmad' is built up from the root lmd "study" by following the pattern as12a3, where the number 1 is replaced by the first consonant of the root, number 2 by the second consonant of the root and number 3 by the third consonant of the root.
Concerning spelling, the system adopted by IRCAM is based on a set of rules and principles applied to "words", along which the parsing of pronounced speech into written, separated words is effected. A grapheme, a written word, according to the spelling system, is a succession of letters, which can sometimes be one letter, delimited by whitespace or punctuation.
The graphic rules for Amazigh words are set out as follows (Ameur et al. 2006a, 2006b, Boukhris et al. 2008):
- Nouns consist of a single word occurring between two blank spaces. To the noun are attached the morphological affixes of gender (masculine/feminine), number (singular/plural) and state (free/construct), as shown in the following examples: ⴰⵎⵣⴷⴰⵖ/ⵜⴰⵎⵣⴷⴰⵖⵜ (amzdaɣ (masc.)/tamzdaɣt (fem.)) "a dweller", ⴰⵎⵣⴷⴰⵖ/ⵉⵎⵣⴷⴰⵖⵏ (amzdaɣ (sing.)/imzdaɣn (plr.)) "dweller/dwellers", and ⴰⵎⵣⴷⴰⵖ/ⵓⵎⵣⴷⴰⵖ (amzdaɣ (free state)/umzdaɣ (construct state)). Kinship names constitute a special class, since they are necessarily determined by possessive markers which form one word with them, for example: ⴱⴰⴱⴰⴽ (babak), which means "your father";
- Quality names/adjectives constitute a single word along with the morphological indicators of gender (masculine/feminine), number (singular/plural), and state (free/construct);
- Verbs are single graphic words along with their inflectional (person, number, aspect) or derivational morphemes. For example: ⵜⵜⴰⵣⵣⵍ /ttazzl/, which means "you run (imperfective)". The verb is separated by a blank space from its predecessor and successor pronouns, i.e.: ⵢⴰⵙⵉ ⵜⵏ / ⴰⴷ ⵜⵏ ⵢⴰⵙⵉ ("yasi tn / ad tn yasi", which means "he took them / he will take them");
- Pronouns are isolated from the words they refer to. Pronouns in Amazigh are demonstrative, exclamative, indefinite, interrogative, personal, possessive, or relative. For instance, ⴰⴷ (ad) in the phrase ⴰⴱⵔⵉⴷ ⴰⴷ (abrid ad), which means "this way", is an example of a demonstrative pronoun;
- An adverb consists of one word which occurs between two blank spaces. Adverbs are divided into adverbs of place, time, quantity, manner, and interrogative adverbs.
- Focus mechanisms, interjections and conjunctions are written in the form of single words occurring between two blank spaces. An example of a conjunction is ⵎⵔ (mr), which means "if";
- Prepositions are always an independent set of characters with respect to the noun they precede; however, if the preposition is followed by a pronoun, both the preposition and the pronoun make a single whitespace-delimited string. For example: ⵖⵔ (ɣr) "to, at" + ⵉ (i) "me" possessive pronoun gives ⵖⴰⵔⵉ/ⵖⵓⵔⵉ (ɣari/ɣuri) "to me, at me, with me";
- Particles are always isolated. There are aspect particles such as ⴰⵇⵇⴰ (aqqa), ⴰⵔ (ar), ⴰⴷ (ad), particles of negation such as ⵓⵔ (ur), orientation particles like ⵏⵏ in ⴰⵡⵉ ⵏⵏ (awi nn) "take it there", and a predicative particle ⴷ (d);
- Determinants always take the form of a single word between two blank spaces. Determiners are divided into articles, demonstratives, exclamatives, indefinite articles, interrogatives, numerals, ordinals, possessives, presentatives and quantifiers. ⴽⵓⵍⵍⵓ (kullu) "all", for instance, is a quantifier;
- Amazigh punctuation marks are similar to the punctuation marks adopted in international languages and have the same functions. Capital letters, nonetheless, occur neither at the beginning of sentences nor at the initial of proper names.
The English terminology used above was extracted from (Boumalk and Naït-Zerrad, 2009).

3. Amazigh corpora

3.1 Amazigh corpora features
Amazigh corpora have the following characteristics:
- They are extracted from geographically circumscribed dialects;
- Some varieties are less represented than others, or not studied at all;
- There is a special need for a more general type of work whose goal is to collect the data of all dialects;
- Existing publications are scattered and inaccessible in most cases. Some of them go back to the XIXth century and the beginning of the XXth century. The few existing copies of those references are only available in specialized libraries, mainly in France;
- General documents containing the data of all Amazigh dialects do not exist (phonetics, semantics, morphology, phraseology, etc.);
- Some existing texts need revision because of segmentation problems.
To constitute an annotated corpus, we have chosen a list of corpora extracted from the Amazigh version of IRCAM's web site1, the periodical Inghmisn n usinag2 (the IRCAM newsletter) and school textbooks. We were able to reach a total number of words above 20k. A comparable quantity of corpora was used in tagging other languages, for example (Allauzen and Bonneau-Maynard, 2008).

1 www.ircam.ma
2 Freely downloadable from http://www.ircam.ma/amz/index.php?soc=bulle
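Returning to the root-and-pattern formation described in Section 2 (the word aslmad built from the root lmd with the pattern as12a3), the substitution can be made concrete with a small sketch; the digit-placeholder notation mirrors the pattern as written in the text and is otherwise an illustrative assumption.

# Sketch: build an Amazigh surface form from a consonantal root and a pattern
# in which digits 1..n are replaced by the root's consonants (e.g. as12a3).

def apply_pattern(root, pattern):
    """Replace digit placeholders in the pattern by the root's consonants."""
    out = []
    for ch in pattern:
        out.append(root[int(ch) - 1] if ch.isdigit() else ch)
    return "".join(out)


if __name__ == "__main__":
    print(apply_pattern("lmd", "as12a3"))   # -> aslmad (the text's example)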
3.2 Writing systems
Amazigh corpora produced up to now are written on the basis of different writing systems; most of them use Tifinaghe-IRCAM (which uses Tifinaghe glyphs but Latin character codes) or Tifinaghe Unicode. It is important to say that texts written in Tifinaghe Unicode are increasingly used.
Even so, we have decided to use a specific writing system based on ASCII characters, for the following reasons:
- To have a common set of characters for annotated corpora;
- To facilitate text treatment for annotators, since ASCII characters are known by all systems;
- To ease its use, given that people are still more familiar with the Arabic and Latin writing systems.
Table 1 shows the correspondences between the different writing systems and the chosen transliteration.

Unicode            Tifinaghe   Latin   Arabic   Tifinaghe-IRCAM   Codes      Chosen character for tagging
U+2D30             ⴰ           a       ا        A, a              65, 97     a
U+2D31             ⴱ           b       ب        B, b              66, 98     b
U+2D33             ⴳ           g       گ        G, g              71, 103    g
U+2D33 & U+2D6F    ⴳⵯ          gʷ      گ        Å, å              197, 229   g°
U+2D37             ⴷ           d       د        D, d              68, 100    d
U+2D39             ⴹ           ḍ       ض        Ä, ä              196, 228   D
U+2D3B             ⴻ           e3               E, e              69, 101    e
U+2D3C             ⴼ           f       ف        F, f              70, 102    f
U+2D3D             ⴽ           k                K, k              75, 107    k
U+2D3D & U+2D6F    ⴽⵯ          kʷ      گ        Æ, æ              198, 230   k
U+2D40             ⵀ           h       ه        H, h              72, 104    h
U+2D43             ⵃ           ḥ       ح        P, p              80, 112    H
U+2D44             ⵄ           ε       ع        O, o              79, 111    E
U+2D45             ⵅ           x       خ        X, x              88, 120    x
U+2D47             ⵇ           q       ق        Q, q              81, 113    q
U+2D49             ⵉ           i       ي        I, i              73, 105    i
U+2D4A             ⵊ           j       ج        J, j              74, 106    j
U+2D4D             ⵍ           l       ل        L, l              76, 108    l
U+2D4E             ⵎ           m       م        M, m              77, 109    m
U+2D4F             ⵏ           n       ن        N, n              78, 110    n
U+2D53             ⵓ           u       و        W, w              87, 119    u
U+2D54             ⵔ           r       ر        R, r              82, 114    r
U+2D55             ⵕ           ṛ                Ë, ë              203, 235   R
U+2D56             ⵖ           ɣ       غ        V, v              86, 118    G
U+2D59             ⵙ           s       س        S, s              83, 115    s
U+2D5A             ⵚ           ṣ       ص        Ã, ã              195, 227   S
U+2D5B             ⵛ           c       ش        C, c              67, 99     c
U+2D5C             ⵜ           t       ت        T, t              84, 116    t
U+2D5F             ⵟ           ṭ       ط        Ï, ï              207, 239   T
U+2D61             ⵡ           w       و        W, w              87, 119    w
U+2D62             ⵢ           y       ي        Y, y              89, 121    y
U+2D63             ⵣ           z       ز        Z, z              90, 122    z
U+2D65             ⵥ           ẓ       ژ        Ç, ç              199, 231   Z
U+2D6F             ⵯ           ʷ                no correspondent in Tifinaghe-IRCAM   °
3 Note: different use from the IPA, which uses the letter ə.
Table 1: The mapping between the existing writing systems and the chosen writing system.

A transliteration tool was built in order to handle transliteration to and from the chosen writing system and to correct some elements, such as the character "^", which exists in some texts due to input errors when entering some Tifinaghe letters. So the sentence portion "ⴰⵙⵙ ⵏ ⵜⵎⵖⵔⴰ" …

4. AnCoraPipe tool
AnCoraPipe (Bertran et al. 2008) is a corpus annotation tool which allows different linguistic levels to be annotated efficiently, since it uses the same format for all stages. The tool reduces the annotation time and eases the integration of the different annotators and the different annotation levels.
The input documents may have a standard XML format, allowing tree structures to be represented (especially useful at syntactic annotation stages). As XML is a widespread standard, there are many tools available for its analysis, transformation and management.
AnCoraPipe includes an integrated search engine based on the XPath language (http://www.w3.org/TR/xpath/), which allows structures of all kinds to be found among the documents. For corpus analysis, an export tool can summarize the attributes of all nodes in the corpus in a grid that can easily be imported into basic analysis tools (such as Excel or OpenOffice Calc), statistical software (SPSS) or Machine Learning tools (Weka).
A default tagset is provided in the standard installation. It has been designed to be as generic as possible in order to match the requisites of a wide range of languages. In spite of that, if the generic tagset is not useful, the interface is fully customizable to allow different tagsets defined by the user.
In order to make AnCoraPipe usable in a full variety of languages, the user can change the visualization font. This may help in viewing non-Latin scripts such as Chinese, Arabic or Amazigh.
AnCoraPipe is currently an Eclipse plugin. Eclipse is an extendable integrated development environment. With this
plugin, all the features included in Eclipse are made available for corpus annotation and development. In particular, Eclipse's collaboration and team plugins can be used to organize the work of a group of annotators.
In Table 2, the node Residual stands for attributes like currency, number, date, math marks and other unknown residual words.
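The character correspondences of Table 1 above can be applied as a straightforward lookup, as in the sketch below; the partial mapping and the function are illustrative assumptions and do not reproduce IRCAM's transliteration tool.

# Sketch: transliterate Tifinaghe Unicode text into the tagging alphabet of
# Table 1; unknown characters are passed through unchanged.

TIFINAGHE_TO_TAG = {
    "ⴰ": "a", "ⴱ": "b", "ⴳ": "g", "ⴷ": "d", "ⴹ": "D", "ⴻ": "e", "ⴼ": "f",
    "ⴽ": "k", "ⵀ": "h", "ⵃ": "H", "ⵄ": "E", "ⵅ": "x", "ⵇ": "q", "ⵉ": "i",
    "ⵊ": "j", "ⵍ": "l", "ⵎ": "m", "ⵏ": "n", "ⵓ": "u", "ⵔ": "r", "ⵕ": "R",
    "ⵖ": "G", "ⵙ": "s", "ⵚ": "S", "ⵛ": "c", "ⵜ": "t", "ⵟ": "T", "ⵡ": "w",
    "ⵢ": "y", "ⵣ": "z", "ⵥ": "Z", "ⵯ": "°",
}


def to_tagging_alphabet(text):
    """Map each Tifinaghe character to its tag character from Table 1."""
    return "".join(TIFINAGHE_TO_TAG.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(to_tagging_alphabet("ⴰⵙⵙ ⵏ ⵜⵎⵖⵔⴰ"))   # the sentence portion quoted above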
CRF++ toolkits⁴.
6. Conclusion and future works
In this paper, after a brief description of the social and linguistic characteristics of the Amazigh language, we have presented the basic principles we followed for tagging Amazigh written corpora with AnCoraPipe: the tagset used, the transliteration, and the annotation tool.
In the future, our goal is to tag more corpora in order to constitute a reference corpus for work on Amazigh NLP; we also plan to work on Amazigh base phrase chunking.

Acknowledgments
We would like to thank Manuel Bertran for improving the AnCoraPipe tool to support Amazigh features, as well as all IRCAM researchers and Professor Iazzi El Mehdi from Ibn Zohr University, Agadir, for their explanations and precious help. The work of the last two authors was carried out thanks to the AECID-PCI C/026728/09 and TIN2009-13391-C04-03/04 research projects.

References
Allauzen, A., Bonneau-Maynard, H. (2008). Training and evaluation of POS taggers on the French MULTITAG corpus. In Proceedings of LREC 08.
Ameur, M., Boujajar, A., Boukhris, F., Boukouss, A., Boumaled, A., Elmedlaoui, M., Iazzi, E., Souifi, H. (2006a). Initiation à la langue Amazighe. Publications de l'IRCAM, pp. 45–77.
Ameur, M., Boujajar, A., Boukhris, F., Boukouss, A., Boumaled, A., Elmedlaoui, M., Iazzi, E. (2006b). Graphie et orthographe de l'Amazighe. Publications de l'IRCAM.
Andries, P. (2004). La police open type Hapax berbère. In Proceedings of the workshop La typographie entre les domaines de l'art et l'informatique, pp. 183–196.
Bertran, M., Borrega, O., Recasens, M., Soriano, B. (2008). AnCoraPipe: A tool for multilevel annotation. Procesamiento del Lenguaje Natural, nº 41, Madrid (Spain).
Boukhris, F., Boumalk, A., El Moujahid, E., Souifi, H. (2008). La nouvelle grammaire de l'Amazighe. Publications de l'IRCAM.
Boukhris, F. (2006). Structure morphologique de la préposition en Amazighe. In Proceedings of the workshop Structures morphologiques de l'Amazighe. Publications de l'IRCAM, pp. 46–56.
Boukouss, A. (1995). Société, langues et cultures au Maroc: Enjeux symboliques. Publications de la Faculté des Lettres de Rabat.
Boumalk, A., Naït-Zerrad, K. (2009). Amawal n tjrrumt – Vocabulaire grammatical. Publications de l'IRCAM.
Chafiq, M. (1991). أربعة وأربعون درسا في اللغة الأمازيغية. éd. Arabo-africaines.
Chaker, S. (1989). Textes en linguistique berbère – Introduction au domaine berbère. Éditions du CNRS, 1984, pp. 232–242.
Cohen, D. (2007). Chamito-sémitiques (langues). In Encyclopædia Universalis.
Iazzi, E., Outahajala, M. (2008). Amazigh Data Base. In Proceedings of LREC 08.
Kudo, T., Matsumoto, Y. (2000). Use of Support Vector Learning for Chunk Identification.
Lafferty, J., McCallum, A., Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-01, pp. 282–289.
Outahajala, M., Zenkouar, L. (2005). La norme du tri, du clavier et Unicode. In Proceedings of the workshop La typographie entre les domaines de l'art et l'informatique, pp. 223–238.
Saa, F. (2006). Les thèmes verbaux de l'Amazighe. In Proceedings of the workshop Structures morphologiques de l'Amazighe, pp. 102–111.
Zenkouar, L. (2004). L'écriture Amazighe Tifinaghe et Unicode. Études et documents berbères, Paris (France), n° 22, pp. 175–192.
Zenkouar, L. (2008). Normes des technologies de l'information pour l'ancrage de l'écriture Amazighe. Études et documents berbères, Paris (France), n° 27, pp. 159–172.

⁴ Freely downloadable from http://chasen.org/~taku/software/YamCha/ and http://crfpp.sourceforge.net/
Verb Morphology of Hebrew and Maltese — Towards an Open Source Type
Theoretical Resource Grammar in GF
Dana Dannélls∗ and John J. Camilleri†
∗
Department of Swedish Language, University of Gothenburg
SE-405 30 Gothenburg, Sweden
†
Department of Intelligent Computer Systems, University of Malta
Msida MSD2080, Malta
dana.dannells@svenska.gu.se, jcam0003@um.edu.mt
Abstract
One of the first issues that a programmer must tackle when writing a complete computer program that processes natural language is
how to design the morphological component. A typical morphological component should cover three main aspects in a given language:
(1) the lexicon, i.e. how morphemes are encoded, (2) orthographic changes, and (3) morphotactic variations. This is in particular
challenging when dealing with Semitic languages because of their non-concatenative morphology called root and pattern morphology.
In this paper we describe the design of two morphological components for Hebrew and Maltese verbs in the context of the Grammatical
Framework (GF). The components are implemented as a part of larger grammars and are currently under development. We found that
although Hebrew and Maltese share some common characteristics in their morphology, it seems difficult to generalize morphosyntactic
rules across Semitic verbs when the focus is towards computational linguistics motivated lexicons. We describe and compare the verb
morphology of Hebrew and Maltese and motivate our implementation efforts towards complete open source type theoretical resource
grammars for Semitic languages. Future work will focus on semantic aspects of morphological processing.
{ t=Quad ; r={ K=K ; T=T ; B=B ; L=L } ;
  p={ v1=v1 ; v2=v2 } } ;
}

Lexicon
In this example, functions are linearized by using two different operations, defined for regular inflection of verbs (used in write_V2), where the verb is given in the perfect tense, third person, singular, masculine, and for irregular inflection of verbs (used in pray_V), where two additional strings are given, namely the imperative singular and the imperative plural forms of the verb.

write_V2 = mkVerb "kiteb" ;
pray_V = mkVerb "talab" "itlob" "itolbu";

4.4. Inflection paradigm
An example of the output produced by GF for the verb 'write' is illustrated in Table 1.

Hebrew (mkVPaal "ktb") | Maltese (mkVerb "kiteb")
Perfect
Vp1Sg ⇒ "ktbty"     | (Per1 Sg) ⇒ "ktibt"
Vp1Pl ⇒ "ktbnw"     | (Per1 Pl) ⇒ "ktibna"
Vp2SgMasc ⇒ "ktbt"  | (Per2 Sg) ⇒ "ktibt"
Vp2SgFem ⇒ "ktbt"   |
Vp2PlMasc ⇒ "ktbtM" | (Per2 Pl) ⇒ "ktibtu"
Vp2PlFem ⇒ "ktbtN"  |
Vp3SgMasc ⇒ "ktb"   | (Per3Sg Masc) ⇒ "kiteb"
Vp3SgFem ⇒ "ktbh"   | (Per3Sg Fem) ⇒ "kitbet"
Vp3PlMasc ⇒ "ktbw"  | Per3Pl ⇒ "kitbu"
Vp3PlFem ⇒ "ktbw"   |
Imperfect
Vp1Sg ⇒ "Aktwb"     | (Per1 Sg) ⇒ "nikteb"
Vp1Pl ⇒ "nktwb"     | (Per1 Pl) ⇒ "niktbu"
Vp2SgMasc ⇒ "tktwb" | (Per2 Sg) ⇒ "tikteb"
Vp2SgFem ⇒ "tktby"  |
Vp2PlMasc ⇒ "tktbw" | (Per2 Pl) ⇒ "tiktbu"
Vp2PlFem ⇒ "tktbw"  |
Vp3SgMasc ⇒ "yktwb" | (Per3Sg Masc) ⇒ "jikteb"
Vp3SgFem ⇒ "tktwb"  | (Per3Sg Fem) ⇒ "tikteb"
Vp3PlMasc ⇒ "yktbw" | Per3Pl ⇒ "jiktbu"
Vp3PlFem ⇒ "yktbw"  |

Table 1: Example of Hebrew and Maltese verb inflection tables of the verb 'write'.
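For illustration only, the Maltese perfect forms of Table 1 can be approximated in a few lines of Python by deriving two stems from the lemma and attaching the person/number suffixes. This is merely a sketch of the paradigm, not the GF implementation, which builds forms as typed records rather than by string concatenation:

# Hypothetical sketch (not GF): the Maltese strong-verb perfect paradigm
# of Table 1, derived from the lemma "kiteb" by stem alternation + suffixes.
def stems(lemma):
    # A strong CVCVC verb such as "kiteb": k-i-t-e-b
    c1, v1, c2, _v2, c3 = lemma
    before_consonant = c1 + c2 + "i" + c3   # "ktib", used before -t, -na, -tu
    before_vowel = c1 + v1 + c2 + c3        # "kitb", used before -et, -u
    return before_consonant, before_vowel

def perfect(lemma):
    kc, kv = stems(lemma)
    return {
        "Per1 Sg": kc + "t",        # ktibt
        "Per2 Sg": kc + "t",        # ktibt
        "Per3 Sg Masc": lemma,      # kiteb
        "Per3 Sg Fem": kv + "et",   # kitbet
        "Per1 Pl": kc + "na",       # ktibna
        "Per2 Pl": kc + "tu",       # ktibtu
        "Per3 Pl": kv + "u",        # kitbu
    }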
5. State of the work
The core syntax implemented for the two languages has around 13 categories and 22 construction functions. It covers simple syntactic constructions, including predication rules which are built from noun and verb phrases.
The lexicons were manually populated with a small number of lexical units, covering around 20 verbs and 10 nouns in each language. The Maltese verb morphology covers the root groups: strong, defective and quadriliteral. In Hebrew, the strong verb paradigms and five weak verb paradigms in binyan pa'al are covered.

6. Discussion and related work
Although there are already some morphological analyzers available for Hebrew (Itai and Wintner, 2008; Yona and Wintner, 2008) and data resources available for Maltese (Rosner et al., 1999), they are not directly usable within the Grammatical Framework. To exploit the advantages offered by GF, the language's grammar must be implemented in this formalism. One of the advantages of implementing Semitic non-concatenative morphology in a typed language such as GF, compared with other finite state languages, is that strings are formed by records, and not through concatenation. Moreover, once the core grammar is defined and the structure and form of the lexicon are determined, it is possible to automatically acquire lexical entries from existing lexical resources. In the context of GF, three wide-coverage lexicons have been acquired automatically: Bulgarian (Angelov, 2008b), Finnish (Tutkimuskeskus, 2006) and Swedish (Angelov, 2008a).
In this work, the design decisions taken by the programmers are based on different arguments concerning the division of labour between a linguistically trained grammarian and a lexicographer. The Maltese implementation considers stems in the lexicon rather than patterns and roots, cf. Rosner et al. (1998); in the framework of GF, classes of inflectional phenomena are given an abstract representation that interacts with the root and pattern system. In Hebrew, recognizing prefixes and suffixes is not always sufficient for recognizing the root of the verb. Although root recognition is mandatory for generating the verb's complete conjugation table, changes in patterns and the absence of root letters in different lexemes make it increasingly hard to infer the root (Deutsch and Frost, 2002), which requires a large amount of tri-consonantal constraints. This is in particular true for lexemes derived from weak roots, where one of the root consonants is often missing (Frost et al., 2000). To avoid a large amount of morphosyntactic rules, we choose to employ semantic markings in the lexicon by specifying roots and patterns instead of lexemes; this computationally motivated approach becomes plausible since the meaning of the lexeme is already known.

7. Conclusions and Future Work
In this paper we have presented implementations of Hebrew and Maltese components that aim to capture the non-concatenative morphology of their verbs. Although we could identify common characteristics among these two Semitic languages, we found it difficult to generalize morphosyntactic rules across Semitic verbs when the focus is towards a computationally motivated lexicon.
When designing a computer system that can process several languages automatically, it is useful to generalize as many morphosyntactic rules as possible across languages that belong to the same language group. One fundamental question that arises from our implementations is to what extent we can generalize the concrete syntaxes of Semitic languages. One way to approach this question is by employing semantic markings in the lexicons of the Semitic languages and focusing on semantic aspects of morphological processing. This remains future work.
8. References
Krasimir Angelov. 2008a. Importing SALDO in GF. http://spraakbanken.gu.se/personal/lars/kurs/lgres08/tpaper/angelov.pdf.
Krasimir Angelov. 2008b. Type-theoretical Bulgarian grammar. In B. Nordström and A. Ranta, editors, Advances in Natural Language Processing (GoTAL 2008), volume 5221 of LNCS/LNAI, pages 52–64.
Joseph Aquilina. 1960. The structure of Maltese. Valletta: Royal University of Malta.
Joseph Aquilina. 1962. Papers in Maltese linguistics. Royal University, Malta.
Maya Arad. 2005. Roots and patterns: Hebrew morpho-syntax. Springer, The Netherlands.
Edna Amir Coffin and Shmuel Bolozky. 2005. A Reference Grammar of Modern Hebrew. Cambridge University Press.
M. Rosner, J. Caruana, and R. Fabri. 1999. Linguistic and computational aspects of Maltilex. In Arabic Translation and Localisation Symposium: Proceedings of the Workshop, pages 2–10, Tunis.
Kotimaisten Kielten Tutkimuskeskus. 2006. KOTUS wordlist. http://kaino.kotus.fi/sanat/nykysuomi.
Adam Ussishkin and Alina Twist. 2007. Lexical access in Maltese using visual and auditory lexical decision. In Conference of Maltese Linguistics, L-Ghaqda Internazzjonali tal-Lingwistika Maltija, University of Bremen, Germany, October.
Shlomo Yona and Shuly Wintner. 2008. A finite-state morphological grammar of Hebrew. Natural Language Engineering, 14(2):173–190, April.
Ali Dada and Aarne Ranta. 2007. Implementing an open
source arabic resource grammar in GF. In M. Mughazy,
editor, Perspectives on Arabic Linguistics., Papers from
the Twentieth Annual Symposium on Arabic Linguistics.
John Benjamins Publishing Company, March 26.
Avital Deutsch and Ram Frost, 2002. Lexical organization
and lexical access in a non-concatenated morphology,
chapter 9. John Benjamins.
R. Frost, A. Deutsch, and K. I. Forster. 2000. Decompos-
ing morphologically complex words in a nonlinear mor-
phology. Journal of Experimental Psychology: Learn-
ing, Memory and Cognition, 26(3):751–765.
Gideon Goldberg. 1994. Principles of Semitic word-structure. In G. Goldberg and S. Raz, editors, Semitic and Cushitic studies, pages 29–64. Wiesbaden: Harrassowitz.
Robert D. Hoberman and M. Aronoff, 2003. The verbal
morphology of Maltese: From Semitic to Romance, chap-
ter 3, pages 61–78. Amsterdam: John Benjamins.
Alon Itai and Shuly Wintner. 2008. Language resources
for Hebrew. Language Resources and Evaluation,
42(1):75–98, March.
P. Martin-Löf. 1975. An intuitionistic theory of types:
Predicative part. In H. E. Rose and J. C. Shepherdson,
editors, Proc. of Logic Colloquium ’73, Bristol, UK, vol-
ume 80, pages 73–118. North-Holland.
John J. McCarthy. 1979. On stress and syllabification. Lin-
guistic Inquiry, 10:443–465.
John J. McCarthy. 1981. The representation of consonant
length in Hebrew. Linguistic Inquiry, 12:322–327.
Manwel Mifsud. 1995. Loan verbs in Maltese: a descrip-
tive and comparative study. Leiden, New York, USA.
Aarne Ranta. 2004. Grammatical framework, a type-
theoretical grammar formalism. Journal of Functional
Programming, 14(2):145–189.
Aarne Ranta. 2009. The GF resource gram-
mar library. The on-line journal Linguistics
in Language Technology (LiLT), 2(2). http:
//elanguage.net/journals/index.php/
lilt/article/viewFile/214/158.
M. Rosner, J. Caruana, and R. Fabri. 1998. Maltilex: a computational lexicon for Maltese. In Semitic '98: Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 97–101, Morristown, NJ, USA. Association for Computational Linguistics.
Syllable Based Transcription of English Words into
Perso-Arabic Writing System
Jalal Maleki
Dept. of Computer and Information Science
Linköping University
SE-581 83 Linköping
Sweden
Email: jma@ida.liu.se
2.2 Vowels

From a transcription point of view, vowel correspondence between Persian and English phonology is also imperfect and relatively simple. Some examples are shown in Table-1. Some English diphthongs are treated as two separate vowels, whereas some others are interpreted as a single vowel.
Phonological mapping is followed by conversion of phonemic romanized Persian to PA-Script. The type of syllable containing a vowel and the characteristics of the neighboring graphemes determine the choice of grapheme (or allograph) for the vowel. As an example, Table-2 shows the various graphemes and digraphs used for writing the vowel /i/ in different contexts [11].

2.3 Syllable Constraints and Consonant Clusters

Syllable structure in Persian is restricted to (C)V(C)(C), whereas English allows the more complex structure (C)(C)(C)V(C)(C)(C)(C).
One of the main problems in writing English words in PA-Script is the transformation of syllables. For example, the word 'question', represented as /K, W, EH1, S, CH, AH0, N/ in CMUPD with the syllables /K, W, EH1, S/ and /CH, AH0, N/, is transcribed to kuesšen one syllable at a time, finally resyllabified as ku-es-šen, and transliterated to PA-Script. Resyllabification is necessary since consonant clusters are broken by vowel epenthesis.
In general, the Persian transcription of English words involves short vowel insertion into consonant clusters and resyllabification (see Table-3 for examples).

3 The Implementation

Transcription of an English word w into P-Script involves a number of steps which are briefly discussed below.

1. w is looked up in the syllabified CMUPD dictionary [4] and its syllabified pronunciation p(w) is retrieved. For example, given the word 'surgical', we get: ((S ER1) (JH IH0) (K AH0 L)).

2. Syllables of p(w) are transcribed to Dabire, which is a phonemic orthography for Persian. For 'surgical', we get ((s e r) (g i) (kâl)).

3. The syllables are individually modified to fulfill the constraints of Persian syllable structure. For example, spring (CCCVCC) is transformed to espering (VCCVCVCC) using e-epenthesis, and prompt (CCVCCC) is transformed to perompet (CVCVCCVC). See Table-3 for more examples.

4. The resulting Dabire word is resyllabified. For example, espering is syllabified as es.pe.ring.

5. Context-dependent replace rules [3] are applied to enforce the orthographical conventions of Persian [5, 13, 1].

6. Finally, the Dabire word is transliterated to Perso-Arabic Unicode.

Steps 1-3 are currently implemented in Lisp and steps 4-6 are implemented as transducers in XFST [3].
The syllabification step (4), which is one of the main modules of the system, is explained further here. The syllabification transducer works from left to right on the input string and ensures that the number of consonants in the onset is maximized. Given the syllabic structure of Persian, this essentially means that if a vowel, V, is preceded by a consonant, C, then CV initiates a syllable. For example, for a word such as jârue, the syllabification jâ.ru.e (CV.CV.V) is selected and jâr.u.e (CVC.V.V) is rejected. The correct syllabification naturally leads to correct writing since, as mentioned earlier, vowels are written differently depending on their position in the syllable.
The following XFST definitions form the core of the syllabification [11]:

define Sy V|VC|VCC|CV|CVC|CVCC;

define Sfy C* V C* @->
          ... "." || _ Sy;

The first statement defines a language (Sy) containing all syllables of Dabire. V, VC, etc. are defined as regular languages that represent well-formed syllables in Dabire. For example, CVCC is defined as

define CVCC [C V C C] .o. ~$NotAllowed;

which defines the language containing all possible CVCC syllables, excluding the untolerated consonant clusters in NotAllowed, such as bp, kq, and cc.
Vowel | Example Word | Phonemes | Persian Phoneme | Romanized Persian | Perso-Arabic
AA | odd | AA D | â | âd | X@
AE | at | AE T | a | at | H@
AH | hut | HH AH T | â | hât | HAë
AO | ought | AO T | o | ot | Hð@
AW | cow | K AW | â | kâv | ðA¿
AY | hide | HH AY D | ây | hâyd | YK Aë

Table 1. Some vowels from the CMU Pronunciation Dictionary with examples.
The second statement defines a replacement rule [3] that represents the syllabification process. The operator @> ensures that the shortest possible strings (of the form C* V C*) are selected in the left-to-right direction and identified as syllables, which are separated by a dot.
Table-4 includes examples that illustrate the input/output of this process.
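For illustration only, the dot-insertion behaviour described above can be approximated in Python over a simplified vowel inventory. This is a rough sketch, not the XFST transducer, and it ignores the NotAllowed cluster filter:

import re

# Rough approximation of the syllabification described above: a boundary is
# inserted before every consonant+vowel sequence and between two vowels, so
# onsets contain at most one consonant. Simplified vowel inventory only.
VOWELS = "aeiouâ"

def syllabify(word):
    cons = "[^" + VOWELS + ".]"
    vow = "[" + VOWELS + "]"
    boundary = "(?<=.)(?=" + cons + vow + ")|(?<=" + vow + ")(?=" + vow + ")"
    return re.sub(boundary, ".", word)

# syllabify("espering") -> "es.pe.ring"
# syllabify("jârue")    -> "jâ.ru.e"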
4 Discussion and Evaluation

We have introduced a rule-based transcription of English to PA-Script. Earlier work [2, 8, 6, 7] mainly relies on statistical methods.
Our method produces correct transcriptions for most of the data set randomly selected from CMUPD. Quantitative evaluation of the method is in progress. The performance of the system depends on the availability of syllabified English words; future improvements would require the use of statistical methods for automatically handling words that do not exist in the dictionary. Some early experiments [14] based on CMUPD show a success rate of 71.6% in automatic grapheme-to-phoneme conversion of English words not present in CMUPD. Further development would also require the integration of automatic syllabification of English [12] into the system.

References

[1] M. S. Adib-Soltâni. An Introduction to Persian Orthography (in Persian). Amir Kabir Publishing House, Tehrân, 2000.
[2] Y. Al-Onaizan and K. Knight. Machine transliteration of names in Arabic text. In ACL Workshop on Computational Approaches to Semitic Languages, 2002.
[3] K. R. Beesley and L. Karttunen. Finite State Morphology. CSLI Publications, 2003.
[4] Carnegie Mellon University. CMU pronunciation dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 2008.
[5] Farhangestan. Dastur e Khatt e Farsi (Persian Orthography), volume Supplement No. 7. Persian Academy, Tehran, 2003.
[6] J. Johanson. Transcription of names written in Farsi into English. In A. Farghaly and K. Megerdoomian, editors, Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, pages 74–80, 2007.
[7] S. Karimi, A. Turpin, and F. Scholer. English to Persian transliteration. In Lecture Notes in Computer Science, volume 4209, pages 255–266. Springer, 2006.
[8] M. M. Kashani, F. Popowich, and A. Sarkar. Automatic transliteration of proper nouns from Arabic to English. In A. Farghaly and K. Megerdoomian, editors, Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, pages 81–87, 2007.
[9] R. R. Z. Malek. Qavâed e Emlâ ye Fârsi. Golâb, 2001.
[10] J. Maleki. A Romanized Transcription for Persian. In Proceedings of the Natural Language Processing Track (INFOS2008), Cairo, 2008.
[11] J. Maleki and L. Ahrenberg. Converting Romanized Persian to the Arabic Writing System Using Syllabification. In Proceedings of LREC 2008, Marrakech, 2008.
[12] Y. Marchand, C. R. Adsett, and R. I. Damper. Automatic Syllabification in English: A Comparison of Different Algorithms. Language and Speech, 52(1):1–27, 2009.
[13] S. Neysari. A Study on Persian Orthography (in Persian). Sâzmân e Câp o Entešârât, 1996.
[14] S. Stymne. Private communication. Linköping, 2010.
Table 2. Mapping /i/ to P-Script graphemes, by syllable type (V, VC, VCC; CVC, CVCC; CV) and position (word-initial, segment-initial, segment-medial, segment-final, and intra-word isolated).
Table 3. Epenthesis in consonant cluster transcription. C1 stands for all consonants except /w/ and /y/. C2 stands for all consonants except /w/, /y/ and /r/. C3 stands for all consonants except /s/ and /š/.
COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Altantawy, Yassine Benajiba
Center for Computational Learning Systems, Columbia University
475 Riverside Drive, Suite 850
New York, NY 10115
{mdiab,habash,rambow,mtantawy,ybenajiba}@ccls.columbia.edu
Abstract
In this paper, we describe COLABA, a large effort to create resources and processing tools for Dialectal Arabic Blogs. We describe
the objectives of the project, the process flow and the interaction between the different components. We briefly describe the manual
annotation effort and the resources created. Finally, we sketch how these resources and tools are put together to create DIRA, a term-
expansion tool for information retrieval over dialectal Arabic collections using Modern Standard Arabic queries.
1. Introduction these genres. In fact, applying NLP tools designed for MSA
directly to DA yields significantly lower performance, mak-
The Arabic language is a collection of historically related ing it imperative to direct the research to building resources
variants. Arabic dialects, collectively henceforth Dialectal and dedicated tools for DA processing.
Arabic (DA), are the day to day vernaculars spoken in the DA lacks large amounts of consistent data due to two fac-
Arab world. They live side by side with Modern Standard tors: a lack of orthographic standards for the dialects, and
Arabic (MSA). As spoken varieties of Arabic, they differ a lack of overall Arabic content on the web, let alone DA
from MSA on all levels of linguistic representation, from content. These lead to a severe deficiency in the availabil-
phonology, morphology and lexicon to syntax, semantics, ity of computational annotations for DA data. The project
and pragmatic language use. The most extreme differences presented here – Cross Lingual Arabic Blog Alerts (CO-
are on phonological and morphological levels. LABA) – aims at addressing some of these gaps by building
The language of education in the Arab world is MSA. DA is large-scale annotated DA resources as well as DA process-
perceived as a lower form of expression in the Arab world; ing tools.1
and therefore, not granted the status of MSA, which has This paper is organized as follows. Section 2. gives a high
implications on the way DA is used in daily written venues. level description of the COLABA project and reviews the
On the other hand, being the spoken language, the native project objectives. Section 3. discusses the annotated re-
tongue of millions, DA has earned the status of living lan- sources being created. Section 4. reviews the tools created
guages in linguistic studies, thus we see the emergence of for the annotation process as well as for the processing of
serious efforts to study the patterns and regularities in these the content of the DA data. Finally, Section 5. showcases
linguistic varieties of Arabic (Brustad, 2000; Holes, 2004; how we are synthesizing the resources and tools created for
Bateson, 1967; Erwin, 1963; Cowell, 1964; Rice and Sa’id, DA for one targeted application.
1979; Abdel-Massih et al., 1979). To date most of these
studies have been field studies or theoretical in nature with 2. The COLABA Project
limited annotated data. In current statistical Natural Lan- COLABA is a multi-site partnership project. This paper,
guage Processing (NLP) there is an inherent need for large- however, focuses only on the Columbia University contri-
scale annotated resources for a language. For DA, there butions to the overall project.
has been some limited focused efforts (Kilany et al., 2002; COLABA is an initiative to process Arabic social media
Maamouri et al., 2004; Maamouri et al., 2006); however, data such as blogs, discussion forums, chats, etc. Given
overall, the absence of large annotated resources continues that the language of such social media is typically DA, one
to create a pronounced bottleneck for processing and build- of the main objective of COLABA is to illustrate the signif-
ing robust tools and applications. icant impact of the use of dedicated resources for the pro-
DA is a pervasive form of the Arabic language, especially cessing of DA on NLP applications. Accordingly, together
given the ubiquity of the web. DA is emerging as the lan- with our partners on COLABA, we chose Information Re-
guage of informal communication online, in emails, blogs, trieval (IR) as the main testbed application for our ability to
discussion forums, chats, SMS, etc, as they are media that process DA.
are closer to the spoken form of language. These genres Given a query in MSA, using the resources and processes
pose significant challenges to NLP in general for any lan- created under the COLABA project, the IR system is able
guage including English. The challenge arises from the to retrieve relevant DA blog data in addition to MSA
fact that the language is less controlled and more speech data/blogs, thus allowing the user access to as much Arabic
like while many of the textually oriented NLP techniques
are tailored to processing edited text. The problem is com- 1
We do not address the issue of augmenting Arabic web con-
pounded for Arabic precisely because of the use of DA in tent in this work.
content (in the inclusive sense of MSA and DA) as possi- tuition is that the more words in the blogs that are not
ble. The IR system may be viewed as a cross lingual/cross analyzed or recognized by a MSA morphological an-
dialectal IR system due to the significant linguistic differ- alyzer, the more dialectal the blog. It is worth noting
ences between the dialects and MSA. We do not describe that at this point we only identify that words are not
the details of the IR system or evaluate it here; although we MSA and we make the simplifying assumption that
allude to it throughout the paper. they are DA. This process results in an initial ranking
There are several crucial components needed in order for of the blog data in terms of dialectness.
this objective to be realized. The COLABA IR sys-
tem should be able to take an MSA query and convert 3. Content Clean-Up. The content of the highly ranked
it/translate it, or its component words to DA or alterna- dialectal blogs is sent for an initial round of manual
tively convert all DA documents in the search collection clean up handling speech effects and typographical er-
to MSA before searching on them with the MSA query. In rors (typos) (see Section3.2.). Additionally, one of the
COLABA, we resort to the first solution. Namely, given challenging aspects of processing blog data is the se-
MSA query terms, we process them and convert them to vere lack of punctuation. Hence, we add a step for
DA. This is performed using our DIRA system described sentence boundary insertion as part of the cleaning up
in Section 5.. DIRA takes in an MSA query term(s) and process (see Section 3.3.). The full guidelines will be
translates it/(them) to their corresponding equivalent DA presented in a future publication.
terms. In order for DIRA to perform such an operation it
requires two resources: a lexicon of MSA-DA term corre- 4. Second Ranking of Blogs and Dialectalness De-
spondences, and a robust morphological analyzer/generator tection. The resulting cleaned up blogs are passed
that can handle the different varieties of Arabic. The pro- through the DI pipeline again. However, this time,
cess of creating the needed lexicon of term correspondences we need to identify the actual lexical items and add
is described in detail in Section 3.. The morphological an- them to our lexical resources with their relevant infor-
alyzer/generator, MAGEAD, is described in detail in Sec- mation. In this stage, in addition to identifying the
tion 4.3.. dialectal unigrams using the DI pipeline as described
For evaluation, we need to harvest large amounts of data in step 2, we identify out of vocabulary bigrams and
from the web. We create sets of queries in domains of in- trigrams allowing us to add entries to our created re-
terest and dialects of interest to COLABA. The URLs gen- sources for words that look like MSA words (i.e. cog-
erally serve as good indicators of the dialect of a website; nates and faux amis that already exist in our lexica,
however, given the fluidity of the content and variety in di- yet are specified only as MSA). This process renders
alectal usage in different social media, we decided to per- a second ranking for the blog documents and allows
form dialect identification on the lexical level. us to hone in on the most dialectal words in an ef-
Moreover, knowing the dialect of the lexical items in a doc- ficient manner. This process is further elaborated in
ument helps narrow down the search space in the under- Section 4.2..
lying lexica for the morphological analyzer/generator. Ac-
5. Content Annotation. The content of the blogs that
cordingly, we will also describe the process of dialect an-
are most dialectal are sent for further content annota-
notation for the data.
tion. The highest ranking blogs undergo full word-by-
The current focus of the project is on blogs spanning four
word dialect annotation as described in Section 3.5..
different dialects: Egyptian (EGY), Iraqi (IRQ), Levantine
Based on step 4, the most frequent surface words that
(LEV), and (a much smaller effort on) Moroccan (MOR).
are deemed dialectal are added to our underlying lex-
Our focus has been on harvesting blogs covering 3 do-
ical resources. Adding an entry to our resources en-
mains: social issues, religion and politics.
tails rendering it in its lemma form since our lexical
Once the web blog data is harvested as described in Sec-
database uses lemmas as its entry forms. We create the
tion 3.1., it is subjected to several processes before it is
underlying lemma (process described in Section 3.6.)
ready to be used with our tools, namely MAGEAD and
and its associated morphological details as described
DIRA. The annotation steps are as follows:
in Section 3.7.. Crucially, we tailor the morphologi-
1. Meta-linguistic Clean Up. The raw data is cleaned cal information to the needs of MAGEAD. The choice
from html mark up, advertisements, spam, encoding of surface words to be annotated is ranked based on
issues, and so on. Meta-linguistic information such as the word’s frequency and its absence from the MSA
date and time of post, poster identity information and resources. Hence the surface forms are ranked as
such is preserved for use in later stages. follows: unknown frequent words, unknown words,
then known words that participate in infrequent bi-
2. Initial Ranking of the Blogs. The sheer amount of grams/trigrams compared to MSA bigrams/trigrams.
data harvested is huge; therefore, we need to select All the DA data is rendered into a Colaba Conven-
blogs that have the most dialectal content so as to tional Orthography (CCO) described in Section 3.4..
maximally address the gap between MSA and DA re- Annotators are required to use the CCO for all their
sources. To that end, we apply a simple DA identifi- content annotations.
cation (DI) pipeline to the blog document collection
ranking them by the level of dialectal content. The DI To efficiently clean up the harvested data and annotate its
pipeline is described in detail in Section 4.2.. The in- content, we needed to create an easy to use user interface
with an underlying complex database repository that orga- for such a task is not straight forward. Thus, we simpli-
nizes the data and makes it readily available for further re- fied the task to the narrow identification of the following
search. The annotation tool is described in Section 4.1.. categories:
Table 2: LEV blog excerpt with sentence boundaries identified (the original input text, followed by the same text after manual sentence boundary detection).
• CCO explicitly indicates the pronounced short vow- 3.5. Dialect Annotation
els and consonant doubling, which are expressed in Our goal is to annotate all the words in running text with
Arabic script with optional diacritics. Accordingly, their degree of dialectalness. In our conception, for the
there is no explicit marking for the sukuun diacritic purposes of COLABA we think of MSA as a variant di-
which we find in Arabic script. For example, the CCO alect; hence, we take it to be the default case for the Arabic
for I. »QÓ mrkb in EGY could be markib ‘boat’ or mi- words in the blogs. We define a dialectal scale with respect
rakkib ‘something put together/causing to ride’ or mu- to orthography, morphology and lexicon. We do not han-
rakkab ‘complex’. dle phrasal level or segment level annotation at this stage
• Clitic boundaries are marked with a +. This is an of our annotation, we strictly abide by a word level annota-
attempt at bridging the gap between phonology and tion.4 The annotators are required to provide the CCO rep-
morphology. We consider the following affixations resentation (in Section 3.4.) for all the words in the blog.
as clitics: conjunctions, prepositions, future particles, If a word as it appears in the original blog maintains its
progressive particles, negative particles, definite arti- meaning and orthography as in MSA then it is considered
cles, negative circumfixes, and attached pronouns. For the default MSA for dialect annotation purposes, however
example, in EGY CCO ÐCð wslAm ‘and peace’ is if it is pronounced in its context dialectically then its CCO
. JºK AÓ mAyktbš ‘he doesn’t representation will reflect the dialectal pronunciation, e.g.
rendered we+sala:m and
write’ is rendered ma+yiktib+c.
I.JºK, yktb ‘he writes’ is considered MSA from a dialect
annotation perspective, but in an EGY context its CCO rep-
• We use the ^ symbol to indicate the presence of the resentation is rendered yiktib rather than the MSA CCO of
Ta Marbuta (feminine marker) morpheme or of the yaktub.
Tanween (nunation) morpheme (marker of indefinite- Word dialectness is annotated according to a 5-point scale
ness). For example, éJ.JºÓ mktbh̄ ‘library’ is rendered building on previous efforts by Habash et al. (2008):
JºJK. bi+yiktib
in CCO as maktaba^ (EGY). Another example is AJÊÔ« • WL1: MSA with dialect morphology I .
ςmlyAã ‘practically’, which is rendered in CCO as ‘he is writing’, I
. JºJë ha+yiktib ‘he will write’
3amaliyyan^.
• WL2: MSA faux amis where the words look MSA but
CCO is comparable to previous efforts on creating re- are semantically used dialectically such as Ñ« 3am a
sources for Arabic dialects (Maamouri et al., 2004; Kilany LEV progressive particle meaning ‘in the state of’ or
et al., 2002). However, unlike Maamouri et al. (2004), MSA ‘uncle’
CCO is not defined as an Arabic script dialectal orthogra-
phy. CCO is in the middle between the morphophonemic • WL3: Dialect lexeme with MSA morphology such as
and phonetic representations used in Kilany et al. (2002) É«Q sa+yiz3al ‘he will be upset’
for Egyptian Arabic. CCO is quite different from com-
• WL4: Dialect lexeme where the word is simply a di-
monly used transliteration schemes for Arabic in NLP such mic ‘not’
alectal word such as the negative particle Ó
as Buckwalter transliteration in that CCO (unlike Buckwal-
ter) is not bijective with Arabic standard orthography.
For the rest of this section, we will use CCO in place of the 4
Annotators are aware of multiword expressions and they note
HSB transliteration except when indicated. them when encountered.
• WL5: Dialect lexeme with a consistent systematic identify the various POS based on form, meaning, and
phonological variation from MSA, e.g., LEV éKCK grammatical function illustrated using numerous examples.
tala:te^ ‘three’ versus éKCK Tala:Ta^. The set of POS tags are as follows: (Common) Noun,
Proper Noun, Adjective, Verb, Adverb, Pronoun, Preposi-
In addition, we specify another six word categories that are tion, Demonstrative, Interrogative, Number, and Quantifier.
of relevance to the annotation task on the word level: For- We require the annotators to provide a detailed morphologi-
eign Word (ñKCJk., jila:to, ‘gelato ice cream’), Borrowed cal profile for three of the POS tags mentioned above: Verb,
Word ( YK@ ½K ð, wi:k 2end, ‘weekend’), Arabic Named En- Noun and Adjective. For this task, our main goal is to iden-
tify irregular morphological behavior. They transcribe all
. AKX ðQÔ«, 3amr dya:b, ‘Amr Diab’), Foreign Named
tity ( H
their data entries in the CCO representation only as defined
Entity ( QKPA¿ ùÒJk., jimi kartar, ‘Jimmy Carter’), Typo (fur-
in Section 3.4.. We use the Arabic script below mainly for
ther typographical errors that are not caught in the first illustration in the following examples.
round of manual clean-up), and in case they don’t know
the word, they are instructed to annotate it as unknown. • Verb Lemma: In addition to the basic 3rd person
masculine singular (3MS) active perfective form of
3.6. Lemma Creation
This task is performed for a subset of the words in the
the dialectal verb lemma, e.g., H . Qå cirib ‘he drank’
(EGY), the annotators are required to enter: (i) the
blogs. We focus our efforts first on the cases where an MSA
morphological analyzer fails at rendering any analysis for
3MS active imperfective H . Qå yicrab; (ii) the 3MS
a given word in a blog. We are aware that our sampling
passive perfective is H . Qå@ incarab; (iii) the 3MS
ignores the faux amis cases with MSA as described in Sec- passive imperfective H. QåJK yincirib; and (iv) and the
tion 3.5.. Thus, for each chosen/sampled dialectal surface masculine singular imperative H . Qå @ icrab.
word used in an example usage from the blog, the annotator
is required to provide a lemma, an MSA equivalent, an En- • Noun Lemma: The annotators are required to en-
glish equivalent, and a dialect ID. All the dialectal entries ter the feminine singular form of the noun if avail-
are expected to be entered in the CCO schema as defined in able. They are explicitly asked not to veer too much
Section 3.4.. away from the morphological form of the lemma, so
for example, they are not supposed to put I sit
We define a lemma (citation form) as the basic entry form
of a word into a lexical resource. The lemma represents ‘woman/lady’ as the feminine form of Ég. @P ra:gil
the semantic core, the most important part of the word that ‘man’. The annotators are asked to specify the ratio-
carries its meaning. In case of nouns and adjectives, the nality/humanness of the noun which interacts in Ara-
lemma is the definite masculine singular form (without the bic with morphosyntactic agreement. Additional op-
explicit definite article). And in case of verbs, the lemma is tional word forms to provide are any broken plurals,
the 3rd person masculine singular perfective active voice. mass count plural collectives, and plurals of plurals,
All lemmas are clitic-free. e.g rigga:la^ and riga:l ‘men’ are both broken plurals
A dialectal surface word may have multiple underlying of ra:gil ‘man’.
lemmas depending on the example usages we present to the
annotators. For example, the word éJ.»QÓ mrkbh occurs in • Adjective Lemma: For adjectives, the annotators pro-
vide the feminine singular form and any broken plu-
two examples in our data: 1. éK YK AK. éJ.»QÓ ú×A sa:mi mi-
rals, e.g. the adjective Èð @ 2awwel ‘first [masc.sing]’
rakkib+uh be+2ide:+h ‘Sami built it with his own hands’
has the corresponding EGY lemma mirakkib ‘build’; and has the feminine singular form úÍð @ 2u:la and the bro-
@ñk@P éËAg
2. éJÓ éJ.»QÓ @ð Q
QË@ ir+rigga:la^ ra:7u yictiru
. ken plural ÉK@ð @ 2awa:2il.
markib+uh minn+uh ‘The men went to buy his boat from
him’ with the corresponding lemma markib ‘boat’. The an-
4. Tools for COLABA
notators are asked to explicitly associate each of the created
lemmas with one or more of the presented corresponding In order to process and manage the large amounts of data
usage examples. at hand, we needed to create a set of tools to streamline the
annotation process, prioritize the harvested data for manual
3.7. Morphological Profile Creation annotation, then use the created resources for MAGEAD.
Finally, we further define a morphological profile for the
entered lemmas created in Section 3.6.. A computation- 4.1. Annotation Interface
ally oriented morphological profile is needed to complete Our annotation interface serves as the portal which annota-
the necessary tools relevant for the morphological analyzer tors use to annotate the data. It also serves as the repository
MAGEAD (see Section 4.3.). We ask the annotators to se- for the data, the annotations and management of the anno-
lect (they are given a list of choices) the relevant part-of- tators. The annotation interface application runs on a web
speech tag (POS) for a given lemma as it is used in the server because it is the easiest and most efficient way to al-
blogs. For some of the POS tags, the annotators are re- low different annotators to work remotely, by entering their
quested to provide further morphological specifications. annotations into a central database. It also manages the an-
In our guidelines, we define coarse level POS tags by pro- notators tasks and tracks their activities efficiently. For a
viding the annotators with detailed diagnostics on how to more detailed description of the interface see (Benajiba and
Diab, 2010). For efficiency and security purposes, the an- 4.2. DA Identification Pipeline
notation application uses two different servers. In the first We developed a simple module to determine the degree to
one, we allocate all the html files and dynamic web pages. which a text includes DA words. Specifically, given Ara-
We use PHP to handle the dynamic part of the application bic text as input, we were interested in determining how
which includes the interaction with the database. The sec- many words are not MSA. The main idea is to use an MSA
ond server is a database server that runs on PostgreSQL.5 morphological analyzer, Buckwalter Arabic Morphological
Our database comprises 22 relational databases that are cat- Analyzer (BAMA) (Buckwalter, 2004), to analyze the input
egorized into tables for: text. If BAMA is able to generate a morphological analysis
• Basic information that is necessary for different mod- for an input word, then we consider that word MSA.
ules of the application. These tables are also signif- As a result, we have a conservative assessment of the di-
icantly useful to ease the maintenance and update of alectness of an input text. A major source of potential errors
the application. are names which are not in BAMA.
We assessed our pipeline on sample blog posts from our
• User permissions: We have various types of users with
harvested data. In an EGY blog post6 19% of the word
different permissions and associated privileges. These
types failed BAMA analysis. These words are mainly DA
tables allow the application to easily check the permis-
words with few named entities. Similar experiments were
sions of a user for every possible action.
conducted on IRQ,7 LEV,8 and MOR9 blog posts yielding
• Annotation information: This is the core table cat- 13.5%, 8% and 26% of non-MSA word types, respectively.
egory of our database. Its tables save the annota- It is worth noting the high percentage of out of vocabulary
tion information entered by each annotator. They also words for the Moroccan thread compared to the other di-
save additional information such as the amount of time alects. Also, by comparison, the low number of misses for
taken by an annotator to finish an annotation task. Levantine. This may be attributed to the fact that BAMA
For our application, we define three types of users, hence covers some Levantine words due to the LDC’s effort on
three views (see Figure 1): the Levantine Treebank (Maamouri et al., 2006).
We further analyzed BAMA-missed word types from a 30K
1. Annotator. An Annotator can perform an annota- word blog collection. We took a sample of 100 words from
tion task, check the number of his/her completed an- the 2,036 missed words. We found that 35% are dialectal
notations, and compare his/her speed and efficiency words and that 30% are named entities. The rest are MSA
against other annotators. An annotator can only work word that are handled by BAMA. We further analyzed two
on one dialect by definition since they are required to 100 string samples of least frequent bigrams and trigrams of
possess native knowledge it. An annotator might be word types (measured against an MSA language model) in
involved in more than one annotation task. the 30K word collection. We found that 50% of all bigrams
2. Lead Annotator. A Lead annotator (i) manages the an- and 25% of trigrams involved at least one dialectal word.
notators’ accounts, (ii) assigns a number of task units The percentages of named entities for bigrams and trigrams
to the annotators, and, (iii) checks the speed and work in our sample sets are 19% and 43%, respectively.
quality of the annotators. Leads also do the tasks
themselves creating a gold annotation for comparison 4.3. MAGEAD
purposes among the annotations carried out by the an- M AGEAD is a morphological analyzer and generator for
notators. A lead is an expert in only one dialect and the Arabic language family, by which we mean both MSA
thus s/he can only intervene for the annotations related and DA. For a fuller discussion of M AGEAD (including an
to that dialect. evaluation), see (Habash et al., 2005; Habash and Rambow,
2006; Altantawy et al., 2010). For an excellent discussion
3. Administrator. An Administrator (i) manages the of related work, see (Al-Sughaiyer and Al-Kharashi, 2004).
Leads’ accounts, (ii) manages the annotators’ ac-
M AGEAD relates (bidirectionally) a lexeme and a set of lin-
counts, (iii) transfers the data from text files to the
guistic features to a surface word form through a sequence
database, (iv) purges the annotated data from the data
of transformations. In a generation perspective, the features
base to xml files, and (v) produces reports such as
are translated to abstract morphemes which are then or-
inter-annotator agreement statistics, number of blogs
dered, and expressed as concrete morphemes. The concrete
annotated, etc.
templatic morphemes are interdigitated and affixes added,
The website uses modern JavaScript libraries in order to finally morphological and phonological rewrite rules are
provide highly dynamic graphical user interfaces (GUI). applied. In this section, we discuss our organization of lin-
Such GUIs facilitate the annotator’s job leading to signifi- guistic knowledge, and give some examples; a more com-
cant gain in performance speed by (i) maximizing the num- plete discussion of the organization of linguistic knowledge
ber of annotations that can be performed by a mouse click in M AGEAD can be found in (Habash et al., 2005).
rather than a keyboard entry and by (ii) using color cod-
ing for fast checks. Each of the GUIs which compose our 6
http://wanna-b-a-bride.blogspot.com/2009/09/blog-
web applications has been carefully checked to be consis- post_29.html
7
tent with the annotation guidelines. http://archive.hawaaworld.com/showthread.php?t=606067&page=76
8
http://www.shabablek.com/vb/t40156.html
5 9
http://www.postgresql.org/ http://forum.oujdacity.net/topic-t5743.html
Figure 1: Servers and views organization.
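To make the dialect-identification heuristic of Section 4.2. concrete, the scoring can be pictured with the following sketch (hypothetical Python, not the COLABA code; analyses_of stands in for a wrapper around an MSA analyzer such as BAMA):

# Hypothetical sketch of the heuristic of Section 4.2.: the fewer word types
# an MSA analyzer can analyze, the more dialectal the text.
def dialectness(words, analyses_of):
    # analyses_of(word) is assumed to return a (possibly empty) list of
    # MSA morphological analyses, e.g. from BAMA.
    types = set(words)
    if not types:
        return 0.0
    unanalyzed = [w for w in types if not analyses_of(w)]
    return len(unanalyzed) / len(types)

# Blog documents can then be ranked by this score to select the most
# dialectal content for manual clean-up and annotation.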
Lexeme and Features Morphological analyses are rep- DA/MSA independent. Although as more Arabic variants
resented in terms of a lexeme and features. We define the are added, some modifications may be needed. Our current
lexeme to be a triple consisting of a root, a morphological MBC hierarchy specification for both MSA and Levantine,
behavior class (MBC), and a meaning index. We do not which covers only the verbs, comprises 66 classes, of which
deal with issues relating to word sense here and therefore 25 are abstract, i.e., only used for organizing the inheritance
do not further discuss the meaning index. It is through this hierarchy and never instantiated in a lexeme.
view of the lexeme (which incorporates productive deriva- MAGEAD Morphemes To keep the MBC hierarchy
tional morphology without making claims about semantic variant-independent, we have also chosen a variant-
predictability) that we can have both a lexeme-based repre- independent representation of the morphemes that the MBC
sentation, and operate without a lexicon (as we may need hierarchy maps to. We refer to these morphemes as abstract
to do when dealing with a dialect). In fact, because lex- morphemes (AMs). The AMs are then ordered into the
emes have internal structure, we can hypothesize lexemes surface order of the corresponding concrete morphemes.
on the fly without having to make wild guesses (we know The ordering of AMs is specified in a variant-independent
the pattern, it is only the root that we are guessing). Our context-free grammar. At this point, our example (1) looks
evaluation shows that this approach does not overgenerate. like this:
We use as our example the surface form HQëX P@ Aizda-
harat (Azdhrt without diacritics) “she/it flourished". The (2) [Root:zhr][PAT_PV:VIII]
M AGEAD lexeme-and-features representation of this word [VOC_PV:VIII-act] + [SUBJSUF_PV:3FS]
form is as follows:
Note that the root, pattern, and vocalism are not ordered
(1) Root:zhr MBC:verb-VIII POS:V PER:3 GEN:F with respect to each other, they are simply juxtaposed.
NUM:SG ASPECT:PERF The ‘+’ sign indicates the ordering of affixival morphemes.
Only now are the AMs translated to concrete morphemes
Morphological Behavior Class An MBC maps sets (CMs), which are concatenated in the specified order. Our
of linguistic feature-value pairs to sets of abstract mor- example becomes:
phemes. For example, MBC verb-VIII maps the feature-
value pair ASPECT:PERF to the abstract root morpheme (3) <zhr,V1tV2V3,iaa> +at
[PAT_PV:VIII], which in MSA corresponds to the concrete
root morpheme V1tV2V3, while the MBC verb-II maps AS- Simple interdigitation of root, pattern and vocalism then
PECT:PERF to the abstract root morpheme [PAT_PV:II], yields the form iztahar+at.
which in MSA corresponds to the concrete root morpheme MAGEAD Rules We have two types of rules. Mor-
1V22V3. We define MBCs using a hierarchical representa- phophonemic/phonological rules map from the morphemic
tion with non-monotonic inheritance. The hierarchy allows representation to the phonological and orthographic repre-
us to specify only once those feature-to-morpheme map- sentations. For MSA, we have 69 rules of this type. Ortho-
pings for all MBCs which share them. For example, the graphic rules rewrite only the orthographic representation.
root node of our MBC hierarchy is a word, and all Arabic These include, for example, rules for using the gemination
words share certain mappings, such as that from the lin- shadda (consonant doubling diacritic). For Levantine, we
guistic feature conj:w to the clitic w+. This means that have 53 such rules.
all Arabic words can take a cliticized conjunction. Sim- For our example, we get /izdaharat/ at the phonological
ilarly, the object pronominal clitics are the same for all level. Using standard MSA diacritized orthography, our
transitive verbs, no matter what their templatic pattern is. example becomes Aizdaharat (in transliteration). Remov-
We have developed a specification language for expressing P@
ing the diacritics turns this into the more familiar HQëX
MBC hierarchies in a concise manner. Our hypothesis is Azdhrt. Note that in analysis mode, we hypothesize all pos-
that the MBC hierarchy is Arabic variant-independent, i.e. sible diacritics (a finite number, even in combination) and
perform the analysis on the resulting multi-path automaton. types of surface forms for the search engine (the contextual
We follow (Kiraz, 2000) in using a multi-tape representa- material is left unchanged):
tion. We extend the analysis of Kiraz by introducing a fifth
tier. The five tiers are used as follows: Tier 1: pattern and • Mode 1: MSA inflected forms. For example, the
affixational morphemes; Tier 2: root; Tier 3: vocalism; Tier MSA query term iJ. @ ÂSbH ‘he became’ is expanded
4: phonological representation; Tier 5: orthographic repre-
sentation. In the generation direction, tiers 1 through 3 are to several MSA forms including AJj. @ ÂSbHnA ‘we
always input tiers. Tier 4 is first an output tier, and subse- became’, iJ. sySbH ‘he will become’, etc.
quently an input tier. Tier 5 is always an output tier.
We implemented our multi-tape finite state automata as a • Mode 2: MSA inflected with dialectal morphemes.
layer on top of the AT&T two-tape finite state transducers It is common in DA to borrow an MSA verb and in-
(Mohri et al., 1998). We defined a specification language flect it using dialectal morphology; we refer to this
for the higher multi-tape level, the new M ORPHTOOLS for- phenomenon as intra-word code switching. For exam-
mat. Specification in the M ORPHTOOLS format of different
ple, the MSA query term iJ. @ ÂSbH can be expanded
types of information such as rules or context-free gram-
mars for morpheme ordering are compiled to the appro- into iJ.Jë hySbH ‘he will become’ and @ñjJ.Jë hyS-
priate L EXTOOLS format (an NLP-oriented extension of bHwA ‘they will become’.
the AT&T toolkit for finite-state machines, (Sproat, 1995)).
For reasons of space, we omit a further discussion of M OR - • Mode 3: MSA lemma translated to a dialectal lemma,
PHTOOLS . For details, see (Habash et al., 2005). and then inflected with dialectal morphemes. For ex-
From MSA to Levantine and Egyptian We modified ample, the MSA query term iJ. @ ÂSbH can be ex-
M AGEAD so that it accepts Levantine rather than MSA
verbs. Our effort concentrated on the orthographic repre-
panded into EGY ù®K. bqý ‘he became’ and ù®J . Jë hy-
bqý ‘he will become’.
sentation; to simplify our task, we used a diacritic-free or-
thography for Levantine developed at the Linguistic Data
Currently, DIRA handles EGY and LEV; with the exis-
Consortium (Maamouri et al., 2006). Changes were done
tence of more resources for additional dialects, they will
only to the representations of linguistic knowledge, not to
be added. The DIRA system architecture is shown in Fig-
the processing engine. We modified the MBC hierarchy,
ure 2. After submitting an MSA query to DIRA, the verb is
but only minor changes were needed. The AM ordering
extracted out of its context and sent to the MSA verb lemma
can be read off from examples in a fairly straightforward
detector, which is responsible for analyzing an MSA verb
manner; the introduction of an indirect object AM, since
(using MAGEAD in the analysis direction) and computing
it cliticizes to the verb in dialect, would, for example, re-
its lemma (using MAGEAD in the generation direction).
quire an extension to the ordering specification. The map-
The next steps depend on the chosen dialects and modes.
ping from AMs to CMs, which is variant-specific, can be
If translation to one or more dialects is required, the in-
obtained easily from a linguistically trained (near-)native
put lemma is translated to the dialects (Mode 3). Then,
speaker or from a grammar handbook. Finally, the rules,
the MAGEAD analyzer is run on the lemma (MSA or DA,
which again can be variant-specific, require either a good
if translated) to determine the underlying morphemes (root
morpho-phonological treatise for the dialect, a linguisti-
and pattern), which are then used to generate all inflected
cally trained (near-)native speaker, or extensive access to
forms using MAGEAD (again, which forms are generated
an informant. In our case, the entire conversion from MSA
depends on the mode). Finally, the generated forms are
to Levantine was performed by a native speaker linguist in
re-injected in the original query context (duplicates are re-
about six hours. A similar but more limited effort was done
moved).
to extend the Levantine system to Egyptian by introducing
the Egyptian concrete morpheme for the future marker +ë
h+ ‘will’. 6. Conclusions and Future Work
We presented COLABA, a large effort to create resources
5. Resource Integration & Use: DIRA
and processing tools for Dialectal Arabic. We briefly de-
DIRA (Dialectal Information Retrieval for Arabic) is a scribed the objectives of the project and the various types
component in an information retrieval (IR) system for Ara- of resources and tools created under it. We plan to continue
bic. It integrates the different resources created above in its working on improving the resources and tools created so
pipeline. As mentioned before, one of the main problems of far and extending them to handle more dialects and more
searching Arabic text is the diglossic nature of the Arabic types of dialectal data. We are also considering branching
speaking world. Though MSA is used in formal contexts on into application areas other than IR that can benefit from
the Internet, e.g., in news reports, DA is dominant in user- the created resources, in particular, machine translation and
generated data such as weblogs and web discussion forums. language learning.
Furthermore, the fact that Arabic is a morphologically rich
language only adds problems for IR systems. DIRA ad-
dresses both of these issues. DIRA is basically a query-
Acknowledgments
term expansion module. It takes an MSA verb (and possi- This work has been mostly funded by ACXIOM Corpora-
bly some contextual material) as input and generates three tion.
73/119
Figure 2: DIRA system architecture
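To make the three expansion modes described in Section 5 concrete, the following is a minimal, purely illustrative sketch. The expansion table is a hand-written stand-in using the surface forms quoted in the paper's own examples; the actual system derives such forms with MAGEAD's root-and-pattern morphology rather than from a lookup table, and the function name and structure below are assumptions made only for illustration.

```python
# Toy illustration of DIRA-style query-term expansion (Modes 1-3).
# The table below is an invented, drastically simplified stand-in; the real
# pipeline generates forms morphologically with MAGEAD and then re-injects
# them into the original query context.
EXPANSIONS = {
    "ÂSbH": {                                   # MSA 'he became'
        1: ["ÂSbHnA", "sySbH"],                 # Mode 1: MSA inflected forms
        2: ["hySbH", "hySbHwA"],                # Mode 2: MSA verb + dialectal affixes
        3: ["bqý", "hybqý"],                    # Mode 3: EGY lemma + dialectal affixes
    },
}

def expand_query_term(term, modes=(1, 2, 3)):
    """Return the original term plus its expansions for the requested modes."""
    forms = [term]
    for mode in modes:
        forms += EXPANSIONS.get(term, {}).get(mode, [])
    # duplicates are removed, mirroring the final step of the pipeline
    return list(dict.fromkeys(forms))

if __name__ == "__main__":
    print(expand_query_term("ÂSbH"))
```

In use, each returned form would replace the verb in the original query while the surrounding contextual material is left unchanged, as described above.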
Alon Itai
Knowledge Center for Processing Hebrew
Computer Science Department
Technion, Haifa, Israel
E-mail: itai@cs.technion.ac.il
Abstract
The paper discusses searching a corpus for linguistic patterns. Semitic languages have complex morphology and ambiguous writing
systems. We explore the properties of Semitic languages that challenge linguistic search and describe how we used the Corpus
Workbench (CWB) to enable linguistic searches in Hebrew corpora.
3. CWB
CWB – the Corpus Workbench – is a tool created at the University of Stuttgart for searching corpora. The tool enables linguistically motivated searches; for example, one may search for a single word, say "interesting". The query language consists of Boolean combinations of regular expressions, which use the POSIX EGREP syntax, e.g. the query
"interest(s|(ed|ing)(ly)?)?"
yields a search for any of the words interest, interests, interested, interesting, interestedly, interestingly.

One can also search for a lemma, say "lemma=go", which should yield sentences containing the words go, goes, going, went and gone. The search can be focused on part of speech, "POS=VERB". CWB deals with incomplete specifications by using regular expressions. For example, a verb can be subcategorized as VBG (present/past) and VGN (participle). The query [pos="VB.*"] matches both forms and may be used to match all parts of speech that start with the letters VB ("." matches any single character and "*" after a pattern indicates 0 or more repetitions, thus ".*" matches any string of length 0 or more). Finally, a query may consist of several words; thus ["boy"][POS=VERB] yields all sentences that contain the word boy followed by a verb.

To accommodate linguistic searches, the corpus needs to be tagged with the appropriate data (such as lemma and POS). The system then loads the tagged corpus to create an index. To that end, the corpus should be reformatted in a special format.

CWB has been used for a variety of languages. It also supports UTF-8, thus allowing easy processing of non-Latin alphabets.

mxiibim
:mxaiib-ADJECTIVE-masculine-plural-abs-indef
:xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef
:xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present
:PREFIX-m-preposition-xiib-NOUN-masculine-plural-abs-indefinite
:PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef:

The analyses are:
1. The adjective mxiib, gender masculine, number plural, status absolute, and the word is indefinite;
2. The verb xiib: it is a participle of a verb whose binyan (inflection pattern of the verb) is Pi'el, the root is xwb, the person is unspecified, gender masculine, number plural, the type of participle is noun, the status absolute and the word is indefinite;
3. A verb whose root is xwb, binyan Pi'el, person unspecified, number plural and tense present;
4. The noun xiib, prefixed by the preposition m;
5. The adjective xiib, prefixed by the preposition m.

Thus one can retrieve the word by any one of the queries by POS:
[POS=".*-ADJECTIVE-.*"], [POS=".*-PARTICIPLE-.*"], [POS=".*-VERB-.*"], [POS=".*-NOUN-.*"].
However, one may also specify additional properties by using a pattern that matches subfields:
[POS=".*PREFIX-[^:]*preposition[^:]*-NOUN-.*"]
indicating that we are searching for a noun that is prefixed by a preposition. The sequence [^:]* denotes any sequence of 0 or more characters that does not contain ":" and is used to skip over unspecified sub-fields. Since the different analyses of a word are separated by ":" and ":" cannot appear within an analysis, the query cannot be satisfied by matching one part of the query against one analysis and the remainder of the query against a subsequent analysis.
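The effect of [^:]* can be illustrated with ordinary Python regular expressions (a toy stand-in for CWB's matcher, not the system itself); the analysis string is the mxiibim example above with its original line-wrapping restored. Because [^:]* cannot cross the ":" separators, a single match must stay within one analysis, whereas an unrestricted ".*" may spuriously combine material from two different analyses.

```python
import re

# Concatenated analyses of the word form mxiibim (one analysis per
# ':'-delimited segment), as in the example above.
pos = (":mxaiib-ADJECTIVE-masculine-plural-abs-indef"
       ":xiib-PARTICIPLE-Pi'el-xwb-unspecified-masculine-plural-abs-indef"
       ":xiib-VERB-Pi'el-xwb-unspecified-masculine-plural-present"
       ":PREFIX-m-preposition-xiib-NOUN-masculine-plural-abs-indefinite"
       ":PREFIX-m-preposition-xiib-ADJECTIVE-masculine-plural-abs-indef:")

# A noun prefixed by a preposition: [^:]* keeps the match inside one analysis.
print(bool(re.search(r"PREFIX-[^:]*preposition[^:]*-NOUN-", pos)))  # True

# With an unrestricted '.*' the match may straddle two analyses, here
# combining the root xwb of the verb/participle analyses with the NOUN
# of a different analysis:
print(bool(re.search(r"xwb.*-NOUN-", pos)))     # True, but spurious
print(bool(re.search(r"xwb[^:]*-NOUN-", pos)))  # False, as intended
```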
However, one must be careful to avoid queries of the type
[pos=".*NOUN.*" & pos=".*-singular-.*"]
since then we might return a word that has one analysis as a plural noun and another analysis as a singular verb.

5. Performance
To test the performance of the system we uploaded a file of 814,147 words, with a total of 1,564,324 analyses, i.e., 2.36 analyses per word. Table 1 shows a sample of queries and their performance. The more general the queries, the more time they required. However, the running time for these queries is reasonable. If the running time is linear in the size of the corpus, CWB should be able to support queries to 100 million word corpora. One problem we encountered is that of space. The index of the 814,147-word file required 25.2 MB; thus each word requires about 31 bytes, and a 100 million word corpus would require a 3.09 Gigabyte index file.

6. Writing Queries
Even though it is possible to write queries in the above format, we feel that it is unwieldy. First, the format is complicated and one may easily err. More importantly, in order to write a query one must be familiar with all the features of each POS and with the order in which they appear in the index. This is extremely user-unfriendly and we don't believe many people would be able to use such a system.

To overcome this problem, we are in the process of creating a GUI which will show, for each POS, the appropriate subfields; once a subfield is chosen, a menu will show all possible values of that subfield. Unspecified subfields will be filled by placeholders. The graphic query will then be translated to a CWB query and the results of this query will be presented to the user. We believe that the GUI will also be helpful for queries in languages that already use the CWB format.
[word="[]"עלword="[]"מנתpos=".*-VERB-[^:]*-infinitive:.*"]; 0.038 28
[word="[]"עלword="[]"מנתpos=".*PREFIX-ש.*"]; 0.025 5
[word="[]"ביתpos=":ספר.*"]; 0.017 7
[word="[]"ביתpos=".*:^[ספר:]*-SUFFIX-possessive-.*"]; 0.014 1
";"כותב 0.009
[pos=".*:[^:]*-VERB-[^:]*:.*"]; 0.569
".*"; 1.854
[pos=".*"]; 1.85
7. Conclusion
Until now, linguistic searches were oriented to Western languages. Semitic languages exhibit more complex patterns, which at first sight might require designing entirely new tools. We have shown how to reuse an existing tool to efficiently conduct sophisticated searches.

The interface of current systems is UNIX based. This might be acceptable when the linguistic features are simple; however, for complex features it is virtually impossible to memorize all the possibilities and render the queries properly. Thus a special GUI is necessary.

8. Acknowledgements
It is a pleasure to thank Ulrich Heid, Serge Heiden and Andrew Hardie who helped us use CWB. Last and foremost I wish to thank Gassan Tabajah whose technical assistance was invaluable.

9. References
… Association for Computational Linguistics, pp. 573--580, Ann Arbor.
Hajič, J. (2000). Morphological tagging: data vs. dictionaries. In Proceedings of NAACL-ANLP, pp. 94--101, Seattle, Washington.
Hajič, J. and Hladká, B. (1998). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL 1998, pp. 483--490, Montreal, Canada.
Itai, A. and Wintner, S. (2008). Language Resources for Hebrew. Language Resources and Evaluation, 42, pp. 75--98.
Lee, Y-S. et al. (2003). Language model based Arabic word segmentation. In ACL 2003, pp. 399--406.
Segal, E. (2001). Hebrew morphological analyzer for Hebrew undotted texts. M.Sc. thesis, Computer Science Department, Technion, Haifa, Israel.
Algerian Arabic Speech Database (ALGASD):
Description and Research Applications
Abstract
This paper presents the Algerian Speech Database (ALGASD) for Standard Arabic and related research applications. The project concerns 300 Algerian native speakers who were selected statistically from 11 regions of the country. These different areas are assumed to represent the principal variations of pronunciation observed across the population.
ALGASD takes into consideration several speaker features: gender, age and education level. The basic text used to elaborate the database consists of 200 phonetically balanced sentences, and the number of recordings reaches 1080 read sentences.
ALGASD provides an interesting set of perspectives for speech science applications, such as automatic speech recognition, acoustic-phonetic analysis and prosodic research. This voice bank has already been used in several studies, such as a rhythm analysis of Algerian speakers, and in the training and testing phases of speech recognition systems.
media (newspapers), universities, etc. (Arezki, 2008; Cheriguen, 1997).

3. Corpus Design
The text material of ALGASD is built from 200 Arabic Phonetically Balanced Sentences (APBS) (Boudraa, Boudraa & Guerin, 1998), from which we conceived three types of corpora. Each corpus aims to provide us with specific acoustic-phonetic knowledge. Common Corpus (Cc): used to list a maximum of dialectal variations of pronunciation observed among Algerians; it is composed of two utterances of the APBS read by all speakers. Reserved Corpus (Cr): brings together all existing phonetic oppositions in the Arabic language; it is endowed with 30 sentences of the APBS, which are divided into 10 texts of 3 sentences and shared between groups of speakers. In order to increase the occurrences of some consonants, we sometimes broke the balance. Individual Corpus (Ci): constituted of the 168 remaining sentences, which are used to gather a maximum of contextual allophones.

To elaborate ALGASD, we selected 300 Algerian speakers from 11 regions of the country, which map the most important variation of pronunciations between inhabitants. All participants are native speakers and had grown up in or near the localities selected for this research. According to the most recent census of inhabitants available on the ONS web site (ONS), we distributed all speakers statistically between these areas with regard to the real number and gender of inhabitants of each region (Table 1).

Regions            Female      Male        T. Speaker/Region
R1  Algiers        40 (50%)    40 (50%)    80 (27%)
R2  Tizi Ouzou     17 (50%)    17 (50%)    34 (11%)
R3  Medea          13 (52%)    12 (48%)    25 (8%)
R4  Constantine    13 (52%)    12 (48%)    25 (8%)
R5  Jijel          09 (50%)    09 (50%)    18 (6%)
R6  Annaba         09 (52%)    08 (48%)    17 (6%)
R7  Oran           19 (50%)    19 (50%)    38 (13%)
R8  Tlemcen        13 (50%)    13 (50%)    26 (9%)
R9  Bechar         04 (52%)    03 (48%)    07 (2%)
R10 El Oued        08 (50%)    08 (50%)    16 (5%)
R11 Ghardaïa       07 (50%)    07 (50%)    14 (5%)
Total (11)         152 (51%)   148 (49%)   300 (100%)
Table 1: Speakers' distribution in ALGASD

4. ALGASD Features
The speaker profile used in the database takes into consideration the age and education level of every speaker. For these two features we suggested, respectively, three categories: (18-30 / 30-45 / +45) and (Middle / Graduate / Post Graduate).

Recordings are made in quiet environments well known to the speakers. The same sound-recording conditions are respected for all regions. We selected the best readings, deleted all sentences which contained hesitations, and re-recorded utterances which were not spoken clearly or correctly, or which were too soft or too loud. The average duration of the sentences is about 2.8 seconds, and the rate of recording is normal. The sound files are in wave format, coded on 16 bits and sampled at 16 kHz.

5. Recordings
Recordings proceeded as follows. Every 3 texts of Cr are distributed periodically over the 11 regions. In the beginning, we shared these 3 texts among 3 speakers (2 male / 1 female), except for R9, which was endowed with only 2 texts for 2 speakers (1 male / 1 female). Afterwards, we augmented the number of recordings by increasing the number of speakers for each region (Table 2). The total numbers of speakers and recordings then reached 86 and 258 sound files respectively.

The Cc text material was read by all speakers of ALGASD (300 speakers), so the number of readings reached 600 recordings. As regards the Ci text, we realized 2 different sub-sets of recordings: the first one contains 32 utterances read by all speakers of the Cr corpus; the second one is constituted of 136 sentences statistically distributed among 136 other speakers across all regions. From this operation, two sentences remained; we added them to the R9 texts because that region contained the smallest number of speakers.

Region    M     F     Recordings
R1        12    11    69
R2        5     5     30
R3        4     3     21
R4        4     3     21
R5        3     2     15
R6        3     2     15
R7        6     5     33
R8        4     3     21
R9        1     1     6
R10       3     2     15
R11       2     2     12
Total     47    39    258 (86 speakers)
Table 2: Recordings of the Cr corpus

In conclusion, 28% of speakers read 6 sentences, 45% read 3 sentences and 26% read only 2. The total number of ALGASD recordings reached 1080 (Table 3).
Corpora   N° utterances   Speakers   Total
Cc        2               300        600 (55.5%)
Cr        30              86         258 (24.0%)
Ci        168             222        222 (20.5%)
TOTAL     200             300        1080 (100%)
Table 3: Total corpora and speakers of ALGASD

6. Research Applications of the ALGASD Speech Corpus
The ALGASD corpus is characterized by many aspects, such as: a high quality of recordings, a large number of speakers, and speaker features which reflect many differences due to region, age, gender, education level and dialect varieties. All these characteristics provide an interesting set of perspectives for speech science applications, such as: automatic speech recognition, acoustic-phonetic analysis, perceptual experiments to study the classification of the different regional varieties spoken within Algeria, prosodic studies such as rhythm, comparison of Algerian SA with the Arabic of Maghreb countries or eastern ones, etc.

The ALGASD database has already been used in several studies, such as: a statistical study of qualitative and quantitative vocalic variations according to the education levels of Algiers speakers (Droua-Hamadani, Selouani, Boudraa & Boudraa, 2009); the location of Algerian Standard Arabic rhythm among stressed languages (to appear); and the impact of education levels on duration and rhythm of Algerian Modern Standard Arabic (to appear). By respecting some recommendations in the selection and distribution of both sound material and speakers, we built from ALGASD the two corpora required to train and test a speech recognition system for Algerian Standard Arabic (Droua-Hamadani, Selouani, Boudraa & Boudraa, 2009).

7. References
Arezki, A. (2008). Le rôle et la place du français dans le système éducatif algérien. Revue du Réseau des Observatoires du Français Contemporain en Afrique, N° 23, pp. 21--31.
Boudraa, M., Boudraa, B. & Guerin, B. (1998). Twenty Lists of Ten Arabic Sentences for Assessment. ACUSTICA Acta-acustica, Vol. 84.
Cheriguen, F. (1997). Politique linguistique en Algérie. In Mots, Les langages du politique, n° 52, pp. 62--74.
Choukri, K., Hamid, S. & Paulsson, N. (2005). Specification of the Arabic Broadcast News Speech Corpus NEMLAR: http://www.nemlar.org.
Droua-Hamadani, G. et al. (2009). ALGASD PROJECT: Statistical Study of Vocalic Variations according to Education Levels of Algiers Speakers. Intonational Variation in Arabic Conference (IVA09), York, England.
Droua-Hamadani, G., Selouani, S.A. & Boudraa, M. (2009). ALGASD Algerian voice bank project: ALGASD's adaptation for a continuous speech recognition system. The 5th International Conference on Computer Science Practice in Arabic (CSPA '09, AICCSA09-IEEE), Rabat, Morocco.
Gopalakrishna, A. et al. (2005). Development of Indian Language Speech Databases of Large Vocabulary Speech Recognition Systems. Proceedings of the International Conference on Speech and Computer (SPECOM), Patras, Greece.
Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu.
Marçais, P. (1954). Textes arabes de Djidjelli. Presses universitaires de France.
Mohamed, A., Alghamdi, M. & Muzaffar, Z. (2007). Speaker Verification Based on Saudi Accented Arabic Database. International Symposium on Signal Processing and its Applications in conjunction with the International Conference on Information Sciences, Signal Processing and its Applications. Sharjah, United Arab Emirates.
National Office of Statistics (ONS): http://www.ONS.dz.
Petrovska, D. et al. (1998). POLYCOST: A Telephone-Speech Database for Speaker Recognition. Proceedings of RLA2C ("Speaker Recognition and its Commercial and Forensic Applications"), Avignon, France, pp. 211--214. (http://circhp.epfl.ch/polycost).
Integrating Annotated Spoken Maltese Data into Corpora
of Written Maltese
Alexandra Vella †*, Flavia Chetcuti†, Sarah Grech†, Michael Spagnol‡
University of Malta†, University of Cologne*, University of Konstanz‡
alexandra.vella@um.edu.mt, fchetcuti@hotmail.com, sgrec01@um.edu.mt, michael.spagnol@uni-konstanz.de
Abstract
Spoken data feature far less prominently than written data in the corpora available for most languages. This paper addresses this issue by
presenting work carried out to date on the development of a corpus of spoken Maltese. It outlines the standards for the PRAAT
annotation of Maltese data at the orthographic level, and reports on preliminary work on the annotation of Maltese prosody and
development of ToBI-style standards for Maltese. Procedures being developed for exporting PRAAT TextGrid information for the
purposes of incorporation into a predominantly written corpus of Maltese are then discussed. The paper also demonstrates how
characteristics of speech notoriously difficult to deal with have been tackled and how the exported output from the PRAAT annotations
can be enhanced through the representation also of phenomena, sometimes referred to as “normal disfluencies”, which include “filled
pauses” and other vocalisations of a quasi-lexical nature having various functions of a discourse-management type such as
“backchannelling”.
well as to “backchannels” or features involved in providing feedback to other participants in a dialogue. Some concluding remarks are provided in Section 6.

3. Structure of the annotations
As mentioned in Section 2, information of different types is included in separate tiers in the PRAAT annotations carried out. A sample of a very short extract from one recording, together with the associated annotation, is shown in Figure 1 below.

Figure 1: Sample excerpt from MC_CB_C1

3.1 Annotation tiers
The standards used in carrying out the annotations are summarised below in Subsections 3.1.1 – 3.1.4. Subsections 3.1.1 – 3.1.3 deal with the SP1 and SP2, Br-Pa-Os and FPs tiers respectively, while Subsection 3.1.4 deals with the remaining TIs and MISC tiers, as well as, briefly, with the prosodic Tone, Prom and Functions tiers.

3.1.1. SP1 and SP2 tiers
The word-by-word annotation makes use of standard orthography, including the new spelling rules published by Il-Kunsill Nazzjonali tal-Ilsien Malti in 2008, and all Maltese characters use Unicode codification (see Akkademja tal-Malti, 2004; Kunsill tal-Malti, 2008). In a number of cases, however, there is some variation with respect to regular standard orthography, as it is considered important for the word-by-word annotation to provide as close a record as possible of what was actually said. Thus, for example, in cases of elision of different sorts, a convention similar to that used in standard orthography (e.g. tazz’ilma for tazza ilma ‘a glass of water’), that is the use of an apostrophe, is extended to include initial elision, e.g. ’iġifieri for jiġifieri ‘that is to say’.1 There are also instances of insertions. In such cases, inserted segments are added to the transcription in square brackets (e.g. nagħmlu [i]l-proġett ‘we [will] perform the task’).

Capitalisation follows punctuation rules in Maltese. The first letter of target items (on which, see also Subsection 3.1.4) is always capitalised. When lexical stress in target items is misplaced, the syllable which has been stressed is capitalised in the annotation, e.g. the expected position of stress in the proper noun PERgola in the target item Hotel Pergola ‘Hotel Pergola’ is antepenultimate; capitalisation in the TextGrid annotation PerGOla indicates that stress was assigned penultimately by the speaker in this instance.2

1 Where possible, examples provided are taken from the Map Task annotations carried out.
2 Where an indication of lexical stress is necessary in the above, the syllable in question is shown in bold capitals.

Sentential punctuation marks such as question marks ( ? ) and full-stops ( . ) are included in the annotation, and are generally used in line with punctuation conventions rather than to indicate a fall or rise in pitch. Final punctuation marks such as exclamation marks ( ! ), ellipsis ( … ), quotation marks ( ‘ ’, “ ” ), etc., by contrast, have not been included in the annotation. The punctuation marks in this group are intended to indicate, in written text, the presence of elements typical of speech. Specifically, the exclamation mark indicates use of intonation of a particularly “marked” kind, whilst ellipsis often indicates a pause in speech or an unfinished sentence. Both these elements are catered for in tiers other than the SP1 and SP2 tiers (see Subsections 3.1.4 and 3.1.2 respectively). Quotation marks in written texts often indicate direct, as opposed to indirect, speech or narrative, not a relevant factor with respect to the annotation standards being discussed given that the texts in question consist solely of speech. Hyphens ( - ), accents ( ` ) and apostrophes ( ’ ) are used in the normal way as for written Maltese. Note that apostrophes are also used to indicate elision, as noted above. Internal punctuation marks such as dashes ( – ), semi-colons ( ; ), colons ( : ), and particularly commas ( , ), an important element of punctuation in written texts, are avoided although also catered for in the annotations (see Subsection 3.1.2). Such punctuation marks sometimes coincide with the location of phrase boundaries of different sorts, but do not always do so. Their use is not as clearly regulated as is that of other punctuation marks, and therefore would give poor results in terms of inter-transcriber reliability.

Phrasal units involving a determiner and a noun or adjective (e.g. il-bajja ‘the bay’, il-kbir ‘big, lit. the big’, etc.), as well as units with a particle plus determiner and noun or adjective (fid-direzzjoni ‘in the direction (of)’, tar-re ‘of the king’, etc.) are segmented together. Simple particles, on the contrary, are segmented as separate expressions from the word they precede (e.g. ta’ Mejju ‘of May’, fi Triq Ermola ‘in Ermola Street’, etc.). Additional conventions used which are at odds with standard punctuation are question marks at the beginning of a word to mark a dubious or unclear expression (e.g. ?Iwa/Imma ‘yes/but’), asterisks immediately before a word to mark an ungrammatical or non-existent expression in the language (e.g. il-bajja *tar-Ray lit. ‘Ray’s the bay’) as
used in linguistics to indicate unacceptability, and slashes on both sides of a word to indicate non-Maltese words (e.g. /anticlockwise/).

3.1.2. Br-Pa-Os tier
The Br-Pa-Os tier is used to indicate the presence of breaks and pauses, as well as that of overlap, in the dialogues. Examples of the distinctions made, together with a description of specific characteristics in each case, are given in Table 1 below.

Examples                                                         Characteristics
Triq Mar<...> Br Triq Mannarino.                                 False start or repair; truncation and correction;
  ‘Mar<...> Street Br Mannarino Street.’                         unexpected
Għaddi minn bej’ Sqaq il-Merill Br u Triq Marmarà.               Intra-turn break within constituent;
  ‘Go between Merill Alley Br and Marmarà Street.’               unexpected
Għaddi Br minn bejniethom.                                       Intra-turn break before adverbial;
  ‘Go Br between them.’                                          less unexpected
Mela Br tibda mill-Bajja ta’ Ray.                                Similar to comma; break before main clause;
  ‘So Br begin at Ray’s Bay.’                                    expected
u Triq Marmarà. Pa Ibqa’ tielgħa.                                Intra-speaker; full-stop break across sentences;
  ‘and Marmarà Street. Pa Keep walking upwards.’                 expected
SP1: Ibqa’ tielgħa. Pa-C SP2: Sewwa.                             Inter-speaker; full-stop break across sentences;
  ‘Keep walking upwards. Pa-C Good.’                             expected
SP2: M-hm? SP1: Għaddi minn bejn Sqaq il-...                     Inter-speaker overlap
  ‘M-hm? O Go between Merill Alley...’                           –
Table 1: Examples of Br-Pa-O distinctions made

Differentiation between breaks and pauses is based on a broad distinction between intra-sentence gaps, labelled Break, and inter-sentence ones, labelled Pause. Transcribers were instructed to allow intonation, as well as their intuitions, to inform their decisions. The distinction between Break and Pause correlates, roughly speaking, with the comma vs. full-stop distinction made in writing. Unexpected intra-speaker mid-turn pauses associated with the “normal disfluency”-type phenomena mentioned earlier (see Sections 1 and 2) are however also labelled as Break. Within-speaker pauses across sentences are distinguished from those across speakers by means of the labels Pause vs. Pause-C(hange). A study of both the distribution and the durational characteristics of breaks vs. pauses is planned. Such a study is expected to throw light on the nature of different types of phonological boundaries and related boundary strength in Maltese (but see also Section 3.1.4 below).

3.1.3. The FPs tier
This tier is a very important part of the annotations. It is used to note the position of any “non-standard forms” transcribed in the SP1 and SP2 tiers, such forms being roughly defined as “forms not usually found in a dictionary”. The FPs tier is extremely useful to the phonetician using PRAAT as her/his main analysis tool since it increases searchability (but see also Subsection 3.1.4).

One of the difficulties encountered in the annotation of “non-standard forms” is that such forms have no clearly recognisable “standard” representation, something which can prove problematic even to established writers. In fact, of the forms whose occurrence is noted in the FP tier, only six, namely “e”, “eħe”, “eqq”, “ew”, “ħeqq” and “ta”, are listed in Aquilina’s (1987/1990) dictionary. In addition, other researchers refer to different forms or to similar forms having functions that seem to be different to the ones in the data analysed (see, e.g., Borg & Azzopardi-Alexander, 1997; Mifsud & Borg, 1997).

Forms of the sort whose occurrence is noted in this tier are typically found in spontaneous speech and include phenomena such as “repetitions”, “repairs” and “filled pauses” (mentioned earlier and reported in Cruttenden, 1997). Such phenomena often serve clear functions and have their own specific characteristics, phonetic, including prosodic, as well as otherwise (see, e.g., Shriberg, 1999).

As things stand at present, in fact, the FPs tier conflates into one relatively undifferentiated group a number of different phenomena.3 Preliminary analysis of elements included in this tier reported by Vella et al. (2009) makes possible a distinction between “real” filled pauses (FPs) and other phenomena – the latter will also be discussed in this Subsection.

3 It is possible that the FP tier will in fact be renamed in subsequent annotation work.

Although consensus amongst researchers on what exactly “counts” as an FP is limited, linguists usually agree that FPs are discourse elements which, rather than contributing information, “fill” silences resulting from pauses of various sorts. They also agree that such elements can contribute meaning and/or communicative function but do not always do so, and that they often have a role to play in the organisation of discourse.

Preliminary analysis of the forms flagged in the FPs tier in the data annotated included a durational, distributional and phonetic (particularly prosodic) study of the forms. One outcome of the durational study is that it has made possible the standardisation of annotation guidelines for the various “non-standard forms” found in the data. To give one example, the original annotations include three “different” forms: e, ee and eee. The durational study carried out however suggests that the different labels do not in fact correlate with a difference in the duration of the entities in question: instances transcribed as eee are not in fact longer than instances transcribed as ee, which are not,
in turn, longer than instances transcribed as e: the one label eee is therefore being suggested as the “standard” for all occurrences of this type of FP and the original annotations have been amended accordingly. References below are to the labels as amended rather than to the original labels.

The analysis mentioned above has led to the identification of a number of “real” FPs in Maltese, similar to FPs described for other languages. These are eee, mmm and emm, all of which contain the element/s [e] and [m]. The analysis also noted two other “forms” of FPs annotated in the data, namely ehh and ehm. While the latter may be phonetic variants of the above-mentioned eee and emm, instantiations of ehh and ehm in the data annotated are significantly longer than their eee and emm counterparts. They also appear to have an element of “glottalisation” not normally characteristic of instances of eee and emm.4

4 It should be noted however that there may be an idiosyncratic element to these particular forms given that all instances of ehhs and ehms noted in the data come from the same speaker.

The distributional analysis of the “real” FPs eee, mmm and emm suggests that, overall, there is a very high tendency for silence to occur to the left, to the right, or on both sides of these FPs. A slightly greater tendency for these kinds of FPs (particularly eee and mmm) to occur following a silence, rather than preceding one, is exhibited. Analysis of the intonation of the “real” FPs identified is still ongoing; however, a preliminary characterisation of the intonation of such forms is one involving a long period of level pitch around the middle of the speaker’s pitch range (see also Vella et al., 2009).

A number of phenomena other than “real” FPs also occur in the data. The most important of these is a highly frequent class of forms involving “quasi-lexical” vocalisations such as m-hm and eħe/aħa/ija/iwa, which tend to have clear meanings (perhaps similar to Cruttenden’s 1997:175 “intonational idioms”). The form m-hm is particularly worthy of note. This was originally transcribed as mhm in the data annotated. The main reason for the use of the hyphen in the amended annotations is that this form is very different phonetically from the “real” FPs described above in that it is a two-syllable vocalisation having a specific intonational form consisting of a “stylised” rise in pitch from relatively low, level F0 on the first syllable, to higher, but still level F0 on the second syllable. The hyphenated form m-hm was thought to better mirror the characteristics of this vocalisation, thus rendering the orthographic annotation more immediately transparent to the reader.

M-hm parallels neatly with informal renderings of iva ‘yes’ such as ija and iwa, as well as with the more frequent eħe, in having a significant “backchannelling” function (see Savino & Vella, forthcoming). Two further short expressions are annotated in the FP tier: ta and ew. The former, very common in everyday conversation, but only found four times in the data annotated, is described by Aquilina as “short for taf, you know”, the latter as an “occasional variant of jew” (1990:1382; 1987:290). Although the status of both these vocalisations as “quasi-lexical” is unclear, they are mentioned here since they may share with m-hm the function of backchannelling mentioned earlier.

A third class of forms, in this case vocalisations which seem to be similar to the “ideophones and interjections” category listed by Borg & Azzopardi-Alexander (1997), also occurs in the data. The latter include forms which have been annotated in our data as follows: fff, eqq, ħeqq and ttt. The latter is described by Borg & Azzopardi-Alexander (1997:338) as an “alveolar click ttt [!]...commonly used to express lack of agreement with an interlocutor” and in fact it is this use that is attested in the data annotated, rather than a use indicating disapproval, often transcribed orthographically as tsk, and involving repetition of the click (see Mifsud & Borg, 1997).

3.1.4. Other tiers
The TIs tier indicates the presence, within the text, of Map Task target items. Target items included in the Map Task allow comparability across speakers and contain solely sonorant elements to allow for better pitch tracking, e.g. l-AmoRIN in Sqaq l-Amorin ‘Budgerigar Street’, l-Ewwel ta’ MEJju in Triq l-Ewwel ta’ Mejju ‘First of May Street’ and Amery in Triq Amery ‘Amery Street’. Target items were carefully selected to represent different syllable structure and stress possibilities in Maltese (see Vella & Farrugia 2006). The TIs tier is extremely useful to the phonetician using PRAAT as her/his main analysis tool since it increases searchability. It is not of great importance to the computational linguist, however, and will therefore not be considered further here.

As the tier name suggests, MISC contains miscellaneous information of different sorts. One example of a feature which gets recorded in the MISC tier is the case of inaudibly released final stops. Words ending in such plosives have been transcribed in standard orthography, a note also being added in the MISC tier to say that a final plosive had been inaudibly released. Cases of vowel coalescence, particularly at word boundaries, are also transcribed as in standard orthography, a note once more being inserted in the MISC tier to this effect. Other features noted in this tier include unexpected vowel quality realisations and idiosyncratic pronunciations including unusual use of aspiration, devoicing etc.; various “normal dysfluency”-type phenomena such as interruptions, abandoning of words, trailing off, unclear stretches of speech; also voice quality features such as creak and non-linguistic elements such as noise.

The contents of this tier are transcriber-dependent to an extent which is not fully desirable. Some general observations on dealing with features such as those in the MISC tier will be made in Section 4.
Only rudimentary guidelines are available to date in the case of the three prosodic tiers Tone, Prom and Functions, for which the following outline will suffice. The annotation in these tiers is intended as a means of furthering research on Maltese prosody, on issues such as the relationship between perceived stress and intonation, and the nature of the intonation patterns typical of Maltese and their distribution relative to discourse functions such as those involved in initiating a conversation, issuing an instruction, etc.

An important aim of this analysis is that of developing an adaptation for Maltese of the Tone and Break Indices (ToBI) framework for the annotation of prosody in the tradition of recent work in this area (see for example Silverman et al., 1992). Such an adaptation will impact on the development of standards for a Break Indices component, something which it is hoped the study of phonological boundaries and boundary strength mentioned above in 3.1.2 will in fact input into. A B(reak) I(ndices) tier does not in fact yet feature in the annotation carried out. Annotation of data using preliminary ToBI-style standards for Maltese based on the analysis of Maltese intonation carried out within the Autosegmental-Metrical framework (Pierrehumbert 1980; Ladd, 2008) by Vella (see, for example, 1995, 2003, 2007, 2009a, 2009b) should also contribute to further consolidation of the phonological analysis of Maltese prosody.

Typically, annotation begins in the Prom tier, to identify perceived prominence of accented and/or stressed syllables. With a stretch of speech thus highlighted as important in some way, related intonation patterns on the Tone tier, as well as discourse features on the Functions tier, can then be annotated. The decision to include a Prom tier is based on work by Grabe (2001) on the annotation of the IViE (Intonational Variation in English) corpus, which specifies how pitch movement is “anchored” to syllables marked as prominent. The Tone tier then describes tones in terms of the way these link to identified prominent syllables and boundaries. This feature of the annotations should prove useful in the case of Maltese since a distinction between different degrees of prominence at the Prom tier may make it possible to account not only for more common “pitch accent”-type phenomena (see, e.g., Bolinger, 1958), but also for phenomena of the so-called “phrase accent”-type identified for Maltese (see Vella, 2003; following Grice, Arvaniti & Ladd 2000). Annotation at the level of prosody is currently underway.

A further tier, the Functions tier, contains information relating to discourse features as detailed by Carletta et al. (1995) in the coding of the HCRC Map Task Corpus. Their system describes typical features of turn-taking in conversation such as “initiating moves” like INSTRUCT or EXPLAIN and “response moves” like ACKNOWLEDGE or CLARIFY.

3.3 Alignment of tiers
As mentioned earlier in this Section (see Subsection 3.1.1), an important feature of PRAAT-style annotations is the time-alignment of the waveform information to information in other tiers. Thus, the orthographic annotation of the spoken data goes hand in hand with the word-by-word segmentation in such a way that also allows information such as the starting time and ending time, and consequently the duration, of each “segment” to be captured. Thus, the information in the SP1 and SP2 tiers in particular, but more generally that in the separate tiers of the annotation, involves time-alignment either of particular intervals or of particular points in the waveform to the information in the other tiers of the annotation. This information, which is viewed in PRAAT as shown in Figure 1, is an extremely useful feature for the purposes of analysis. However, it poses a number of problems when it comes to incorporating the information from PRAAT TextGrid annotations into a corpus composed mainly of texts of a written form, and it is this issue which will be discussed in the next Section.

4. Preparing SPeech ANnotations for integration into corpora of Maltese
As mentioned above, TextGrid annotations, whilst useful to phoneticians, do not necessarily allow for straightforward incorporation into corpora consisting mainly of written texts. The annotations, though stored in .txt format, contain information which is as such “redundant” for corpus linguistics. The TextGrid information relevant to the short excerpt shown in Figure 1 has been extracted from the relevant TextGrid and presented in the Appendix. The information entered into each interval labelled is listed together with an indication of start time and end time by tier. In the case of point tiers – such as the Tone tier in our annotations, which is however not illustrated in Figure 1 – any label inputted into the TextGrid is listed together with its position in time.

Having produced the TextGrid annotations using PRAAT, it was considered necessary to establish procedures for exporting the information of relevance for the incorporation of samples of spoken Maltese in a predominantly written corpus of Maltese. The desired outcome is machine-readable text containing not only the orthographic transcriptions relevant to the contributions of the speakers in the dialogue in the form of a playscript, but also any information from the PRAAT annotations which would be useful for processing the spoken “texts” in line with principles established also for the written ones. For ease of reference, a conventional playscript-type transcript of the excerpt shown in Figure 1 is given below:

SP1: M-hm.
SP2: M-hm?
SP1: (Overlapping) Għaddi minn bej’ Sqaq il-Merill...u Triq Marmarà. Ibqa’ tielgħa.
(Go between Merill Alley...and Marmarà Street. Keep moving upwards.)
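As a minimal illustration of this step (and not the export procedure actually used in the project, which is described next), the sketch below derives such a playscript, with the gap to the preceding item given as a pause duration and negative values indicating overlap, from time-aligned word intervals. All interval values are invented stand-ins rather than the real times of the excerpt in Figure 1.

```python
# Toy sketch: word intervals (speaker, start, end, text) -> playscript with
# inter-item gaps. The values below are invented stand-ins, not the actual
# TextGrid times of the excerpt in Figure 1.
INTERVALS = [
    ("SP1", 18.71, 19.10, "M-hm."),
    ("SP2", 19.33, 19.70, "M-hm?"),
    ("SP1", 19.59, 21.20, "Għaddi minn bej’ Sqaq il-Merill"),
    ("SP1", 21.40, 22.10, "u Triq Marmarà."),
    ("SP1", 22.34, 22.60, "Ibqa’ tielgħa."),
]

def playscript(intervals):
    lines, prev_end = [], None
    for speaker, start, end, text in intervals:
        if prev_end is not None:
            gap = round(start - prev_end, 2)   # negative gap = overlap
            lines.append(f"({gap:+.2f} s)")
        lines.append(f"{speaker}: {text}")
        prev_end = end
    return "\n".join(lines)

if __name__ == "__main__":
    print(playscript(INTERVALS))
```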
Using a PRAAT script called GetLabelsOfIntervals_WithPauses (Lennes), it is possible to reduce the information shown in the Appendix as in Table 2 below:

Speaker   Words in sequence        Pause duration
                                   (0.23 s)
SP1:      M-hm.
                                   (0.22 s)
SP2:      M-hm?
                                   (-0.11 s)
SP1:      Għaddi
          Minn
          bej’
          Sqaq
          il-Merill
                                   (0.20 s)
          U
          Triq
          Marmarà.
                                   (0.24 s)
          Ibqa’
          tielgħa.
Table 2: Contents of .txt file following extraction of information from PRAAT TextGrid

Exporting selected information from the PRAAT TextGrids as shown above makes it possible for some important features of the annotation carried out to be retained in text which, unlike TextGrids in their raw form, is easy to incorporate into a corpus composed mainly of written texts. Although improvements on this preliminary attempt at exporting the data can be envisaged, the output of the script used can already be seen to contain a number of useful features. One of these is the fact that corpus-processing tasks such as paragraph splitting (the spoken equivalent of which would be the utterance) and sentence splitting, as well as tokenisation, would seem to be relatively straightforward given the word-by-word segmentation and annotation in the original format. POS tagging would need to proceed on the same lines as in the case of written texts.

A second very important feature of the output of the script is that it captures information on both pause and overlap, a very significant feature of speech which is completely absent from written texts. These two features are recorded in the right-hand column of the output, a positive value indicating a pause, a negative value overlap. It should be relatively straightforward for a script to be developed which will allow such information to be converted into the appropriate tags.

5. The encoding of features of spoken Maltese
Given conversion of PRAAT annotations in a way similar to that described in Section 4 above, there remain few obstacles to overcome. Assuming some kind of mark-up similar to that used in the BNC, and an element <u> (utterance) corresponding to the written text <p> (paragraph) element, grouping a sequence of <s> (sentence) elements (BNC User Reference Guide), the short dialogue above could be encoded as follows:

<u who=“SP1”>
<s...>
<pause dur=0.23s>
<w...“M-hm”>
<c c5=”PUN”>.</c>
<u who=“SP2”>
<w...“M-hm”>
<c c5=”PUN”>?</c>
<u who=“SP1”>
<s...>
<overlap dur=-0.11s>
<w...“Għaddi”>
<w...“minn”>
<w...“bejn”>
<w...“Sqaq”>
<w...“il-Merill”>
<pause dur=0.20s>
<w...“u”>
<w...“Triq”>
<w...“Marmarà”>
<c c5=”PUN”>.</c>
<s...>
<pause dur=0.24s>
<w...“Ibqa’”>
<w...“tielgħa”>
<c c5=”PUN”>.</c>

The above demonstrates that the output of the script used here already goes a long way towards helping us accomplish our purpose.

Some of the elements in the text, e.g. use of <...> to indicate elision, could easily be adapted for the purposes of automatic tagging – the BNC suggests the use of an element <trunc> in such cases. Information relating to other elements such as <unclear> (entered in the MISC tier in the current TextGrid annotations) could also be retrieved, albeit possibly in a less straightforward fashion.

One element of particular interest in the context of this paper is the element <vocal>. The BNC User Reference Guide describes this as: “(Vocalized semi-lexical) any vocalized but not necessarily lexical phenomenon for example voiced pauses, non-lexical backchannels, etc.”. One of the outcomes of the project SPAN is in fact a categorisation of different types of vocalisations as follows (see Vella et al., 2009):

1. “real” FPs such as eee, mmm and emm having an actual pause as their counterpart;
2. non-lexical vocalisations such as m-hm which parallel with quasi-lexical vocalisations such as eħe and with lexical words such as iva; and
3. “paralinguistic vocalisations” such as fff, ħeqq, ttt etc.

Given that, in all cases, the elements in these categories consist of a relatively closed set of items – which after all could be added to in cases of new items being identified – such elements should be relatively easy to identify on the basis of their orthographic rendering in the annotations, although one would need to assume proper training of transcribers to established standards and guidelines. It is being suggested here that vocalisations involving some
element of meaning would be better tagged as “words”, thus leaving the element “vocalisations” as a means of recording events of a purely non-linguistic nature. A full categorisation of different events of this sort, as well as of other vocalisations of a “quasi-lexical” nature, still awaits research.

6. Conclusion
In conclusion, standards and guidelines for the orthographic annotation, as well as preliminary standards for the annotation of prosody, of spoken Maltese are in place. Exportation of TextGrid information to a format more readily incorporable into corpora of written data is also doable. However, possibilities for automating or semi-automating procedures of conversion need to be explored. Lastly, improved knowledge of the workings of these relatively less described features of Maltese should serve not only to improve the quality of HLTs such as Text-to-Speech systems for Maltese, but also to improve methodologies for the evaluation of such HLTs.

7. Acknowledgements
We would like to thank the University of Malta’s Research Fund Committee (vote numbers 73-759 and 31-418) for making funding for the projects SPAN (1) and SPAN (2) available for the years starting January 2007 and January 2008.

8. References
Akkademja tal-Malti. (2004). Tagħrif fuq il-Kitba Maltija II. Malta: Klabb Kotba Maltin.
Aquilina, J. (1987/1990). Maltese-English Dictionary. Volumes 1 & 2. Malta: Midsea Books Ltd.
Boersma, P., Weenink, D. (2008). PRAAT: doing phonetics by computer. (Version 5.0.08). http://www.praat.org visited 11-Feb-08.
Bolinger, D. (1958). A theory of pitch accent in English. Word, 14, pp. 109--149.
Borg, A., Azzopardi-Alexander, M. (1997). Maltese. [Descriptive Grammars]. London/New York: Routledge.
British National Corpus, http://www.natcorp.ox.ac.uk/corpus/index.xml visited 20-Feb-10.
Carletta, J., Isard, A., Kowtko, J., Doherty-Sneddon, G., Anderson, A. (1995). The coding of dialogue structure in a corpus. In J.A. Andernach, S.P. van de Burat & G.F. van der Hoeven (Eds.), Proceedings of the Twentieth Workshop on Language Technology: corpus-based approaches to dialogue modelling, pp. 25--34.
Cruttenden, A. (1997). Intonation. 2nd edition. Cambridge: Cambridge University Press.
Dalli, A. (2001). Interoperable extensible linguistic databases. In Proceedings of the IRCS Workshop on Linguistics Databases, University of Pennsylvania, Philadelphia, pp. 74--81.
Gibbon, D., Moore, R., Winski, R. (1997). Handbook of Standards and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter.
Grabe, E. (2001). The IViE Labelling Guide. (Version 3). http://www.phon.ox.ac.uk/files/apps/IViE//guide.html# visited 12-Apr-10.
Grice, M., Ladd, D.R.L., Arvaniti, A. (2000). On the place of phrase accents in intonational phonology. Phonology, 17, pp. 143--185.
HCRC Map Task Corpus, http://www.hcrc.ed.ac.uk/maptask/ visited 10-Nov-09.
Kunsill tal-Malti, http://kunsilltalmalti.gov.mt/filebank/documents/Decizjonijiet1_25.07.08.pdf visited 15-Apr-10.
Ladd, D.R.L. (2008). Intonational Phonology. 2nd edition. Cambridge: Cambridge University Press.
Lennes, M. (2010). Mietta’s Praat scripts, GetLabelsOfIntervals_WithPauses script. http://www.helsinki.fi/~lennes/praat-scripts/ visited 18-Jan-10.
Mifsud, M., Borg, A. (1997). Fuq l-Għatba tal-Malti. [Threshold Level in Maltese]. Strasbourg: Council of Europe Publishing.
Pierrehumbert, J. (1980). The Phonetics and Phonology of English Intonation. Ph.D. thesis, MIT.
Rosner, M., Caruana, J., Fabri, R. (1998). MaltiLex: a computational lexicon for Maltese. In Proceedings of the COLING-ACL Workshop on Computational Approaches to Semitic Languages. Morristown, NJ: Association for Computational Linguistics, pp. 97--101.
Rosner, M., Caruana, J., Fabri, R., Loughraieb, M., Montebello, M., Galea, D., Mangion, G. (2000). Linguistic and computational aspects of MaltiLex. In Proceedings of ATLAS: The Arabic Translation and Localization Symposium. PLACE, pp. 2--9.
Rosner, M. (2009). Electronic language resources for Maltese. In B. Comrie, R. Fabri, E. Hume, M. Mifsud & M. Vanhove (Eds.), Introducing Maltese Linguistics. Amsterdam: John Benjamins, pp. 251--276.
Savino, M., Vella, A. (Forthcoming). Intonational backchannelling strategies in Italian and Maltese Map Task dialogues.
Shriberg, E.E. (1999). Phonetic consequences of speech disfluency. In Proceedings of the 13th International Congress of Phonetic Sciences 1999, San Francisco, CA, pp. 619--622.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. In Proceedings of the 1992 International Conference on Spoken Language Processing. Banff, Canada, pp. 867--870.
Vella, A. (1995). Prosodic Structure and Intonation in Maltese and its Influence on Maltese English. Unpublished Ph.D. thesis, University of Edinburgh.
Vella, A. (2003). Phrase accents in Maltese: distribution and realisation. In Proceedings of the 15th International Congress of Phonetic Sciences. Barcelona, pp. 1775--1778.
Vella, A. (2007). The phonetics and phonology of wh-question intonation in Maltese. In Proceedings of the 16th International Congress of Phonetic Sciences. Saarbrücken, pp. 1285--1288.
Vella, A. (2009a). Maltese intonation and focus structure. In R. Fabri (Ed.), Maltese Linguistics: A Snapshot. In Memory of Joseph A. Cremona. [Il-Lingwa Tagħna Vol. 1]. Bochum: Niemeyer, pp. 63--92.
Vella, A. (2009b). On Maltese prosody. In B. Comrie, R. Fabri, E. Hume, M. Mifsud & M. Vanhove (Eds.), Introducing Maltese Linguistics. Amsterdam: John Benjamins, pp. 47--68.
Vella, A., Farrugia, P-J. (2006). MalToBI – building an annotated
corpus of spoken Maltese. In Proceedings of Speech Prosody 2006, Dresden.
Vella, A., Chetcuti, F., Grech, S., Spagnol, M. (2008). SPeech ANnotation: developing guidelines for spoken corpora. Paper given at the 1st International Conference of Maltese Linguistics, Bremen.
Vella, A., Chetcuti, F., Grech, S., Spagnol, M. (2009). Interspeaker variation in the type and prosody of filled pauses in Maltese. Paper given at the 2nd International Conference of Maltese Linguistics, Bremen.
Yngve, V. (1970). On getting a word in edgewise. Papers from the Sixth Regional Meeting, Chicago Linguistic Society, Chicago, pp. 567--577.

Appendix
TextGrid information for the sample excerpt in Figure 1. The sequences of left and right angled brackets indicate the positions at which information from the different tiers of the original TextGrid was removed in order to make it possible for time-aligned information from the various tiers to be presented here.
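The Appendix lists, per tier, the labels of annotated intervals together with their start and end times. Purely as an illustration (this is not part of the paper's tooling), the sketch below pulls such (start, end, label) triples out of a long-format Praat TextGrid; it assumes the standard long text format, handles interval tiers only, and does not deal with point tiers or with quotation marks inside labels.

```python
import re

# Minimal extractor for Praat long-format TextGrids: returns, per tier,
# a list of (xmin, xmax, text) triples for the labelled intervals.
TIER_NAME_RE = re.compile(r'name = "([^"]*)"')
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "([^"]*)"')

def read_intervals(textgrid_text):
    tiers = {}
    # Split the file at tier boundaries ('item [n]:') and read each chunk.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name_match = TIER_NAME_RE.search(chunk)
        if not name_match:
            continue
        tiers[name_match.group(1)] = [(float(a), float(b), t)
                                      for a, b, t in INTERVAL_RE.findall(chunk)]
    return tiers

if __name__ == "__main__":
    sample = '''
    item [1]:
        class = "IntervalTier"
        name = "SP1"
        xmin = 18.71
        xmax = 22.6
        intervals: size = 1
        intervals [1]:
            xmin = 18.71
            xmax = 19.1
            text = "M-hm."
    '''
    print(read_intervals(sample))
```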
A Web Application for Dialectal Arabic Text Annotation
Yassine Benajiba and Mona Diab
Center for Computational Learning Systems
Columbia University, NY, NY 10115
{ybenajiba,mdiab}@ccls.columbia.edu
Abstract
Design and implementation of an application which allows many annotators to annotate data and enter the information into a central
database is not a trivial task. Such an application has to guarantee a high level of security, consistent and robust back-ups for the underly-
ing database, and aid in increasing the speed and efficiency of the annotation by providing the annotators with intuitive GUIs. Moreover
it needs to ensure that the data is stored with a minimal amount of redundancy in order to simultaneously save all the information while
not losing on speed. In this paper, we describe a web application which is used to annotate many Dialectal Arabic texts. It aims at
optimizing speed, accuracy and efficiency while maintaining the security and integrity of the data.
consistent annotated resources allows for the building of applications such as Information Retrieval, Information Extraction and Statistical Machine Translation on the DA data. In this project, we have targeted four Arabic dialects, namely: Egyptian, Iraqi, Levantine, and Moroccan. The harvested data is on the order of half a million Arabic blogs. The DA data is harvested based on manually created queries in the respective dialects as well as a list of compiled dialect specific URLs. Once the data is harvested, it is automatically cleaned of metadata and the content part is prepared for manual annotation.

The application that we present in this paper, COLANN GUI, is designed and implemented in the framework of the COLABA project. COLANN GUI is the interface used by the annotators to annotate the data with the relevant information. COLANN GUI uses two different servers for its front-end and back-end components. It also allows many annotators to access the database remotely. It offers several views depending on the type of user and the annotation task assigned to an annotator at any given time. The decision to develop an annotation application in-house was taken after unsuccessfully trying to find an off-the-shelf tool which can offer the functionalities we are interested in. Some of these functionalities are:

• Task dependency management: Some of the annotation tasks are dependent on each other whereas others are completely detached. It is important in our tasks to be able to manage the annotation tasks in a way that keeps track of each word in each sentence and organizes the information entered by the annotator efficiently. It is conceivable that the same word could have different annotations assigned by different annotators in different tasks, whereas most of the available tools do not have the flexibility to be tailored in such fashion; and

• Annotators' management: the tool should be able to allow the lead annotators to assign different tasks to different annotators at different times, help them trace the annotations already accomplished, and should allow them to give illustrative constructive feedback from within the tool with regards to the annotation quality.

Even though many of these annotation tools, such as GATE (Damljanovic et al., 2008; Maynard, 2008; Aswani and Gaizauskas, 2009), Annotea (Kahan et al., 2001) and MnM (Vargas-Vera et al., 2002) among others, have proven successful in serving their intended purposes, none of them was flexible enough to be tailored to the COLABA goals.

The remainder of this paper is organized as follows: We give an overview of the system in Section 2.; Section 3. illustrates the detailed functionalities of the application; Section 4. describes each of the annotation tasks handled by the application; we give further details about the database in Section 5.; and finally, some future directions are shared in Section 6..

2. Overall System View

COLANN GUI is a web application. We have chosen such a set up, in lieu of a desktop one, as it allows us to build a machine and platform independent application. Moreover, the administrator (or super user) will have to handle only one central database that is multi-user compatible. Furthermore, COLANN GUI is browser independent, i.e. all the scripts running in the background are completely browser independent, hence allowing all the complicated operations to run on the server side only. COLANN GUI uses PHP scripts to interact with the server database, and uses JavaScript to increase GUI interactivity.

Safety and security are essential issues to be thought of when designing a web application. For safety considerations, we employ a subversion network (SVN) and automatic back-up servers. For security considerations we organize our application in two different servers, both of which are behind several firewalls (see Figure 3).

Figure 3: Servers and views organization.
3. COLANN GUI: A Web Application

As an annotation tool, we have designed COLANN GUI with three types of users in mind: Annotators, Lead Annotators, and Super User. The design structure of COLANN GUI aims to ensure that each annotator is working on the right data at the right time. The Super User and Lead Annotator views allow for the handling of organizational tasks such as database manipulations, management of the annotators as well as control of in/out data operations.

Accordingly, each of these different views is associated with different types of permissions which connect to the application.

3.1. Super User View

The Super User has the following functionalities:

1. Create, edit and delete tables in the database

2. Create, edit and delete lead accounts

3. Create, edit and delete annotator accounts

4. Check the status of the annotation tasks for each annotator

5. Transfer the data which needs to be annotated from text files to the database

6. Generate reports and statistics on the underlying database

7. Write the annotated data into XML files

3.2. Lead Annotator View

The Lead Annotator view shares points 3 and 4 of the Super User view. In addition, this view has the following additional functionalities:

1. Assign tasks to the annotators

2. Check the annotations submitted by the annotators

3. Communicate annotation errors to the annotators

4. Create gold annotations for samples of the assignment tasks for evaluation purposes. Their annotations are saved as those of a special annotator

5. Generate inter-annotator agreement reports and other types of relevant statistics on the task and annotator levels

3.3. Annotator View

The annotator view has the following functionalities:

1. Check status of his/her own annotations

2. Annotate the assigned units of data

3. Check the overall stats of other annotators' work for comparative purposes

4. Check the speed of other annotators (anonymously and randomized) on a specific task once they submit their own

5. View annotations shared with them by the Lead Annotator

4. Annotation Tasks

A detailed description of the annotation guidelines goes beyond the scope of this paper. The annotation guidelines are described in detail in (Diab et al., 2010b). We enumerate the different annotation tasks which our application provides. All the annotation tasks can only be performed by a user of category Annotator, or Lead Annotator for the creation of the gold evaluation data. In all the tasks, the annotator is asked to either save the annotation work or submit it. If saved, they can go back and edit their annotation at a later time. Once the work is submitted, they are not allowed to go back and edit it. Moreover, the annotators always have direct access to the relevant task guidelines from the web interface by pressing the information button provided with each task.

The annotation tasks are described briefly as follows:

1. Typo Identification and Classification and Sentence Boundary Detection: The annotator is presented with the raw data as it is cleaned from the metadata but as it would have been present on the web. Blog data is known to have all kinds of speech effects and typos in addition to a severe lack of punctuation. Accordingly, the first step in content annotation is to identify the typos and have them classified and fixed, and in addition have sentence boundaries identified. The typos include: (i) gross misspellings: it is recognized that DA has no standard orthography, however many of the words are cognates/homographs with MSA; the annotator is required to fix the misspelling of such words, for example المساج AlmsAj, "the mosques", would be fixed and re-entered as المساجد AlmsAjd; (ii) speech effects: which consist of rendering words such as "Goaaaal" as "Goal" (a small normalization sketch is given after this task list); and (iii) missing spaces. The annotator is also asked to specify the kind of typo found. Figure 4 shows a case where the annotator is fixing a "missing space" typo. The following step is sentence boundary detection. This step is crucial for many of the language tools which cannot handle very long sequences of text, e.g. syntactic parsers. In order to increase the speed and efficiency of the annotation, we make it possible to indicate a sentence boundary by clicking on a word in the running text. The sequence of words is simply split at that click point. The annotator can also decide to merge two sequences of words by clicking at the beginning of a line, and it automatically appends the current line to the previous one. It is worth noting that all the tasks that follow depend on this step being completed. Once this task is completed, the data is sent to a coarse grained dialect identification (DI) pipeline described in detail in (Diab et al., 2010a). The result of this DI process is the identification of the
problem words and sequences that are not recognized by our MSA morphological analyzer, i.e. the words that do not exist in our underlying dictionaries.

2. Dialect annotation: For each word in the running text (after the content cleaning step mentioned before), the annotator is asked to specify its dialect(s) by picking from a drop down menu. Moreover, they are requested to choose the word's level of dialectalness on a given scale. Finally, they are required to provide the phonetic transcription of the word as specified in our guidelines on rendering DA in the COLABA Conventional Orthography (CCO). The GUI at this point only allows the annotator to submit his/her annotation work when all the words in the text are annotated. The annotators are given the option to mark a word as unknown.
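As promised in task 1 above, the speech-effect class of typos ("Goaaaal" for "Goal") lends itself to an automatic suggestion that the annotator can then accept or correct. The following is a minimal sketch of such a helper; it is our own illustration, not part of COLANN GUI, and the three-character threshold is an assumption.

import re

def collapse_speech_effects(token: str) -> str:
    """Collapse runs of three or more identical characters to a single one,
    e.g. 'Goaaaal' -> 'Goal'; legitimately doubled letters are left alone."""
    return re.sub(r'(.)\1{2,}', r'\1', token)

print(collapse_speech_effects("Goaaaal"))  # -> Goal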
5. The Database

For this application we need a sophisticated database. We use the freely available PostgreSQL management system (http://www.postgresql.org/). The database system is able to save all the annotations which we have described in Section 3., but it also holds other information concerning the users and their annotation times. Our database contains 22 relational tables, which can be split into the following categories:

• Basic information: We have created a number of tables which save basic information that is necessary for all the tasks. By saving such information in the database, our application becomes very easy to update and maintain. For instance, if we decide to add a new POS-tag we just have to add it to the appropriate table.

• Annotation: These tables are the core of the database. For each of the annotation tasks described in Section 4., they save all the information entered by the annotator while keeping the redundancy of the information at a minimum. For instance, for each sentence in the data, we want to save all the information entered about the dialect, the lemmas and the morphological information of the dialectal words while saving only one instance of the actual sentence, in only one table, and relating all the annotation records to it. Only by doing so are we able to save information about millions of words in the database while keeping it easily and efficiently accessible. Finally, we also save the time (in seconds) taken by the annotators to complete the annotation tasks. This information is necessary for inter-annotator speed comparison.

• Assignments: These tables hold information about how many task assignments have been assigned to each annotator and how many of them have already been annotated and submitted. This is directly related to the assignment task of the lead annotators described in Subsection 3.2..

• Users Permissions: When a user is created, we assign a certain category to her/him. This information is used by all the scripts to decide on user privileges.

• Connection: Whenever a user is connected, this information is communicated to the database. By doing so, we are able to prevent a user from connecting from two machines at the same time.
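To make the redundancy-minimizing design above concrete, here is a minimal relational sketch of the "one stored sentence, many annotation records" idea. The table and column names are our own illustration and only mirror the description; the actual COLANN GUI schema comprises 22 PostgreSQL tables.

import sqlite3

# One row per sentence; every annotation record only references it by key.
schema = """
CREATE TABLE sentence (
    sentence_id INTEGER PRIMARY KEY,
    text        TEXT NOT NULL              -- stored exactly once
);
CREATE TABLE annotation (
    annotation_id INTEGER PRIMARY KEY,
    sentence_id   INTEGER NOT NULL REFERENCES sentence(sentence_id),
    word_index    INTEGER NOT NULL,        -- position of the annotated word
    annotator_id  INTEGER NOT NULL,
    task          TEXT NOT NULL,           -- e.g. 'dialect', 'lemma', 'typo'
    value         TEXT NOT NULL,
    seconds_spent INTEGER                  -- enables inter-annotator speed comparison
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
conn.execute("INSERT INTO sentence (sentence_id, text) VALUES (1, '...')")
conn.execute(
    "INSERT INTO annotation (sentence_id, word_index, annotator_id, task, value) "
    "VALUES (1, 0, 7, 'dialect', 'Levantine')"
)
conn.commit()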
As mentioned in Section 1., our database is located on a separate server from the web server. This web server can
Figure 7: Illustrating example of the requested information when an annotator chooses the POS-tag Noun or Verb.
6. Future Work

We are constantly updating our interface, incorporating feedback from the annotators and lead annotators on the various tasks. The data that is annotated using our application is intended to build efficient models of four different dialects that cover all the major Arabic dialects. The models will be useful for several NLP applications:

Acknowledgments

This work has been funded by ACXIOM Corporation.

5 http://www.ssh.com/support/documentation/online/ssh/adminguide/32/Port Forwarding.html
7. References
N. Aswani and R. Gaizauskas. 2009. Evolving a general
framework for text alignment: Case studies with two
south asian languages. In Proceedings of the Interna-
tional Conference on Machine Translation: Twenty-Five
Years On.
Y. Benajiba, M. Diab, and P. Rosso. 2008. Arabic named
entity recognition using optimized feature sets. In Pro-
ceedings of EMNLP’08, pages 284–293.
T. Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No.: LDC2004L02, ISBN 1-58563-324-0.
D. Damljanovic, V. Tablan, and K. Bontcheva. 2008. A
Text-based Query Interface to owl Ontologies. In Pro-
ceedings of the 6th Language Resources and Evaluation
Conference (LREC).
M. Diab, N. Habash, O. Rambow, M. AlTantawy, and
Y. Benajiba. 2010a. COLABA: Arabic Dialect Annota-
tion and Processing. In Proceedings of the Language Re-
sources (LRs) and Human Language Technologies (HLT)
for Semitic Languages at LREC.
Mona Diab, Nizar Habash, Reem Faraj, and May Ahmar.
2010b. Guidelines for the Annotation of Dialectal Ara-
bic. In Proceedings of the Language Resources (LRs)
and Human Language Technologies (HLT) for Semitic
Languages at LREC.
C. A. Ferguson. 1959. Diglossia. Word, 15:325–340.
N. Habash, O. Rambow, M. Diab, and R. Kanjawi-Faraj.
2008. Guidelines for Annotation of Arabic Dialectness.
In Proceedings of the LREC Workshop on HLT & NLP
within the Arabic world.
J. Kahan, M.R. Koivunen, E. Prud’Hommeaux, and R.R.
Swick. 2001. Annotea: an open rdf infrastructure for
shared web annotations. In Proceedings of the WWW10
Conference.
D. Maynard. 2008. Benchmarking textual annotation tools
for the semantic web. In Proceedings of the 6th Lan-
guage Resources and Evaluation Conference (LREC).
M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni,
A. Stutt, and F. Ciravegna. 2002. Mnm: Ontology
driven semi-automatic and automatic support for seman-
tic markup. In Proceedings of the 13th International
Conference on Knowledge Engineering and Manage-
ment (EKAW).
Towards a Psycholinguistic Database for Modern Standard Arabic
Sami Boudelaa and William D. Marslen-Wilson
MRC Cognition and Brain Sciences Unit
sami.boudelaa@mrc-cbu.cam.ac.uk, William.marslen-wilson@mrc-cbu.cam.ac.uk
Abstract
To date, there are no Arabic databases that provide distributional information about orthographically disambiguated words and
morphemes. Here we present ARALEX (Arabic Lexical database), a new tool providing type and token frequency counts for vowelled
Arabic surface words, stems, bigrams and trigrams. The database also provides type and token frequency for roots and word patterns.
Token frequency counts are based on an automatically annotated 40 million word corpus derived from different Arabic newspapers,
while the type frequency counts are based on the Hans Wehr dictionary. This database is a valuable resource for researchers across
many fields. It is available to the community as a web interface and as a stand-alone downloadable application at:
http://www.mrc-cbu.cam.ac.uk:8081/ARALEX.online/login.jsp
1 We will be using the standard Buckwalter transliteration scheme throughout this article.
foreign words and proper Arabic and foreign nouns. For each stem we manually established the corresponding root and word pattern. This resulted in the identification of 6804 roots and 2329 word patterns. This dictionary is used as a look-up table to guide a deterministic parsing algorithm applied to each stem in the corpus.

The corpus consisted of 40 million words drawn from various Arabic online newspapers. The most challenging aspect of the corpus was the absence of vowel diacritics, which makes Arabic extremely ambiguous. In order to provide accurate frequency measures of surface forms and morphemes it was necessary to disambiguate the corpus by reinstating the missing vowels. Accordingly, we first stripped the corpus of its HTML tags, sliced it into manageable text files, and then submitted it to AraMorph (Buckwalter, 2002). For each word, defined as a string of letters with space on either side of it, AraMorph outputs (a) a vowelled solution for all the possible alternatives, (b) an exhaustive parse of the word into its component morphemes, and (c) a part-of-speech tag for each solution along with an English gloss.

To choose the correct solution for each word, we developed a novel automated technique using Support Vector Machines (Wilding, 2006). The output of this technique is a probability score reflecting the accuracy of the automatic vowelling, and an entropy score which measures the amount of uncertainty in the probability score. In initial testing on 792 K words from the Arabic Treebank, the accuracy of this automatic vowelling procedure was 93% when case endings were not taken into account, and over 85% when case endings were included. When applied to the full corpus, the procedure was accurate 80% of the time for fully diacritized forms and 90% of the time for forms without case endings. These figures were further cross-validated against a randomly chosen 500 K words of automatically vowelled text that were also hand-annotated by a team of native Arabic speakers in Egypt. The validation showed an overall accuracy of 77.9%, suggesting that the solutions chosen by the human annotators were also likely to be chosen by the automatic vowel diacritizer.

The orthographic form is defined as the graphic entity written with white space on either side of it. For instance, the phrase وسيتطلب [wsytTlb] "and it will require" is an orthographic form. The unpointed stem is the output of AraMorph once the clitics and the affixes have been removed. In the example above the unpointed stem is [tTlb] while the pointed stem is [taTal~ab]. The root for this stem is {Tlb} and the pattern {tafaE~al}.

The token frequency statistics are computed from occurrence counts in the 40 million word corpus as the rate of occurrence per 1 million words of text, given by

Freq(w) = occ(w) / (T / k),

where occ(w) is the number of occurrences of word w in the corpus, T is the total number of words in the corpus, and k = 1,000,000. The generation of token frequencies for orthographic forms consists simply in counting and normalizing the number of times each distinct orthographic form occurs in the corpus. Where the token frequencies of stems, roots and word patterns are concerned, the following procedure was followed: for each record in the corpus the pointed and unpointed stems are extracted, then their corresponding root and word pattern are located in the dictionary, and the occurrence of each of these four units (i.e., the pointed stem, the unpointed stem, the root, and the word pattern) is recorded. If a pointed stem is not found in the dictionary, the unpointed stem is used to match dictionary entries without diacritics to get a set of pointed stem candidates. Then all corresponding roots and patterns for that set of stems are located and recorded, thus increasing recall at the cost of potentially decreasing precision.

The type frequencies of roots and patterns are simply raw counts and are extracted from the dictionary. Finally, the character n-gram frequencies (bigrams and trigrams) are computed from the 40 million word corpus for orthographic forms, roots, and word patterns as follows:

Freq(g) = occ(g) / (T / k),

where occ(g) is the number of occurrences of n-gram g in the corpus, T is the total number of n-grams in the corpus, and k = 1,000,000.
allowing batch processing, and the output can be written into a file or displayed on the screen. To use the ARALEX CLI, the user needs to install Java JDK 5.0 or later, and Lucene 2.3.2 or later. An ARALEX command-line interface with Java class files and an ARALEX Lucene database index are also required and can be downloaded from the ARALEX website. Once these components are available and the Lucene core JAR is on the system classpath for the ARALEX CLI, the interface can be invoked by the command java SearchDB. If successful, this should display the input argument format, options, and field names. At this stage the program requires the directory containing the ARALEX index files to be specified. Invoking the command java SearchDB index_dir, where index_dir is the location of the database index, yields the prompt Enter query. From now on, any valid Lucene query can be entered (for further details refer to Boudelaa & Marslen-Wilson, 2010).
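For scripted, non-interactive use, the interactive prompt described above can also be driven from a wrapper; the sketch below simply feeds one query to the command quoted in the text via standard input. It assumes the class files, classpath and index are set up as described, and the example query field name is hypothetical (the actual field names are printed by java SearchDB).

import subprocess

def aralex_query(index_dir, query):
    """Invoke the ARALEX CLI (java SearchDB <index_dir>), send one Lucene
    query on stdin, and return whatever the tool prints on stdout."""
    result = subprocess.run(
        ["java", "SearchDB", index_dir],
        input=query + "\n",
        capture_output=True,
        text=True,
    )
    return result.stdout

# Hypothetical usage: print(aralex_query("/data/aralex_index", "stem:ktb"))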
6. Conclusion
ARALEX is the first Arabic lexical database to provide
frequency information about vowelled words, morphemes and
letter and phoneme bigrams. It allows experimental researchers
to design well controlled experiments, and provides a valuable
source of information for natural language processing
development. It can also be used to derive basic and/or more
advanced vocabulary lists tailored to the needs of various
language learners.
7. Acknowledgements
This work is supported by a British Academy Large Research
Grant (LRG42466) and MRC (grants U.1055.04.002.00001.01).
The authors would like to thank Sameh Al-Ansary, Ted Briscoe,
Tim Buckwalter, Hubert Jin, Mohamed Maamouri, Fermin
Moscoso del Prado Martin, Dilworth B. Parkinson and Mark
Wilding for their help at different stages of the project.
8. References
Boudelaa, S., & Marslen-Wilson, W.D. (2010). ARALEX: A lexical database for Modern Standard Arabic. Behavior Research Methods, in press.
Boudelaa, S., Pulvermüller, F., Hauk, O., Shtyrov, Y., & Marslen-Wilson, W.D. (2010). Arabic morphology in the neural language system. Journal of Cognitive Neuroscience, in press.
Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, catalog number LDC2002L49, ISBN 1-58563-257-0.
Ferguson, C. A. (1959). Diglossia. Word, 15, 325-340.
Holes, C. (1995). Modern Arabic: Structures, functions and varieties. London and New York: Longman.
Maamouri, M., & Bies, A. (2004). Developing an Arabic Treebank: methods, guidelines, procedures, and tools. Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, pp. 2-9. Geneva, Switzerland.
McCarthy, J. J. (1981). A prosodic theory of non-concatenative morphology. Linguistic Inquiry, 12, 373-418.
Plunkett, K., & Nakisa, R.C. (1997). A connectionist model of the Arabic plural system. Language and Cognitive Processes, 12, 807-836.
Versteegh, K. (1997). The Arabic Language. Edinburgh University Press.
Wehr, H. (1994). Arabic-English Dictionary. Spoken Language Services, Inc., Ithaca, NY.
Wilding, M. (2006). Bootstrapping Arabic pointing and morphological structure. MPhil in Computer Speech, Language and Internet Technology. University of Cambridge, St Catherine's College.
Creating Arabic-English Parallel Word-Aligned Treebank Corpora at LDC
Stephen Grimes, Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma, Stephanie Strassel
Linguistic Data Consortium, University of Pennsylvania
3600 Market Street, Suite 810, Philadelphia, PA USA
E-mail: { sgrimes, xuansong, bies, skulick, xma, strassel}@ldc.upenn.edu
Abstract
This contribution describes an Arabic-English parallel word aligned treebank corpus from the Linguistic Data Consortium that is
currently under production. Herein we primarily focus on efforts required to assemble the package and instructions for using it. It was
crucial that word alignment be performed on tokens produced during treebanking to ensure cohesion and greater utility of the corpus.
Word alignment guidelines were enriched to allow for alignment of treebank tokens; in some cases more detailed word alignments are
now possible. We also discuss future annotation enhancements for Arabic-English word alignment.
treebank corpora if the data inputs are not already sentence-wise parallel.

The Arabic Treebank (ATB) distinguishes between source and treebank tokens. While source tokens are generally whitespace-delimited words, the treebank tokens are produced using a combination of SAMA (Maamouri et al., 2009) for morphological analysis, selection from amongst alternative morphological analyses, and finally splitting of the source token into one or more treebank tokens based on clitic or pronoun boundaries.

For release as part of this corpus, the ATB and EATB are provided in Penn Treebank format (Bies et al., 1995). The trees are unmodified from ATB/EATB releases except that the tokens were replaced with token IDs. This structure is discussed in greater detail in Section 5.

3. Word Alignment Annotation

At the LDC, word alignment is a manual annotation process that creates a mapping between words or tokens in parallel texts. While automatic or semi-automatic methods exist for producing alignments, we avoid these methods. Manual alignment serves as a gold standard for training automatic word alignment algorithms and for use in machine translation (cf. Melamed 2001, Véronis and Langlais 2000), and it is desirable that annotator decisions during manual alignment not be biased through use of partially pre-aligned tokens. It is felt that annotators may accept the automatic alignment and thereby lower annotator agreement at the same time.

Using higher-quality manual alignment data as training data results in better machine translations. Fossum, Knight, and Abney (2008) showed that using Arabic and English parsers or statistical word alignment tools such as GIZA++ instead of gold standard annotations contributes to degradations in training data quality that significantly impact BLEU scores for machine translation. While automatic parsing and word aligning have their place in NLP toolkits, use of manually-annotated training data is always preferred if annotator resources are available.

3.1 Word Alignment Annotation Guidelines

LDC's word alignment guidelines are adapted from previous task specifications including those used in the BLINKER project (Melamed 1998a, 1998b). Single or multiple tokens (words, punctuation, clitics, etc.) may be aligned to a token in the opposite language, or a given token may be marked as not translated. Early LDC Arabic-English word alignment releases as part of the DARPA GALE program were generally based on whitespace tokenization.

Word alignment guidelines serve to increase annotator agreement, but different word alignment projects may have unique guidelines according to what is deemed translation equivalence. For example, are pronouns permitted to be aligned to proper nouns with which they are coindexed? Our point here is to encourage the corpus user to explore the alignment guidelines in detail to better understand the task.

3.3 Word Alignment and Tagging Tool

Word alignment is performed on unvocalized tokens rendered in Arabic script. LDC's word alignment tool allows annotators to simultaneously align tokens and tag them with metadata or semantic labels. A screenshot of the tool is shown in Figure 1.

The navigation panel on the right side of the software displays original (untokenized) source text to help annotators understand the context of surrounding sentences (which aids in, for example, anaphora resolution). Having untokenized source text also aids in resolving interpretation ambiguities that would arise if annotators could only see tokenized, unvocalized script.

3.3 Additional Tagging for Word Alignment

In addition to part-of-speech tags produced as part of treebank annotation, word alignment annotators have the option of adding certain language-specific tags to aid in disambiguation. A tagging task for Arabic-English has recently been added to the duties of word alignment annotators, and it is described as follows.

Unaligned words or phrases that have locally-related constituents to which to attach are tagged as "GLU" (i.e., "glue"). This indicates local word relations among dependency constituents. The following are some cases in which the GLU tag would be used:
- English subject pronouns omitted in Arabic.
- Unmatched verb "to be" for Arabic equational sentences.
- Unmatched pronouns and relative nouns when linked to their referents.
- Unmatched possessives ('s and ') when linked to their possessor.
- When a preposition in one language has no counterpart, the extra preposition attached to the object is marked GLU.
- Two or more prepositions in one language corresponding to one preposition on the other side; the unmatched preposition would be tagged as GLU.

It is hoped that the presence of the GLU tag provides a clue to understanding morphology better, and we will continue to explore using additional tags for this task.

4. Uniting Treebank and Word Alignment Annotation

This section describes efforts to join treebank and word alignment annotation.

4.1 Order of Annotation

The order of annotation in creating a parallel word-aligned treebank corpus is important. From the parallel corpus, the sentences can first be treebanked or word aligned. If word alignment were to proceed first, the tokens used for word alignment would serve as input to treebanking. However, treebank tokenization includes morphosyntactic analysis, and hence treebank tokenization is only determined manually during treebank annotation. For this reason, the preferred workflow is to only perform word-alignment annotation after experienced treebank annotators have fixed tokenization,
Figure 1. The PyQt-based tool used at LDC for word alignment annotation and tagging.
and it is this development trajectory we assume for the remainder of the paper.

4.2 Tokenization Modification

The word alignment guidelines were adapted so that annotation would be based on the treebank tokens instead of on source, whitespace tokens. As illustrated by the following examples, finer alignment distinctions may be made when pronouns are considered independent tokens. The examples below appear in the Buckwalter transliteration1 for convenience, but please note that the bilingual annotators work only with Arabic script.

Source: فزجّوه بالسجن
Transliterated: fa+ zaj~uwA +h b+ Alsijn
Morpheme gloss: and sent.3P him to jail
Gloss: "They sent him to jail."

Source: سيّارته معطّلة
Transliterated: say~Arap+h muEaT~alap
Morpheme gloss: car his broken
Gloss: "His car is broken."

transcripts (for Arabic) or translations (for English), so that cross-referential integrity with the original data and with English translations is maintained.

For both Arabic Treebank and English Treebank, quality control passes are performed to check for and correct errors of annotation in the trees. The CorpusSearch tool2 is used with a set of error-search queries created at LDC to locate and index a range of known likely annotation errors involving improper patterns of tree structures, node labels, and the correspondence between part-of-speech tags and tree structure. The errors found in this way are corrected manually in the treebank annotation files.

In addition, the Arabic Treebank (ATB) closely integrates the Standard Arabic Morphological Analyzer (SAMA) into both the annotation procedure and the integrity checking procedure. The interaction between SAMA and the Treebank is evaluated throughout the workflow, so that the link between the Treebank and SAMA is as consistent as possible and explicitly notated for each token.

1 We use the Buckwalter transliteration. Details are available at http://www.qamus.org/transliteration.htm.
2 CorpusSearch is freely available at http://corpussearch.sourceforge.net
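The first transliterated example above (fa+ zaj~uwA +h b+ Alsijn) shows why clitic-split treebank tokens permit finer alignment distinctions than whitespace tokens: a single source token carries a conjunction, a verb and an object pronoun, each with its own English counterpart. The following toy illustration simply reads the correspondences off the morpheme gloss; it is our own sketch, not the LDC release format or an official alignment.

# Treebank tokens of the first example and the gloss elements they map to.
treebank_tokens = ["fa+", "zaj~uwA", "+h", "b+", "Alsijn"]
gloss = {"fa+": "and", "zaj~uwA": "sent.3P", "+h": "him", "b+": "to", "Alsijn": "jail"}

# With whitespace tokenization the source sentence has only two tokens, so
# "him" could not be aligned separately; after clitic splitting the object
# pronoun "+h" can be aligned on its own.
for token in treebank_tokens:
    print(token, "->", gloss[token])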
Figure 2. File structure illustration
The word alignment file and the Arabic and English tree
files have token numbers which reference the Arabic and
English token files. Within the token files, each token
number for each sentence is expanded to give additional
information. For each token in the English token files, the
token number is listed, followed by a character range in
the raw file to which the token corresponds, and then
finally the token itself. For Arabic, multiple versions of
each token are provided (unvocalized, vocalized, input
string) and in multiple formats (Arabic script, Buckwalter
transliteration).
3 http://kitt.cl.uzh.ch/kitt/treealigner
International Conference on Language Resources and
Evaluation (LREC 2010).
Ma, X. (2006). Champollion: A Robust Parallel Text
Sentence Aligner. In LREC 2006: Fifth International
Conference on Language Resources and Evaluation.
Maamouri, M., Bies, A., Buckwalter, T., and Jin, H.
(2005). Arabic Treebank: Part 1 v 3.0 (POS with full
vocalization + syntactic analysis). LDC Cat. No.:
LDC2005T02.
Maamouri, M., Bies, A., Kulick, S., Zaghouani, W., Graff,
D., and Ciul, M.. (2010). From Speech to Trees:
Applying Treebank Annotation to Arabic Broadcast
News. In Proceedings of the Seventh International
Conference on Language Resources and Evaluation
(LREC 2010).
Maamouri, M., Graff, D., Bouziri, B., Krouna, S., and
Kulick, S. (2009). LDC Standard Arabic
Morphological Analyzer (SAMA) v. 3.0. LDC Catalog
No.: LDC2009E44. Special GALE release to be
followed by a full LDC publication.
Megyesi, B., Dahlqvist, B., Pettersson, E., and Nivre, J.
(2008). Swedish-Turkish Parallel Treebank. In
Proceedings of Fifth International Conference on
Language Resources and Evaluation (LREC 2008).
Melamed, D.I. (2001). Empirical Methods for Exploiting
Parallel Texts. MIT Press.
Melamed, D.I. (1998a). Annotation Style Guide for the
Blinker Project. University of Pennsylvania (IRCS
Technical Report #98-06).
Melamed, D.I. (1998b). Manual Annotation of
Translational Equivalence: The Blinker Project.
University of Pennsylvania (IRCS Technical Report
#98-07).
Palmer, M., Chiou, F.-D., Xue, N., and Lee, T.-K. (2005)
LDC2005T01, Chinese Treebank 5.0.
Véronis, J. and Langlais, P. (2000). Evaluation of Parallel
Text Alignment Systems -- The ARCADE Project. In J.
Véronis (ed.) Parallel Text Processing, Text, Speech
and Language Technology. Dordrecht, The Netherlands:
Kluwer Academic Publishers, pp. 369-388.
Using English as a Pivot Language to Enhance Danish-Arabic Statistical
Machine Translation
Mossab Al-Hunaity, Bente Maegaard, Dorte Hansen
Center for Language Technology
University of Copenhagen
musab@hum.ku.dk,bmaegaard@hum.ku.dk,dorteh@hum.ku.dk
Abstract
We inspect two pivot strategies for a Danish-Arabic statistical machine translation (SMT) system: a phrase translation pivot strategy and a sentence translation pivot strategy. English is used as the pivot language. We develop two SMT systems, Danish-English and English-Arabic, using different English-Arabic and English-Danish data resources. Our final results show that the SMT system developed under the sentence-based pivot strategy outperforms the system developed under the phrase-based pivot strategy, especially when common parallel corpora are not available.
Habash and Hu (2009) used English as a pivot language between Chinese and Arabic where the three languages in their system were based on the same text. Our work differs in that we train our two systems on two different, unrelated sets of data. This is due to the scarcity of parallel data resources between Danish and Arabic. Many pivot strategies are suggested in previous studies, as in Bertoldi et al. (2008), Utiyama and Isahara (2007) and Habash and Hu (2009). We choose to apply our experiments to two strategies, namely phrase translation and sentence translation, due to the available data resources and to hold more control over experiment conditions. We plan to inspect further techniques for the Danish-Arabic SMT system in future work. Our results show that using English as a pivot language is possible with partially comparable corpora and produces reasonable results. We find that the sentence translation strategy outperforms the phrase translation strategy, especially when no parallel or common resources are available. We compare our experimental results with Google Translate to judge system performance. Finally, we discuss future research directions we find interesting for enhancing our baseline performance. In the next section we describe related work. Section 3 presents our system description. In section 4 we describe our data and present our pivot experiment details. We present our system performance results in section 5. Finally we discuss our conclusions and future work in section 6.

2. Related Work

There has been a lot of work on translation from Danish to English, Koehn (2009), and from Arabic to English, Sadat and Habash (2006), Al-Onaizan and Papineni (2006). Many efforts were spent to overcome the lack of parallel corpora with pivot methods. For example, Resnik and Smith (2003) developed a technique for mining the web to collect parallel corpora for low-density language pairs. Munteanu and Marcu (2005) extract parallel sentences from large Chinese, Arabic, and English non-parallel newspaper corpora. Statistical machine translation with a pivot approach was investigated by many researchers. For example, Gispert and Mario (2006) used Spanish as a bridge for their Catalan-English translation. They compared two coupling strategies: cascading of two translation systems versus training of a system from parallel texts whose target part has been automatically translated from pivot to target. In their work they showed that the phrase translation strategy consistently outperformed the sentence translation strategy in their controlled experiments. Habash and Hu (2009) used English as a pivot language while translating from Arabic to Chinese. Their results showed that the pivot strategy outperforms direct translation systems. Babych et al. (2007) used Russian as a pivot from Ukrainian to English. Their comparison showed that it is possible to achieve better translation quality with the pivot approach. Kumar et al. (2007) improved Arabic-English MT by using available parallel data in other languages. Their approach was to combine word alignment systems from multiple bridge languages by multiplying posterior probability matrices. This approach requires parallel data for several languages, like the United Nations or European Parliament corpus. An approach based on phrase table multiplication is discussed in Wu and Wang (2007). A phrase table is formed for the training process. Scores of the new phrase table are computed by combining corresponding translation probabilities in the source-pivot and pivot-target phrase-tables. They also focused on phrase pivoting. They proposed a framework with two phrase tables: one extracted from a small amount of direct parallel data, and the other extracted from large amounts of indirect data with a third pivoting language. Their results were compared with many different European languages as well as Chinese-Japanese translation using English as a pivoting language. Their results show that simple pivoting does not improve over direct MT. Utiyama and Isahara (2007) inspected many phrase pivoting strategies using three European languages (Spanish, French and German). Their results showed that pivoting does not work as well as direct translation. Bertoldi et al. (2008) compare various approaches to PBSMT models with pivot languages. Their experiments were on Chinese-Spanish translation via disjoint or overlapped English as pivot language. We believe that we are the first to explore the Danish-Arabic language pair directly in MT. We also apply pivoting techniques to non-parallel text corpora.

3. System Description

In our work we develop two baselines for each experiment, Danish-English and English-Arabic. The translation direction is from Danish to Arabic. The Moses1 package is used for training the baselines. The system partitions the source sentence into phrases. Each phrase is translated into a target language phrase. We use GIZA++, Och and Ney (2003), for word alignment.

1: Moses Package http://www.statmt.org/moses/
We use the Pharaoh system suite to build the phrase table and decode (Koehn, 2004). Our language models for both systems were built using the SRILM toolkit, Stolcke (2002). We use a maximum phrase length of 6 to account for the increase in length of the segmented Arabic. Our distortion limit is set to 6. Finally, we use the BLEU metric, Papineni et al. (2001), to measure performance.

4. Pivot Strategy

We use the phrase-based SMT system described in the previous section to deploy our pivot methods. We inspect two pivot strategies: phrase translation and sentence translation. In both strategies we use English as the pivot language. Danish and Arabic represent the source and target languages. In the phrase translation strategy we directly construct a Danish-Arabic phrase translation table from a Danish-English and an English-Arabic phrase-table. In the sentence translation strategy we first translate a Danish sentence into n English sentences and translate these n sentences into Arabic separately. We select the highest scoring sentence from the Arabic sentences.

4.1 Sentence Translation Experiment

The sentence translation strategy uses two independently trained SMT systems: a direct Danish-English system and a direct English-Arabic system. We translate every Danish sentence d into n English sentences e {e1, e2, ..., en} using a Danish-English SMT system. Then we translate each sentence e into Arabic sentences a {a1, a2, ..., an}. We estimate the sentence pair feature score according to formula 1 below (a small sketch of this selection step is given after the data tables below).

S(s, t) = Σ_{n=1..8} (α_{s_n} β_{s_n} + α_{t_n} β_{t_n})    (1)

α_{s_n} β_{s_n} and α_{t_n} β_{t_n} are the feature functions for the source and target (s, t) sentences respectively. The feature functions represent: a trigram language model probability of the target language, two phrase translation probabilities (both directions), two lexical

We pass the translation with the maximum feature score as input to the English-Arabic system.

4.2 Phrase Translation Experiment

In the phrase translation strategy we need to construct a phrase table to train the phrase-based SMT system. We need a Danish-English phrase table and an English-Arabic phrase-table. From these tables, we construct a Danish-Arabic phrase table. We use a matching algorithm that identifies parallel sentence pairs among the tables. This process is explained in Munteanu and Marcu (2005). We identify candidate sentence pairs using a word-overlap filter tool. Finally we use a classifier to decide if the sentences in each pair are a good translation of each other and update our Danish-Arabic phrase table with the selected pair.

4.3 Data

Data collection was a great challenge for this experiment. Our data resources are from two groups: Arabic-English and English-Danish. Table 1 shows a brief description of our data resources. The English-Arabic corpora domain overlaps with the English-Danish corpora domain to some reasonable degree.

Name                     Direction        Domain                Size (words)
Acquis                   Danish-English   Legal issues / News   7.0 M
UN multilingual corpus   Arabic-English   Legal issues / News   3.2 M
Meedan                   Arabic-English   News                  0.5 M
LDC2004T17               Arabic-English   News                  0.5 M
Table 1: Corpus resources

Sample            Lines   Words
Small Training    30 K    1 M
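As noted in Section 4.1 above, the n-best selection over the pivot can be sketched as follows. The two decoder functions stand in for the trained Danish-English and English-Arabic systems, and the weight and feature names are ours; the scoring follows formula (1).

def pivot_translate(danish, da_en_nbest, en_ar_best,
                    weights_src, feats_src, weights_tgt, n=10):
    """Sentence-translation pivot: produce n English hypotheses, translate
    each into Arabic, and keep the Arabic output whose combined score
    S(s, t) = sum_n (alpha_s_n * beta_s_n + alpha_t_n * beta_t_n) is highest.

    da_en_nbest(sentence, n) -> list of English hypotheses
    en_ar_best(sentence)     -> (arabic_translation, target_feature_values)
    feats_src(sentence)      -> pivot-side feature values
    """
    best_arabic, best_score = None, float("-inf")
    for english in da_en_nbest(danish, n):
        arabic, feats_tgt = en_ar_best(english)
        score = sum(w * f for w, f in zip(weights_src, feats_src(english)))
        score += sum(w * f for w, f in zip(weights_tgt, feats_tgt))
        if score > best_score:
            best_arabic, best_score = arabic, score
    return best_arabic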
It enjoys a good quality of translation and it contains about 3.2 M lines of data and about 7 M words. The second resource was the Meedan1 corpus, which is a newly developed Arabic-English corpus mainly compiled from the internet and news agencies; it contains more than 0.5 M Arabic words. The third resource was provided by the LDC2 (catalog no. LDC2004T17); it contains more than 0.5 M words and also covers the news domain. For the English-Danish category we selected the Acquis3 corpus, which contains more than 8 K documents and more than 7 M words. Acquis contains many legal documents that cover many domains. English-Arabic resources were extracted and aligned using the Okapi4 translation memory editor. With the Acquis corpus we used the tools that are available at the Acquis website for extracting and aligning Danish-English text. All data were tokenized and lowercased separately. In order to inspect the size factor on our SMT system, the data were compiled into three sets: Large, Medium and Small. Table 2 illustrates the training data size for each set. For testing data we collected a parallel Arabic-English-Danish text from the UN Climate Change conference 2009 which was held in Copenhagen5. We extracted 1 K sentences for each language. The English-Arabic corpora domain overlaps with the English-Danish corpora domain to some reasonable degree. We are aware that there might be some bias in the coverage of the data resources, but due to data availability our corpora can still serve our experiment objectives. Given the expense involved in creating direct Arabic-Danish parallel text and given the large amounts of Arabic-English and English-Danish data, we think our approach to collecting data for our experiment is still valid and interesting.

5. Results and Evaluation

We measure our system performance using BLEU scores, Papineni et al. (2002). We compare our system performance with the Google Translate web service. Comparison with Google provides us with a general performance indicator for our system. Table 3 presents our direct translation system results for the DA-EN and EN-AR baselines. As expected, BLEU scores increase when we increase the training data size. We use the same testing data described in section 4.3 with Google Translate; results are described in Table 4. Google outperforms our direct system results, especially for the EN-AR direct translation.

Training Data Size   DA-EN   EN-AR
Small                20.3    25.1
Medium               21.4    26.3
Large                23.1    27.1
Table 3: BLEU scores for the direct sentence-based SMT systems.

Our direct DA-EN system BLEU score was 23.1, which is 64% of the Google system BLEU score, while the EN-AR system BLEU score was 27.1, which is 40% of the Google system BLEU score.

              DA-EN   EN-AR   DA-AR   DA-EN-AR
Test Sample   36.0    67.0    30.0    30.0
Table 4: BLEU scores for the Google Translate web service on our test sample.

In Table 5 we present the results of the sentence pivoting system and the phrase pivoting system. The sentence-based strategy outperforms the phrase-based strategy. For the large training data set the system achieved a score of 19.1 for the sentence-based system compared with 12.9 for the phrase-based strategy. This result differs from previous similar studies such as Utiyama and Isahara (2007) and Habash and Hu (2009), where the phrase pivot strategy outperformed the sentence strategy. The phrase pivot system was not better because of the quality and quantity of the DA-EN-AR phrase table entries received from the matching algorithm. The pivot system is dependent on the matching algorithm, and enhancing it will enhance system performance. Google DA-EN and DA-EN-AR results were the same. This is a good indicator that Google uses a pivot approach between languages with limited resources, as is the case for Arabic and Danish. Figure 1 presents a sample of our best performing system results, compared with the Google Translate web service. The sample shows the original text and its reference translation, and our system's translation of the same text.

1: Meedan http://github.com/anastaw/Meedan-Memory
2: LDC http://www.ldc.upenn.edu/
3: Acquis http://langtech.jrc.it/JRC-Acquis.html
4: Okapi http://okapi.sourceforge.net/
5: Cop15 http://en.cop15.dk
Size     Sentence Based Pivot Strategy (Da-En-Ar)   Phrase Based Pivot Strategy (Da-En-Ar)
Small    15.0                                       11.4
Medium   16.9                                       12.3
Large    19.1                                       12.9
Table 5: BLEU scores for the phrase-based and sentence-based pivot SMT systems.

6. Conclusion and Future work

Developing an SMT system for a language pair that does not share many linguistic resources, like Danish and Arabic, is a quite challenging task. We presented a comparison between two common pivot strategies: phrase translation and sentence translation. Our initial results show that the sentence pivot strategy outperforms the phrase strategy, especially when common parallel corpora are not available. We compared our system results with the Google Translate web service to estimate relative progress, and the results were promising. In the future we plan to enhance our pivoting techniques. The phrase pivot strategy is still a promising technique we need to utilize with our baseline. The phrase pivot strategy performs better when more parallel data resources are available, so we plan to collect more parallel training data for our baseline. We also plan to apply state-of-the-art alignment techniques and to use word reordering tools on our system training data. This will enhance our SMT system learning process. We also plan to train our SMT system to fit domain-specific areas like the weather or climate domains. We target high quality pivot techniques that will help us outperform available commercial tools like Google Translate, especially for domain-specific SMT areas.
Figure 1: Selected sample of system translation results, showing the Danish source (DA), the English and Arabic reference translations, the Google Translate output, and our system's output for the same text.
References

A. de Gispert and J. B. Mario. 2006. Catalan-English statistical machine translation without parallel corpus: bridging through Spanish. In Proc. of 5th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In ICSLP.

Bogdan Babych, Anthony Hartley, and Serge Sharoff. 2007. Translating from underresourced languages: comparing direct transfer against pivot translation. In Proceedings of MT Summit XI, Copenhagen, Denmark.

Chris Callison-Burch, Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In IWSLT.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. In Proceedings of HLT-NAACL'06, New York, NY, USA.

Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Proceedings of NAACL-HLT'07, Rochester, NY, USA.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.

Nicola Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008. Phrase-Based Statistical Machine Translation with Pivot Languages. In Proceedings of IWSLT, USA.

Nizar Habash and Jun Hu. 2009. Improving Arabic-Chinese Statistical Machine Translation using English as Pivot Language. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 173–181.
Using a Hybrid Word Alignment Approach for Automatic Construction and
Updating of Arabic to French Lexicons
Nasredine Semmar, Laib Meriama
CEA, LIST, Vision and Content Engineering Laboratory,
18 route du Panorama, Fontenay-aux-Roses, F-92265, France
nasredine.semmar@cea.fr, meriama.laib@cea.fr
Abstract
Translation lexicons are vital in machine translation and cross-language information retrieval. The high cost of lexicon development
and maintenance is a major entry barrier for adding new languages pairs. The integration of automatic building of bilingual lexicons
has the potential to improve not only cost-efficiency but also accuracy. Word alignment techniques are generally used to build bilingual
lexicons. We present in this paper a hybrid approach to align simple and complex words (compound words and idiomatic expressions)
from a parallel corpus. This approach combines linguistic and statistical methods in order to improve word alignment results. The
linguistic improvements taken into account refer to the use of an existing bilingual lexicon, named entities detection and the use of
grammatical tags and syntactic dependency relations between words. The word aligner has been evaluated on the MD corpus of the
ARCADE II project which is composed of the same subset of sentences in Arabic and French. Arabic sentences are aligned to their
French counterparts. Experimental results show that this approach achieves a significant improvement of the bilingual lexicon with
simple and complex words.
We present in section 2 the state of the art of aligning words from parallel text corpora. In section 3, the main steps to prepare parallel corpora for word alignment are described; we will focus, in particular, on the linguistic processing of Arabic text. We present in section 4 single and multi-word alignment approaches. We discuss in section 5 results obtained after aligning simple and complex words of a part of the ARCADE II MD (Monde Diplomatique) corpus. Section 6 concludes our study and presents our future work.

2. Related work

There are mainly three approaches for word alignment using parallel corpora:

• Statistical approaches are generally based on IBM models (Brown et al., 1993).

• Linguistic approaches for simple word and compound word alignment use bilingual lexicons

Machine translation systems based on IBM statistical models do not use any linguistic knowledge. They use parallel corpora to extract translation models and they use target monolingual corpora to learn a target language model. The translation model is built by using a word alignment tool applied on a sentence-to-sentence aligned corpus. This model can be represented as a matrix of probabilities that relates target and source words. The Giza++ tool (Och, 2003) implements this kind of approach, but its performance is proved only for aligning simple words. Approaches and tools for complex word alignment are at an experimental stage (DeNero & Klein, 2008).

3. Pre-processing the bilingual parallel corpus

A bilingual parallel corpus is an association of two texts in two languages, which represent translations of each other.
Two pre-processing tasks are involved on the two texts: sentence alignment and linguistic analysis.

3.1 Sentence alignment

Sentence alignment consists in mapping sentences of the source language with their translations in the target language. A number of sentence alignment approaches have been proposed (Brown et al., 1991; Gale & Church, 1991; Kay & Röscheisen, 1993).

sentences, one for each language. The sentence aligner uses a cross-language search to identify the link between the sentence in the source language and the translated sentence in the target language (Figure 1).
idiomatic expressions is performed by applying a set of rules that are triggered on specific words and tested on the left and right contexts of the trigger. These rules can recognize contiguous expressions such as "[Arabic]" (the white house).
• A Part-Of-Speech (POS) tagger which searches for paths through all the possible tag sequences using attested trigram and bigram sequences. The trigram and bigram sequences are extracted from a manually annotated (hand-tagged) training corpus. If no continuous trigram path covers the whole sentence, the POS tagger tries to use bigrams at the points where no trigram was found in the sequence. If no bigram allows completing the path, the word is left undisambiguated (a minimal sketch of this backoff strategy is given after Section 4.1.1). The following example shows the result of the linguistic analysis after Part-Of-Speech tagging of the Arabic sentence "[Arabic]" (In Italy, the order of things has persuaded in an invisible manner a majority of voters that the time of traditional political parties was over). Each word is represented as a string (token) with its lemma and morpho-syntactic tag (token: lemma, morpho-syntactic tag).
(1) [token]: [lemma], Preposition
(2) [token]: [lemma], Proper Noun
(3) [token]: [lemma], Verb
(4) [token]: [lemma], Common Noun
(5) [token]: [lemma], Definite Article
(6) [token]: [lemma], Common Noun
(7) [token]: [lemma], Preposition
(8) [token]: [lemma], Common Noun
(9) [token]: [lemma], Common Noun
(10) [token]: [lemma], Definite Article
(11) [token]: [lemma], Common Noun
(12) [token]: [lemma], Preposition
(13) [token]: [lemma], Common Noun
(14) [token]: [lemma], Adverb
(15) [token]: [lemma], Adjective
(16) [token]: [lemma], Preposition
(17) [token]: [lemma], Conjunction
(18) [token]: [lemma], Common Noun
(19) [token]: [lemma], Definite Article
(20) [token]: [lemma], Common Noun
(21) [token]: [lemma], Definite Article
(22) [token]: [lemma], Adjective
(23) [token]: [lemma], Preposition
(24) [token]: [lemma], Verb
(25) [token]: [lemma], Common Noun
(26) [token]: [lemma], Pronoun
• A syntactic analyzer which is used to split the graph of words into nominal and verbal chains and to recognize dependency relations by using a set of syntactic rules. We developed a set of dependency relations to link nouns to other nouns, a noun to a proper noun, a proper noun to a post-nominal adjective, and a noun to a post-nominal adjective. These relations are restricted to the same nominal chain and are used to compute compound words. For example, in the nominal chain "[Arabic]" (water transportation), the syntactic analyzer considers this nominal chain as a valid compound word "[Arabic]" composed of the words "[Arabic]" (transportation) and "[Arabic]" (water).
• A named entity recognizer which uses name triggers to identify named entities. For example, the expression "[Arabic]" (the first of March) is recognized as a date and the expression "[Arabic]" (Qatar) is recognized as a location.
• A module to eliminate empty words, which consists in identifying words that should not be used as search criteria and removing them. These empty words are identified using only their Part-Of-Speech tags (such as prepositions, articles, punctuation marks and some adverbs). For example, the preposition "[Arabic]" (for) in the agglutinated word "[Arabic]" (for transportation) is considered an empty word.
• A module to normalize words by their lemmas. When a word has several lemmas, only one of them is taken as the normalization, and each normalized word is associated with its morpho-syntactic tag. For example, the normalization of the word "[Arabic]" (pipelines), which is the plural of "[Arabic]" (pipeline), is represented by the couple ([Arabic], Noun).

4. Word alignment
Our approach to aligning simple and complex words adapts and enriches the methods developed by:
• (Debili & Zribi, 1996) and (Bisson, 2001), which use, on the one hand, a bilingual lexicon and the linguistic properties of named entities and cognates to align simple words, and, on the other hand, syntactic dependency relations to align complex words.
• (Giguet & Apidianaki, 2005), which uses the sequences of words repeated in the bilingual corpus and their numbers of occurrences to align compound words and idiomatic expressions.

4.1 Single-word alignment
Single-word alignment is composed of the following steps:
• Alignment using the existing bilingual lexicon.
• Alignment using the detection of named entities.
• Alignment using grammatical tags of words.
• Alignment using GIZA++.

4.1.1. Bilingual lexicon look-up
Alignment using the existing bilingual lexicon consists in extracting, for each word of the source sentence, the appropriate translation from the bilingual lexicon. The result of this step is a list of lemmas of source words for which one or more translations were found in the bilingual lexicon. The Arabic-to-French lexicon used in this step contains 124,581 entries.
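To make the backoff strategy of the POS tagger described earlier concrete, the following is a minimal sketch of a greedy left-to-right simplification of that path search. It is illustrative only: the names candidates, trigrams and bigrams are assumptions introduced for this sketch, not the system's actual implementation.

    # Minimal sketch of trigram-then-bigram backoff POS tagging (illustrative).
    # `candidates(word)` stands for the tags proposed for a word by the
    # morphological lexicon; `trigrams` and `bigrams` are the tag sequences
    # attested in the hand-tagged training corpus. All names are assumed.
    def tag_sentence(words, candidates, trigrams, bigrams):
        tags = []
        for word in words:
            options = candidates(word) or ["UNKNOWN"]
            chosen = None
            if len(tags) >= 2:
                # First try to extend the tag path with an attested trigram.
                chosen = next((t for t in options
                               if (tags[-2], tags[-1], t) in trigrams), None)
            if chosen is None and len(tags) >= 1:
                # Back off to bigrams where no trigram continues the path.
                chosen = next((t for t in options
                               if (tags[-1], t) in bigrams), None)
            if chosen is None:
                # No n-gram completes the path: the word stays undisambiguated.
                chosen = tuple(options) if len(options) > 1 else options[0]
            tags.append(chosen)
        return list(zip(words, tags))

The actual tagger searches the whole space of tag paths rather than proceeding greedily, but the backoff order is the same: trigrams first, then bigrams, then no disambiguation.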
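The four steps of Section 4.1 form a cascade in which each step only considers the source lemmas left unaligned by the previous ones. The sketch below shows that control flow; the helper functions match_named_entities, match_by_pos and giza_align, and their signatures, are assumptions introduced for illustration rather than the system's API.

    # Illustrative cascade for single-word alignment (Section 4.1).
    def align_single_words(src_lemmas, tgt_lemmas, lexicon,
                           match_named_entities, match_by_pos, giza_align):
        alignments = {}  # source lemma -> (translation, method)

        # Step 1: look-up in the existing Arabic-French bilingual lexicon.
        for lemma in src_lemmas:
            for trans in lexicon.get(lemma, []):
                if trans in tgt_lemmas:
                    alignments[lemma] = (trans, "lexicon")
                    break

        # Step 2: detection of named entities (proper nouns, numerical expressions).
        remaining = [l for l in src_lemmas if l not in alignments]
        for lemma, trans in match_named_entities(remaining, tgt_lemmas):
            alignments[lemma] = (trans, "named_entity")

        # Step 3: matching of grammatical (POS) tags.
        remaining = [l for l in src_lemmas if l not in alignments]
        for lemma, trans in match_by_pos(remaining, tgt_lemmas):
            alignments[lemma] = (trans, "pos_tags")

        # Step 4: GIZA++ alignments for whatever is still unaligned.
        remaining = [l for l in src_lemmas if l not in alignments]
        for lemma, trans in giza_align(remaining, tgt_lemmas):
            alignments[lemma] = (trans, "giza++")

        return alignments

Consistently with the scores later reported in Table 3, such a cascade yields the ten translations of Table 2: seven from the lexicon, "Italie" from named entity detection, "ordre" from grammatical tag matching and "persuasion" from GIZA++.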
Table 1 shows the results of the bilingual lexicon look-up for the Arabic sentence "[Arabic]" and its French translation "En Italie, l'ordre des choses a persuadé de manière invisible une majorité d'électeurs que le temps des partis traditionnels était terminé".

Lemmas of words of the source sentence    Translations found in the bilingual lexicon
[Arabic]                                  chose
[Arabic]                                  majorité
[Arabic]                                  électeur
[Arabic]                                  manière
[Arabic]                                  temps
[Arabic]                                  parti
[Arabic]                                  traditionnel

Table 1: Single-word alignment with the existing bilingual lexicon.

[...]

is taken directly from the parallel corpus. Table 2 illustrates the results after running the four steps of single-word alignment.

Lemmas of words of the source sentence    Translations returned by single-word alignment
[Arabic]                                  Italie
[Arabic]                                  ordre
[Arabic]                                  chose
[Arabic]                                  persuasion
[Arabic]                                  majorité
[Arabic]                                  électeur
[Arabic]                                  manière
[Arabic]                                  temps
[Arabic]                                  parti
[Arabic]                                  traditionnel

Table 2: Single-word alignment results.
[...] source sentence and the compound words of the target sentence. First, a syntactic analysis is applied to the source and target sentences in order to extract dependency relations between words and to recognize compound word structures. Then, reformulation rules are applied to these structures to establish correspondences between the compound words of the source sentence and the compound words of the target sentence. For example, the rule Translation(A.B) = Translation(A).Translation(B) allows the Arabic compound word "[Arabic]" to be aligned with the French compound word "parti traditionnel" as follows:

Translation([Arabic].[Arabic]) = Translation([Arabic]).Translation([Arabic]) = parti.traditionnel

In the same manner, this step aligns the compound word "[Arabic]" with the compound word "ordre_chose", even though the word "ordre" is not proposed as a translation of the word "[Arabic]" in the bilingual lexicon.

4.2.1. Idiomatic expressions alignment
In order to translate the missed compound words and the idiomatic expressions, we used a statistical approach which consists in:
• identifying the sequences of words which are candidates for alignment: for the two texts of the bilingual corpus, we compute the sequences of repeated words and their numbers of occurrences.
• representing these sequences with vectors: for each sequence, we record the numbers of the segments in which the sequence appears.
• aligning the sequences: for each sequence of the source text and each sequence of the target text, we estimate the value of the translation relation with a formula computed from these occurrence vectors (the original formula is not reproduced here; an illustrative stand-in is sketched after Table 3).

This step results in a list of single words, compound words and idiomatic expressions of the source sentence together with their translations. For example, for the previous Arabic sentence and its French translation, the multi-word aligner finds that the expression "manière invisible" is a translation of the Arabic expression "[Arabic]".

[...] built or updated automatically by these methods.

We have established a score for each type of alignment to facilitate the cleaning process of the bilingual lexicon built or updated automatically from the parallel corpus:
• An alignment link between single words found in the bilingual corpus and validated in the bilingual dictionary has a score of 1.
• An alignment link between single words found by the detection of named entities (proper nouns and numerical expressions) has a score of 0.99.
• An alignment link between single words found by matching grammatical tags has a score of 0.98.
• An alignment link between single words produced by GIZA++ has a score of 0.97.
• An alignment link between compound words that are literal translations of each other has a score of 0.96.
• An alignment link between compound words that are not translated word for word, or between idiomatic expressions, has a score of 0.95.

Table 3 presents the results after running all the steps of the word alignment process for simple and complex words.

Simple and complex words of the source sentence    Translations returned by word alignment    Score
[Arabic]                                           Italie                                     0.99
[Arabic]                                           ordre                                      0.98
[Arabic]                                           chose                                      1
[Arabic]                                           persuasion                                 0.97
[Arabic]                                           majorité                                   1
[Arabic]                                           électeur                                   1
[Arabic]                                           manière                                    1
[Arabic]                                           temps                                      1
[Arabic]                                           parti                                      1
[Arabic]                                           traditionnel                               1
[Arabic]                                           majorité_électeur                          0.96
[Arabic]                                           parti_traditionnel                         0.96
[Arabic]                                           temps_parti_traditionnel                   0.96
[Arabic]                                           ordre_chose                                0.96
[Arabic]                                           manière invisible                          0.95

Table 3: Single-word and multi-word alignment results.
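As a rough illustration of the reformulation rule Translation(A.B) = Translation(A).Translation(B), the sketch below composes the candidate translations of a two-word source compound and checks the compositions against the compound words extracted from the target sentence. The candidate translations come from the single-word alignment results of Section 4.1, which is how a compound such as "ordre_chose" can be aligned even though "ordre" is absent from the lexicon. The score values follow the list above; the function names, data structures and handling of word order are assumptions made for this sketch, not the system's actual implementation.

    # Illustrative alignment of a two-word compound A_B (Section 4.2),
    # using the link scores established for cleaning the bilingual lexicon.
    LINK_SCORES = {
        "lexicon": 1.0, "named_entity": 0.99, "pos_tags": 0.98, "giza++": 0.97,
        "compound_literal": 0.96, "compound_non_literal": 0.95,
    }

    def align_compound(src_compound, tgt_compounds, word_translations):
        """Translation(A.B) = Translation(A).Translation(B)."""
        a, b = src_compound.split("_")
        for ta in word_translations.get(a, []):
            for tb in word_translations.get(b, []):
                # Try both component orders, since Arabic and French do not
                # necessarily order the parts of a compound the same way.
                for candidate in (ta + "_" + tb, tb + "_" + ta):
                    if candidate in tgt_compounds:
                        return candidate, LINK_SCORES["compound_literal"]
        return None, None

Literal compound links obtained this way correspond to the alignments scored 0.96 in Table 3, such as parti_traditionnel and ordre_chose.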
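The association formula used in Section 4.2.1 to estimate the translation relation between repeated sequences is not reproduced above. The sketch below therefore uses a Jaccard-style overlap between the sets of segments in which each sequence occurs; this measure, the sequence length limit and the threshold are assumptions introduced purely to illustrate the idea of aligning sequences through their occurrence vectors, not the authors' measure.

    # Illustrative alignment of repeated word sequences (Section 4.2.1).
    # Each sequence is represented by the set of segment indices in which it
    # occurs; the Jaccard overlap below is an assumed stand-in for the
    # paper's translation-relation formula.
    def sequence_vectors(segments, max_len=5, min_count=2):
        """Map every repeated word sequence in `segments` to its occurrence set."""
        occurrences = {}
        for idx, words in enumerate(segments):
            for start in range(len(words)):
                for end in range(start + 1, min(start + max_len, len(words)) + 1):
                    occurrences.setdefault(tuple(words[start:end]), set()).add(idx)
        return {seq: occ for seq, occ in occurrences.items() if len(occ) >= min_count}

    def align_sequences(src_segments, tgt_segments, threshold=0.5):
        """Pair source and target sequences whose occurrence sets overlap most."""
        src_vecs = sequence_vectors(src_segments)
        tgt_vecs = sequence_vectors(tgt_segments)
        links = []
        for s_seq, s_occ in src_vecs.items():
            best, best_score = None, 0.0
            for t_seq, t_occ in tgt_vecs.items():
                score = len(s_occ & t_occ) / len(s_occ | t_occ)  # Jaccard overlap
                if score > best_score:
                    best, best_score = t_seq, score
            if best is not None and best_score >= threshold:
                links.append((s_seq, best, best_score))
        return links

Sequence pairs aligned this way, such as "manière invisible" and its Arabic counterpart, correspond to the links scored 0.95 in Table 3.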
Precision    Recall    F-measure
0.85         0.80      0.82

Table 4: Word alignment performance.

Analysis of the alignment results for the previous sentence (Table 3) shows, on the one hand, that 10 simple words (out of 14), 4 compound words and 1 idiomatic expression are correctly aligned, and, on the other hand, that 7 simple words are aligned with the bilingual lexicon, 1 simple word is aligned through named entity detection, 1 simple word is aligned by matching grammatical tags and 1 simple word is aligned with GIZA++.

For the whole corpus, 53% of the words are aligned with the bilingual lexicon, 9% are aligned through named entity detection, 15% are aligned by using grammatical tags and 4% are aligned as compound words or idiomatic expressions. Consequently, 28% of the words of the source sentences and their translations are added to the bilingual lexicon.

6. Conclusion
In this paper, we have presented a hybrid approach to word alignment combining statistical and linguistic sources of information (a bilingual lexicon, named entity detection, grammatical tags and syntactic dependency relations, and the numbers of occurrences of word sequences). The results we obtained showed that this approach improves word alignment precision and recall, and achieves a significant enrichment of the bilingual lexicon with simple and complex words. In future work, we plan to develop strategies and techniques, on the one hand, to filter the word alignment results in order to clean the bilingual lexicons built or updated automatically, and, on the other hand, to improve the recall of the statistical approach by using the existing bilingual lexicon and the results of the morpho-syntactic analysis of the parallel corpus.

7. Acknowledgements
This research work is supported by the WEBCROSSLING (ANR - Programme Technologies Logicielles - 2007) and MEDAR (Support Action FP7 - ICT - 2007 - 1) projects.

8. References
Barbu, A.M. (2004). Simple linguistic methods for improving a word alignment. In Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data.
Bisson, F. (2000). Méthodes et outils pour l'appariement de textes bilingues. Thèse de Doctorat en Informatique. Université Paris VII.
Blank, I. (2000). Parallel Text Processing: Terminology extraction from parallel technical texts. Dordrecht: Kluwer.
Brown, P.F., Lai, J.C., Mercer, R.L. (1991). Aligning Sentences in Parallel Corpora. In Proceedings of ACL 1991.
Brown, P.F., Pietra, S.A.D., Pietra, V.J.D., Mercer, R.L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(3).
Daille, B., Gaussier, E., Lange, J.M. (1994). Towards automatic extraction of monolingual and bilingual terminology. In Proceedings of the 15th International Conference on Computational Linguistics (COLING'94).
Debili, F., Zribi, A. (1996). Les dépendances syntaxiques au service de l'appariement des mots. In Proceedings of the 10ème Congrès Reconnaissance des Formes et Intelligence Artificielle.
DeNero, J., Klein, D. (2008). The Complexity of Phrase Alignment Problems. In Proceedings of ACL 2008.
Gale, W.A., Church, K.W. (1991). A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics.
Gaussier, E., Lange, J.M. (1995). Modèles statistiques pour l'extraction de lexiques bilingues. Traitement Automatique des Langues 36.
Giguet, E., Apidianaki, M. (2005). Alignement d'unités textuelles de taille variable. In Proceedings of the 4èmes Journées de la Linguistique de Corpus.
Kay, M., Röscheisen, M. (1993). Text-Translation Alignment. Computational Linguistics, Special issue on using large corpora, Volume 19, Issue 1.
Melamed, I.D. (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press.
Och, F.J. (2003). GIZA++: Training of statistical translation models. http://www.fjoch.com/GIZA++.htm.
Ozdowska, S. (2004). Appariement bilingue de mots par propagation syntaxique à partir de corpus français/anglais alignés. In Proceedings of the 11ème conférence TALN-RECITAL.
Smadja, F., McKeown, K., Hatzivassiloglou, V. (1996). Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics 22(1).
Semmar, N., Fluhr, C. (2007). Arabic to French Sentence Alignment: Exploration of a Cross-language Information Retrieval Approach. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources.
Veronis, J., Hamon, O., Ayache, C., Belmouhoub, R., Kraif, O., Laurent, D., Nguyen, T.M.H., Semmar, N., Stuck, F., Zaghouani, W. (2008). Arcade II : Action de recherche concertée sur l'alignement de documents et son évaluation. Chapitre 2, Editions Hermès.