
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/234136250

OSAC: Open Source Arabic Corpora

Conference Paper · November 2010

DOI: 10.13140/2.1.4664.9288


Published at the 6th International Conference on Electrical and Computer Systems (EECS’10), Nov
25-26, 2010, Lefke, North Cyprus.

OSAC: Open Source Arabic Corpora

Motaz K. Saad
Faculty of Information Technology
Islamic University of Gaza
Gaza, Palestine
e-mail msaad@iugaza.edu.ps

Wesam Ashour
Computer Engineering Department
Islamic University of Gaza
Gaza, Palestine
e-mail washour@iugaza.edu.ps

Abstract: Arabic linguistics is a promising research field. The acute lack of free, publicly accessible Arabic corpora is one of the major difficulties that Arabic linguistics research faces. This paper is a step towards supporting research in this field. It presents the complex nature of the Arabic language and poses two problems: (1) the lack of free public Arabic corpora, and (2) the lack of high-quality, well-structured Arabic digital content. Finally, the paper presents OSAC, the largest freely accessible Arabic corpus, which we collected.

Keywords: Arabic Language, Arabic corpora, Arabic Digital contents.

1. INTRODUCTION

Arabic is the 5th most widely used language in the world. It is spoken by more than 422 million people as a first language and by 250 million as a second language [8]. Arabic has 3 forms: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). CA includes classical historical liturgical text, MSA includes news media and formal speech, and DA includes predominantly spoken vernaculars and has no written standards. The Arabic alphabet consists of the following 28 letters (أ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي), in addition to the Hamza (ء). Unlike English letters, Arabic letters have no upper or lower case. The letters (أ و ي) are vowels, and the rest are consonants. Unlike Latin-based alphabets, Arabic is written from right to left.

Table 1: Diacritics

Double consonant | No vowel | Nunation | Vowel
بّ | بْ | بٍ  بٌ  بً | بِ  بُ  بَ
/bb/ | /b/ | /bin/ /bun/ /ban/ | /bi/ /bu/ /ba/

حقوق الانسان
حقــــوق الانســــان
حقــــــــوق الانســــــــان
حقــــــــــــوق الانســــــــــــان
Fig. 1: Tatweel (kasheeda)

The Arabic script has numerous diacritics, including i'jam (إعجام), consonant pointing, and tashkil (تشكيل), supplementary diacritics. The latter include the ḥarakat (حركات, singular ḥaraka حركة), vowel marks. The literal meaning of tashkil is "forming". As normal Arabic text does not provide enough information about the correct pronunciation, the main purpose of tashkil (and ḥarakat) is to provide a phonetic guide or aid, i.e. to show the correct pronunciation (doubling a consonant in pronunciation or acting as short vowels). The ḥarakat, which literally means "motions", are the short vowel marks [7]. Arabic diacritics include Fatha, Kasra, Damma, Sukūn, Shadda, and Tanwin. The pronunciations of these diacritics are presented in Table 1. Arabic words may also have Tatweel (kasheeda), as shown in Figure 1.

Arabic words have two genders, masculine (مذكر) and feminine (مؤنث); three numbers, singular (مفرد), dual (مثنى), and plural (جمع); and three grammatical cases, nominative (الرفع), accusative (النصب), and genitive (الجر). A noun is in the nominative case when it is a subject (فاعل), in the accusative when it is the object of a verb (مفعول), and in the genitive when it is the object of a preposition (مجرور بحرف جر). Words are classified into three main parts of speech: nouns (اسماء), including adjectives (صفات) and adverbs (ظروف); verbs (افعال); and particles (ادوات).

Although Arabic is widespread, there is an acute lack of well-structured, high-quality Arabic digital content. There is also a lack of free and public Arabic corpora. This paper presents OSAC, open source Arabic corpora that cover different text genres and can be used in the future as a benchmark.

The rest of this paper is organized as follows: Section 2 presents the complexity of the Arabic language; Section 3 discusses the lack of Arabic corpora and Arabic digital content, describes the corpora building steps, and presents the collected corpora; and finally Section 4 draws the conclusion.

2. COMPLEXITY OF ARABIC LANGUAGE

Arabic is a challenging language for a number of reasons:

• Orthography (الاملاء) with diacritics is less ambiguous and more phonetic in Arabic; certain combinations of characters can be written in different ways [7].

• Arabic has short vowels that give different pronunciations. They are grammatically required but omitted in written Arabic texts [7].
• Arabic has a very complex morphology compared to English [1, 9, 12, 13].
• Synonyms are widespread. Arabic is a highly inflectional and derivational language [1, 8, 12, 13].
• Lack of publicly and freely accessible Arabic corpora [3, 4, 5, 6, 13].
• Lack of Arabic digital content [11, 13].

In the following, we discuss these points in detail.

Word meanings: It is possible to identify different meanings associated with a word, because one word may have more than one meaning in different contexts. Table 2 shows the Arabic word (قلب), which has 3 meanings as a noun.

Table 2: The meanings of the word (قلب) as a noun

Word meaning | Sentence
core | في قلب الاحداث
heart | اجرى عملية قلب مفتوح
center, middle | الكرة في قلب الملعب

Variations in lexical category: One word may have more than one lexical category (noun, verb, adjective, etc.) in different contexts, as shown in Table 3. Morphological analysis of a given corpus includes investigating the frequency of a word in each lexical category.

Table 3: The lexical category of the word (عين)

Word meaning | Word category | Sentence
Ain | Proper noun | عين جالوت
wellspring | Noun | عين الماء
eye | Noun | عين الانسان
appoint/be appointed | Verb/passive verb | عين وزيرا للخارجية

Synonyms: Languages have many words that are considered synonymous. Through a given corpus, researchers can use morphological analysis tools to find the synonyms of a word, the frequency of each synonym, and which of them is more common. Examples of synonyms in Arabic are (بذل منح اعطى وهب), which mean (give), (اسرة عائلة), which mean (family), and (فصل صف), which mean (classroom).

The word form according to its case: The form of some Arabic words may change according to their case (nominative, accusative or genitive). For instance, the plural of the word (مسافر), which means (traveler), takes the form (مسافرون) in the nominative case (مرفوعة) and the form (مسافرين) in the accusative/genitive case (منصوبة/مجرورة). Arabic light stemming can handle these cases.

Morphological characteristics: An Arabic word may be composed of a stem plus affixes and clitics. The stem consists of a consonantal root (جذر صحيح) and a pattern morpheme (اصغر كلمة ذات معنى). The affixes include inflectional markers (علامات او حركات اعرابية) for tense, gender, and/or number. The clitics include some prepositions (حروف جر), conjunctions (حروف العطف), determiners (محددات), possessive pronouns (ضمائر الملكية) and pronouns (ضمائر). Clitics attached to the beginning of a stem are called proclitics and those attached to the end of it are called enclitics. Most Arabic morphemes are defined by three consonants, to which various affixes can be attached to create a word. For example, from the tri-consonant root "ktb" (كتب), we can inflect (يصرف) several different words concerning the idea of writing, such as (wrote كَتب), (book كِتاب), (the book الكِتاب), (books كُتُب), (he writes يكْتُب), (writer كاتِب), and (library مكْتَبَة). Moreover, an Arabic word may correspond to several English words. For example, the Arabic word (وبنفوذها) is equivalent to the English “and with her influences”. This makes segmentation of Arabic textual data different from, and more difficult than, that of Latin-script languages.

The affix set in Arabic is shown in Table 4, and Arabic patterns (الأوزان) and roots are shown in Table 5. The word (علم) may take various meanings by adding different affixes (prefixes, infixes, or suffixes), as shown in Table 6. Another example of morphological variation, the word (يذهب), which means (go), is presented in Table 7.

Table 4: Affix set in Arabic Language

Affixes in Arabic | Examples
Length 3 prefixes | بال ، كال ، وال ، ولل
Length 2 prefixes | لل ، ال
Length 1 prefixes | ا ، ن ، ت ، ي ، و ، س ، ف ، ب ، ل
Length 3 suffixes | كمل ، تين ، تان ، همل ، تمل
Length 2 suffixes | نا ، هن ، كم ، تن ، ين ، ان ، ات ، ون ، هم ، ما ، وا ، ني ، كن ، تم ، ها ، يا
Length 1 suffixes | ن ، ا ، ت ، ك ، ي ، ه ، ة

Table 5: Arabic Patterns and Roots

Arabic patterns (الأوزان) and roots | Examples
Length 4 pattern | فاعل فعلة فعال مفعل
Length 5 pattern and length 3 roots | تفاعل افتعل افعال فعالة فعلان فعولة تفعلة تفعيل مفعلة مفعول فاعول فواعل مفاعل مفعيل افعلة فعائل منفعل مفتعل فاعلة مفعال يفتعل تفتعل فعالي انفعل
Length 5 pattern and length 4 roots | تفعلل افعلل مفعلل فعللة فعلان فعالل
Length 6 pattern and length 3 roots | استفعل مفاعلة افتعال افعوعل انفعال مستفعل
Length 6 pattern and length 4 roots | افتعلل افعلال متفعلل

Stemming is usually used to convert words to their root form. It dramatically reduces the complexity of Arabic morphology by reducing the number of features/keywords in corpora. Stemming is used as a feature/keyword reduction technique because the morphological variants of a word mostly share the same contextual meaning, although this is not always true. Table 8 shows some of these cases. There is another approach for morphology reduction that just removes

affixes and does not convert the word to its base/root form. This approach is called light stemming [13, 14].

Table 6: Versions of the word (علم) and their meanings when adding affixes

Meaning | Suffix | Infix | Prefix | Word
Scientific | ية | *** | *** | علمية
Taught us | تنا | *** | *** | علمتنا
His science | ه | *** | *** | علمه
Scientists | اء | *** | *** | علماء
Teaching | *** | ي | ت | تعليم
Sciences | *** | و | *** | علوم
Informative | يه | ا | است | استعلامية

Table 7: Morphological variations of the word (ذهب)

Verb | Tense | # of participants | Gender of subject
ذهب | Past | 1 | Male
ذهبت | Past | 1 | Female
ذهبا | Past | 2 | Male
ذهبتا | Past | 2 | Female
ذهبوا | Past | 3 or more | Male
ذهبن | Past | 3 or more | Female
يذهب | Present | 1 | Male
تذهب | Present | 1 | Female
سيذهب | Future | 1 | Male
ستذهب | Future | 1 | Female
سيذهبوا | Future | 3 or more | Male
سيذهبن | Future | 3 or more | Female

Table 8: Different meanings of words derived from the same root in Arabic

Meaning | Root | Word
Classroom | فصل | الفصل الدراسي
Apartheid | فصل | الفصل العنصري
Goes out of the house | خرج | يخرج من البيت
Graduated from university | خرج | تخرج من الجامعة
The fisherman twisted the cord | جدل | جدل الصياد الحبل
The student argued with the teacher | جدل | جادل الطالب المدرس
He aims the arrow | صوب | انه يصوب السهم
The man lost his mind | صوب | فقد الرجل صوابه

Encoding Problem: Arabic has display problems (encoding issues) because it is encoded differently on different machine platforms. Figure 2 shows the problem of using an incorrect encoding, where all circled cells are displayed correctly while the other cells are displayed incorrectly. Text preprocessing, mining, and information retrieval with an incorrect encoding may lead to incorrect results. Table 9 presents the characteristics of two common Arabic encoding systems: Unicode and code page 1256 (CP-1256, Arabic Windows).

Fig. 2: Arabic Encoding Problem

Table 9: Unicode vs. CP-1256 Arabic Windows encoding

Unicode | CP-1256 Arabic Windows
Becoming the standard more and more | Commonly used
2-byte characters | 1-byte characters
Widely supported input/display | Widely supported input/display
Supports extended Arabic characters | Minimal support for extended Arabic characters
Multi-script representation | Bi-script support (Roman/Arabic)
Supports presentation forms (shapes and ligatures) | Tri-lingual support: Arabic, French, English (a la ANSI)

3. ARABIC CORPORA

Corpus-based approaches to language have introduced new dimensions to linguistic description and various applications by permitting some degree of automatic analysis of text. The identification, counting and sorting of words, collocations and grammatical structures that occur in a corpus can be carried out quickly and accurately by computer, thus greatly reducing some of the human drudgery sometimes associated with linguistic description and vastly expanding the empirical basis [3, 4]. Linguistic research has become heavily reliant on text corpora over the past ten years. Text data mining is a multidisciplinary field involving information retrieval, text analysis, information extraction, clustering, categorization and linguistics. Text mining is becoming more significant, and efforts to retrieve the increasingly available information efficiently have multiplied [3, 4].

Because of the increasing need for an Arabic corpus to represent the Arabic language, and because the attempts to build one in the last few years were not enough to give Arabic a real, representative and reliable corpus, it was necessary to build such a corpus to support various linguistic research on Arabic [3, 4]. Thus, one of the difficulties that Arabic language research encounters is the lack of publicly available Arabic corpora [3, 4, 5, 6]. The Arabic corpus problem was posed by [3, 4, 5, 6]. A survey by [3, 4] confirms that existing corpora are too narrowly limited in source type and genre, and that there is a need for a freely accessible Corpus of Contemporary Arabic (CCA) covering a broad range of text types. Because Arabic lacks corpora, it is difficult to present textual content and quantitative data about Arabic. Al-Ansary et al. [3, 4] discussed three axes in their paper. The 1st axis is a survey of the importance of corpora in language studies, e.g. lexicography, grammar, semantics, Natural Language Processing and other areas. The 2nd axis demonstrates how the Arabic language lacks textual resources, such as corpora and tools for corpus analysis, and the effect of this lack on the quality of Arabic language applications. Successful attempts at compiling Arabic corpora are rare; therefore, the 3rd axis presents the technical design of the International Corpus of Arabic (ICA), a newly established representative corpus of Arabic that is intended to cover the Arabic language as used all over the Arab world. The corpus is planned to

support various Arabic studies that depend on authentic (اصيلة) data, in addition to building Arabic Natural Language Processing applications.

The International Corpus of Arabic (ICA) is a big project initiated by Bibliotheca Alexandrina (BA). BA is one of the international Egyptian organizations that play a noticeable role in disseminating culture and knowledge, and in supporting scientific research. ICA is a real attempt to build a representative Arabic corpus, as used all over the Arab world, to support research on Arabic [3, 4]. The ICA corpus has been analyzed by Al-Ansary et al. in [4], who shed light on the levels of corpus analysis, e.g. morphological analysis, lexical analysis, syntactic analysis and semantic analysis. Al-Ansary also demonstrates different available tools for Arabic morphological analysis (Xerox, Tim Buckwalter, Sakhr and RDI). The morphological analysis of ICA includes selecting and describing the model of analysis, the pre-analysis stage, and the full text analysis stages. ICA is not publicly available now and is expected to be released soon [3, 4].

3.1 Arabic Digital Content

Arabic is the fastest-growing language on the internet, with Arabic-speaking internet users increasing 2,298 per cent from 2000 to 2009, according to the Internet World Statistics Report (internetworldstats.com). The number of internet users in the Middle East and North Africa (Mena) region has leapt from 3.2 million users in 2000 to 60.25 million in 2009, and it is estimated that at least another 55 million new users will come online in the next five years. If mobile internet users are included, that figure soars even further to 150 million.

The content problem is one of both quantity and quality. There is a lack of high-quality, well-structured websites managed by companies creating digital content for Arabic-speaking users. For example, if you search in English for a specific mobile phone model, you will land on a specialized portal with specifications, reviews and photos. In Arabic, you will probably end up in a forum where a question is being asked about that phone. It is unlikely in Arabic searches that the first page of results would not have a forum. There is a regional need for real local content, and generally users in the region prefer Arabic today.

However, while Arabic content may have had a growth spurt in the past year, the content that has grown is still primarily user-generated and often machine translated. There is still a lack of original, localized, high-quality content.

3.2 Creating Arabic content online

There are many Arabic digital content enrichment initiatives. The United Nations Economic and Social Commission for Western Asia (ESCWA) released a project in 2007 to develop the Arabic digital content industry. Wiki Arabi is a project initiated by King Abdulaziz City for Science and Technology (KACST) within the framework of King Abdullah's Initiative for Arabic Content. The project aims to improve the Arabic content on Wikipedia by promoting the translation of high-quality articles in different subject areas, including Nanotechnology, Biotechnology and Public Health. It aims to translate 2,000 articles within these areas in its first phase.

Major web players are looking to boost Arabic-language content online in a bid to meet demand from a rapidly growing Arab audience [11]. The Arab world has been facing a digital conundrum for the past few years: not enough users online creating content in Arabic, and not enough content in Arabic to push internet penetration [11]. Although there are more than 422 million Arabic speakers worldwide and Arabic is the seventh-most popular language on the web, less than one per cent of all online content is in Arabic and there is just a 17.5 per cent internet penetration across the region's population.

Google has been working on several initiatives to help increase Arabic-language content. It tied up with Wikipedia after observing that the Arabic portal of the online encyclopedia carried 120,000 pages compared with the 2 million pages of its Catalan equivalent. This is despite the disproportionate number of potential Arabic-speaking users, 422 million, compared with 6 million Catalan speakers [11]. About 10 million words have now been translated into Arabic from English on the site and 6 million from Arabic to English [11].

The search giant has also been educating small businesses to build their own websites using Google Sites, or to at least put their business directory information on Google Maps. It has built Ejabat, a user-generated question and answer system, which now has 600,000 questions and 2 million answers from 300,000 registered users [11]. With 20-25 per cent of Mena users in the past year being completely new to the web and a third of them under the age of 18, Google launched the educational video site Ahlan (google.com/intl/ar/ahlanonline) to introduce users to the world of online learning. Within three months there were 1.2 million views of the Ahlan training videos.

US giant internet portal Yahoo, meanwhile, took a big leap into the Arabic content arena in 2009 when it acquired Maktoob, the region's largest community site. Maktoob is currently the 157th biggest site on the internet, according to web information company Alexa's listings (alexa.com). This makes it the 2nd most popular Arabic site, behind Google Saudi Arabia at number 104 and way ahead of the third Arabic site in the world rankings, sports site Koora. Maktoob was founded in 2000 as the world's 1st free Arabic/English email service, but discussion forums quickly became its biggest traffic and content driver, with the women's forum one of the largest. Other popular areas include games, matrimonial, blogs and sports.

3.3 Building Arabic Corpora

Different corpora are available in English. The Reuters collections of news stories are a popular and typical example. The Linguistic Data Consortium (LDC) provides two non-free Arabic corpora, the Arabic

NEWSWIRE and the Arabic Gigaword corpora. Both corpora contain newswire stories.

There is a need for a freely accessible corpus of Arabic. There are no standard or benchmark corpora, so all researchers conduct their research on their own compiled corpora. Arabic is a highly inflectional and derivational language, which makes text mining / Information Retrieval a complex task. In the Arabic text mining research field, there are some published experimental results, but these results came from different datasets, and it is hard to compare classifiers because each study used different datasets for training and testing [15]. Sebastiani stated in [15]: "We have to bear in mind that comparisons are reliable only when based on experiments performed by the same author under carefully controlled conditions".

One of the aims of this paper is to compile representative Arabic corpora that cover different text genres and can be used in the future as a benchmark. Therefore, three different datasets were compiled covering different genres and subject domains.

Corpus sizes for the same topics written in Arabic and in other languages are not the same. In fact, the size of the corpus extracted from the French newspaper “Le Monde” over a period of 4 years is 80 million words [1, 2]. Moreover, the size of the corpus extracted from almost 7 years of Agence France-Presse (AFP) Arabic Newswire, released in 2001 by LDC, is 76 million tokens [1, 2]. This gap between the two sizes is explained by the compact form of Arabic words. Formally speaking, the English word “write” is equivalent to one Arabic word, “كتب”. But the group “He writes”, made up of two words, also corresponds to one Arabic word, “يكتب”. The Arabic equivalent of the sentence “He will write” is the single word “سيكتب”. Moreover, the word “سيكتبه” amounts to the group of words “He will write it”. Another example is the Arabic word (وبنفوذها) and its English equivalent (4 words) “and with her influences”. This makes segmentation of Arabic textual data different from, and more difficult than, that of Latin-script languages. It also explains the gap between the sizes of the two corpora, if we take into consideration the difference in data extraction period [1, 2]. On the other hand, the amount of storage (disk or RAM) required for an Arabic corpus is twice that of an English corpus with the same number of characters, because Arabic characters require 2 bytes to be saved in Unicode format. This implies that feature/keyword reduction for Arabic text is necessary to stay within storage limits.

Corpora Building Steps: Building a corpus involves compiling and labeling text documents. We collect web documents from the internet using the open source offline explorer HTTrack. The process also includes converting the corpus html/xml files into UTF-8 encoding using “Text Encoding Converter” by WebKeySoft. The final step is to strip/remove html/xml tags, as shown in Figure 3. We developed a Java program that strips/removes html/xml tags. The program is publicly available at [10].

Fig. 3: Corpora building steps

Fig. 4: Dictionary size (# of keywords) for each corpus in OSAC

Fig. 5: Number of text documents for each corpus in OSAC

BBC Arabic corpus: We collected the BBC Arabic corpus from the BBC Arabic website bbcarabic.com. The corpus includes 4,763 text documents. Each text document belongs to 1 of 7 categories (Middle East News 2356, World News 1489, Business & Economy 296, Sports 219, International Press 49, Science & Technology 232, Art & Culture 122). The corpus contains 1,860,786 (1.8M) words and 106,733 distinct keywords after stopword removal.

CNN Arabic corpus: We collected the CNN Arabic corpus from the CNN Arabic website cnnarabic.com. The corpus includes 5,070 text documents. Each text document belongs to 1 of 6 categories (Business 836, Entertainments 474, Middle East News 1462, Science & Technology 526, Sports 762, World News 1010). The corpus contains 2,241,348 (2.2M) words and 144,460 distinct keywords after stopword removal.

OSAC corpus: We collected the OSAC Arabic corpus from multiple websites, as presented in Table 10. The corpus includes 22,429 text documents. Each text document belongs to 1 of 10 categories (Economics,

History, Education & Family, Religious and Fatwas, Sports, Health, Astronomy, Law, Stories, Cooking Recipes). The corpus contains about 18,183,511 (18M) words and 449,600 distinct keywords after stopword removal.

All collected corpora were converted to UTF-8 encoding, and html tags were removed. The corpora are publicly available at [10]. OSAC was used by Saad [13] to address the impact of text preprocessing on Arabic text classification.

Table 10: OSAC corpus

Category | # of text docs | Sources
Economic | 3102 | bbcarabic.com - cnnarabic.com - aljazeera.net - khaleej.com - banquecentrale.gov.sy
History | 3233 | تاريخ الحكام www.hukam.net - moqatel.com - التاريخ altareekh.com - تاريخ الاسلام islamichistory.net
Education and family | 3608 | صيد الفوائد saaid.net - نصائح للسعادة الاسرية naseh.net - المربي almurabbi.com
Religious and Fatwas | 3171 | CCA corpus - EASC corpus - moqatel.com - شبكة الفتاوى الشرعية islamic-fatwa.com - صيد الفوائد saaid.net
Sport | 2419 | bbcarabic.com - cnnarabic.com - khaleej.com
Health | 2296 | العيادة الالكترونية dr-ashraf.com - CCA corpus - EASC corpus - W corpus - صحة الطفل kids.jo - العلاج البديل العربي arabaltmed.com
Astronomy | 557 | الفلك العربي arabastronomy.com - الكون نت alkawn.net - بوابة الفلك المغربية bawabatalfalak.com - الفلك – موسوعة النابلسي nabulsi.com - www.alkoon.alnomrosi.net
Law | 944 | القانون الليبي lawoflibya.com - قانون كوم qnoun.com
Stories | 726 | CCA corpus - قصص الاطفال kids.jo - صيد الفوائد saaid.net
Cooking Recipes | 2373 | aklaat.com - fatafeat.com
TOTAL | 22,429 |

4. CONCLUSION

Linguistic research has become heavily reliant on text corpora over the past ten years. Because of the increasing need for an Arabic corpus to represent the Arabic language, and because the attempts to build one in the last few years were not enough to give Arabic a real, representative and reliable corpus, it was necessary to build OSAC to help support various linguistic research on Arabic.

Arabic has a complex morphology. The lack of well-structured, high-quality Arabic digital content and the lack of freely accessible Arabic corpora are among the major obstacles to Arabic linguistics research. This paper is a step towards tackling these obstacles by collecting the largest freely accessible Arabic corpus, OSAC, which contains about 18M words and about 0.5M distinct keywords.

In future work, we shall extend and elaborate OSAC. Elaborations include performing extensive corpus analysis and tagging the corpora with part-of-speech tags. We also open the door for other researchers and contributors to elaborate the open source corpora.

REFERENCES

[1]. Abbas M., Smaili K., Berkani D.: Comparing TR-Classifier and KNN by using Reduced Sizes of Vocabularies. The 3rd Int. Conf. on Arabic Language Processing, CITALA2009, Mohammadia School of Engineers, Rabat, Morocco. 2009.
[2]. Abdelali, A., Cowie, J., Soliman, H.: Building a modern standard corpus, Workshop on Computational Modeling of Lexical Acquisition. The Split Meeting, Split, 2005.
[3]. Al-Ansary, S., Nagi, M., Adly N.: Building an International Corpus of Arabic (ICA): Progress of Compilation Stage. Bibliotheca Alexandrina. 2008.
[4]. Al-Ansary, S., Nagi, M., Adly N.: Towards analyzing the International Corpus of Arabic: Progress of Morphological Stage. Bibliotheca Alexandrina. 2008.
[5]. Al-Sulaiti L., Atwell E.: Designing and developing a corpus of contemporary Arabic. Int. Journal of Corpus Linguistics. pp. 1-36. 2006.
[6]. Al-Sulaiti L., Atwell E.: Designing and developing a corpus of contemporary Arabic. Proc. of the 6th TALC conference, 2004.
[7]. Arabic diacritics - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Arabic_diacritics
[8]. Arabic language - Wikipedia, the free encyclopedia, http://ar.wikipedia.org/wiki/لغة_عربية
[9]. Khoja S., Garside R.: Stemming Arabic text. Computer Science Department, Lancaster University, Lancaster, UK, 1999.
[10]. Motaz K. Saad: Open Source Arabic Language and Text Mining Tools. 2010. http://sourceforge.net/projects/ar-text-mining
[11]. Locke S., The push for Arabic content, http://www.meed.com/sectors/telecoms-and-it/telecoms/the-push-for-arabic-content-online/3007704.article Issue No 28, 9-15 July 2010.
[12]. Saad M. K., Ashour W., Arabic Text Classification Using Decision Trees, Proceedings of the 12th international workshop on computer science and information technologies CSIT’2010, Moscow – Saint-Petersburg, Russia, 2010.
[13]. Saad M. K., The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification, MSc. Thesis Dissertation, Computer Engineering Dept., Islamic University of Gaza, Palestine, 2010.
[14]. Saad M. K., and Ashour W., Arabic Morphological Tools for Text Mining, 6th ArchEng Int. Symposiums, EEECS’10 the 6th Int. Symposium on Electrical and Electronics Engineering and Computer Science, European University of Lefke, Cyprus, 2010.
[15]. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. 2002.
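
As an illustration of the light stemming idea discussed in Section 2 (removing affixes without reducing a word to its root), the following is a minimal Python sketch. It is not the authors' tool: the function names `strip_diacritics` and `light_stem`, the `min_stem` threshold, and the choice of which affixes to try are assumptions made for this example. The affix lists are only a small, conservative subset of those in Table 4.

```python
import unicodedata

TATWEEL = "\u0640"  # the kasheeda stretching character shown in Fig. 1

# Illustrative subset of the Table 4 affixes (an assumption of this sketch),
# ordered longest first so that longer affixes are tried before shorter ones.
PREFIXES = ["ولل", "وال", "كال", "بال", "لل", "ال"]
SUFFIXES = ["ات", "ان", "ون", "ين", "ها", "ة", "ه", "ي"]


def strip_diacritics(text: str) -> str:
    """Remove tashkil (Unicode combining marks) and tatweel from the text."""
    return "".join(
        ch for ch in text
        if not unicodedata.combining(ch) and ch != TATWEEL
    )


def light_stem(word: str, min_stem: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping >= min_stem letters."""
    word = strip_diacritics(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    # Nominative and accusative/genitive plurals map to the same light stem.
    for w in ["مسافرون", "مسافرين", "الكِتاب"]:
        print(w, "->", light_stem(w))
```

With these lists, both plural forms of (مسافر) from Section 2 reduce to the same string, which is the case-variation behaviour the paper attributes to light stemming. A full light stemmer would need the complete affix sets and more careful rules to avoid over-stemming.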

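
The encoding problem from Section 2 and the last two corpus building steps from Section 3.3 (conversion to UTF-8 and tag stripping) can be sketched with the Python standard library. This is a hedged illustration only: the paper's pipeline used HTTrack, "Text Encoding Converter", and a Java tag stripper [10], none of which is reproduced here. The function name `html_bytes_to_text`, the default source encoding, and the sample page are assumptions of this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_bytes_to_text(raw: bytes, source_encoding: str = "cp1256") -> str:
    """Decode a page from its source encoding and strip html/xml tags."""
    parser = _TextExtractor()
    parser.feed(raw.decode(source_encoding))
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    word = "كتب"
    # Same three letters, different storage cost (compare Table 9).
    print(len(word.encode("utf-8")), len(word.encode("cp1256")))  # 6 3
    # Reading UTF-8 bytes as CP-1256 yields unreadable text.
    print(word.encode("utf-8").decode("cp1256"))

    page = "<html><body><h1>حقوق الانسان</h1><p>كتب</p></body></html>"
    text = html_bytes_to_text(page.encode("cp1256"))
    utf8_bytes = text.encode("utf-8")  # what would be written to the corpus
    print(text, len(utf8_bytes))
```

Decoding with the wrong code page does not raise an error here; it silently produces different characters, which is why downstream preprocessing on mis-decoded text can give incorrect results.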
View publication stats

