Mksaad OSAC OpenSourceArabicCorpora EECS10 Rev9
All content following this page was uploaded by Motaz Saad on 27 May 2014.
Abstract: Arabic linguistics is a promising research field. The acute lack of free, publicly accessible Arabic corpora is one of the major difficulties that Arabic linguistics researchers face. This paper is a step towards supporting the Arabic linguistics research field. It presents the complex nature of the Arabic language and poses the problems of (1) the lack of free public Arabic corpora and (2) the lack of high-quality, well-structured Arabic digital content. Finally, the paper presents OSAC, the largest freely accessible Arabic corpus, which we collected.

Keywords: Arabic Language, Arabic corpora, Arabic Digital contents.

1. INTRODUCTION

Arabic is the 5th most widely used language in the world. It is spoken by more than 422 million people as a first language and by 250 million as a second language [8]. Arabic has 3 forms: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialectal Arabic (DA). CA includes classical historical liturgical text, MSA includes news media and formal speech, and DA includes predominantly spoken vernaculars and has no written standards. The Arabic alphabet consists of the following 28 letters (أ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي) in addition to the Hamza (ء). Unlike English letters, Arabic letters have no upper or lower case. The letters (أ و ي) are vowels, and the rest are consonants. Unlike Latin-based alphabets, Arabic is written from right to left.

The Arabic script has numerous diacritics, including I’jam (إعجام), consonant pointing, and tashkil (تشكيل), supplementary diacritics. The latter include the ḥarakat (حركات, singular haraka حركة), vowel marks. The literal meaning of tashkil is "forming". As normal Arabic text does not provide enough information about the correct pronunciation, the main purpose of tashkil (and ḥarakat) is to provide a phonetic guide or a phonetic aid, i.e. to show the correct pronunciation (doubling a consonant in pronunciation or acting as short vowels). The ḥarakat, which literally means "motions", are the short vowel marks [7]. Arabic diacritics include Fatha, Kasra, Damma, Sukūn, Shadda, and Tanwin. The pronunciations of these diacritics are presented in Table 1. Arabic words may also have Tatweel (kasheeda), as shown in Figure 1.

Table 1: Diacritics

Type                Form    Pronunciation
Double consonant    بّ      /bb/
No vowel            بْ      /b/
Nunation            بٍ      /bin/
Nunation            بٌ      /bun/
Nunation            بً      /ban/
Vowel               بِ      /bi/
Vowel               بُ      /bu/
Vowel               بَ      /ba/

حقوق الانسان
حقـــوق الانســـان
حقــــــوق الانســــــان
حقـــــــــوق الانســـــــــان
Fig. 1: Tatweel (kasheeda)

Arabic words have two genders, masculine (مذكر) and feminine (مؤنث); three numbers, singular (مفرد), dual (مثنى), and plural (جمع); and three grammatical cases, nominative (الرفع), accusative (النصب), and genitive (الجر). A noun has the nominative case when it is the subject (فاعل), the accusative when it is the object of a verb (مفعول), and the genitive when it is the object of a preposition (مجرور بحرف جر). Words are classified into three main parts of speech: nouns (اسماء), including adjectives (صفات) and adverbs (ظروف); verbs (افعال); and particles (ادوات).

Although the Arabic language is widespread, there is an acute lack of well-structured, high-quality Arabic digital content. There is also a lack of free, public Arabic corpora. This paper presents OSAC, open source Arabic corpora that cover different text genres and can be used in the future as a benchmark.

The rest of this paper is organized as follows: Section 2 presents the complexity of the Arabic language; Section 3 discusses the lack of Arabic corpora and Arabic digital content, describes the corpora building steps, and presents the collected corpora; and finally Section 4 draws the conclusion.

2. COMPLEXITY OF ARABIC LANGUAGE

Arabic is a challenging language for a number of reasons:

Orthography (الاملاء) with diacritics is less ambiguous and more phonetic in Arabic; certain combinations of characters can be written in different ways [7].
Arabic has short vowels that give different pronunciations. Grammatically they are required, but they are omitted in written Arabic texts [7].

Arabic has a very complex morphology compared to English [1, 9, 12, 13].

Synonyms are widespread. Arabic is a highly inflectional and derivational language [1, 8, 12, 13].

Lack of publicly and freely accessible Arabic corpora [3, 4, 5, 6, 13].

Lack of Arabic digital content [11, 13].

In the following, we discuss these points in detail.

Word meanings: It is possible to identify several meanings associated with a word, because one word may have more than one meaning in different contexts. Table 2 shows the Arabic word (قلب), which has 3 meanings as a noun.

Table 2: The meaning of word (قلب) as a noun

Word meaning      Sentence
core              في قلب الاحداث
heart             اجرى عملية قلب مفتوح
center, middle    الكرة في قلب الملعب

Variations in lexical category: One word may have more than one lexical category (noun, verb, adjective, etc.) in different contexts, as shown in Table 3. Morphological analysis of a given corpus includes investigating the frequency of a word in each lexical category.

Table 3: The Lexical Category of word (عين)

Word meaning                  Word Category        Sentence
Ain                           Proper-Noun          عين جالوت
wellspring                    Noun                 عين الماء
eye                           Noun                 عين الانسان
delimitate/be delimitate      Verb/passive Verb    عين وزيرا للخارجية

Synonyms: Languages have many words that are considered synonymous. Given a corpus, researchers can use morphological analysis tools to find the synonyms of a word, the frequency of each of those synonyms, and which of them is more common. Examples of synonyms in Arabic are (بذل منح اعطى وهب), which mean (give); (اسرة عائلة), which mean (family); and (فصل صف), which mean (classroom).

The stem consists of a consonantal root (جذر صحيح) and a pattern morpheme (اصغر كلمة ذات معنى). The affixes include inflectional markers (علامات او حركات اعرابية) for tense, gender, and/or number. The clitics include some prepositions (حروف جر), conjunctions (حروف العطف), determiners (محددات), possessive pronouns (ضمائر الملكية) and pronouns (ضمائر). The clitics attached to the beginning of a stem are called proclitics and the ones attached to the end of it are called enclitics. Most Arabic morphemes are defined by three consonants, to which various affixes can be attached to create a word. For example, from the tri-consonantal root "ktb" (كتب), we can inflect (يصرف) several different words concerning the idea of writing, such as (wrote كَتب), (book كِتاب), (the book الكِتاب), (books كُتُب), (he writes يَكتُب), (writer كاتِب), and (library مَكتَبة). Moreover, an Arabic word may correspond to several English words. Another example is the Arabic word (وبنفوذها) and its English equivalent “and with her influences”. This makes segmentation of Arabic textual data different from, and more difficult than, that of Latin-script languages.

Affix sets in Arabic are shown in Table 4, and Arabic patterns (الأوزان) and roots are shown in Table 5. The word (علم) may give various meanings by adding different affixes (prefixes, infixes, or suffixes), as shown in Table 6. Another example of morphological variation, the word (يذهب), which means (go), is presented in Table 7.

Table 4: Affix set in Arabic Language

Affixes in Arabic      Examples
Length 3 prefixes      بال، كال، وال، ولل
Length 2 prefixes      ال، لل
Length 1 prefixes      ل، ب، ف، س، و، ي، ت، ن، ا
Length 3 suffixes      تمل، همل، تان، تين، كمل
Length 2 suffixes      ون، ات، ان، ين، تن، كم، هن، نا، يا، ها، تم، كن، ني، وا، ما، هم
Length 1 suffixes      ة، ه، ي، ك، ت، ا، ن

Table 5: Arabic Patterns and Roots

Arabic patterns (الأوزان) and roots       Examples
Length 4 pattern                          فاعل، فعلة، فعال، مفعل
Length 5 pattern and length 3 roots       تفاعل، افتعل، افعال، فعالة، فعلان، فعولة، تفعلة، تفعيل، مفعلة، مفعول، فاعول، فواعل، مفاعل، مفعيل، افعلة، فعائل، منفعل، مفتعل، فاعلة، انفعل، يفتعل، تفتعل، فعالي، مفاعل، فَالع
Length 5 pattern and length 4 roots       تفعلل، افعلل، مفعلل، فعللة، فعلان، فعالل
Length 6 pattern and length 3 roots       استفعل، مفاعلة، افتعال، افعوعل، انفعال، مستفعل
Length 6 pattern and length 4 roots       افتعلل، افعلال، متفعلل
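To make the role of the diacritics, Tatweel, and the affix sets of Table 4 concrete, the following Python sketch removes tashkil and Tatweel from a word and then strips at most one prefix and one suffix. It is only an illustration: the function names, the particular subset of affixes, and the minimum stem length of three letters are assumptions chosen for the example, and the code is not the tool set released by the authors [10].

```python
import re

# Tashkil marks U+064B..U+0652 (tanwin, fatha, damma, kasra, shadda, sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"  # kasheeda stretching character

# Illustrative subset of the affixes in Table 4, longest first (an assumption).
PREFIXES = ["بال", "كال", "وال", "ولل", "ال", "لل"]
SUFFIXES = ["ون", "ات", "ان", "ين", "ها", "ة", "ه", "ي"]


def normalize(word: str) -> str:
    """Remove short-vowel marks (tashkil) and tatweel from a word."""
    return DIACRITICS.sub("", word).replace(TATWEEL, "")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    word = normalize(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["الكِتاب", "مَكتَبة", "حقـــوق"]:
        print(w, "->", light_stem(w))
```

Note that the sketch only removes affixes; it does not try to reduce a word such as الكِتاب (the book) to its root كتب, which would require the pattern matching summarized in Table 5.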
affixes and does not convert the word to base/root form. This approach is called light stemming [13, 14].

Table 6: Versions of the word (علم) and its meaning when adding affixes

Meaning        Suffix    Infix    Prefix    Word
Scientific     ية        ***      ***       علمية
Taught us      تنا       ***      ***       علمتنا
His science    ه         ***      ***       علمه
Scientists     اء        ***      ***       علماء
Teaching       ***       ي        ت         تعليم
Sciences       ***       و        ***       علوم
Informative    يه        ا        است       استعلامية

Table 7: Morphological variation of word (ذهب)

Verb       Time       # of participants    Gender of subjects
ذهب        Past       1                    Male
ذهبت       Past       1                    Female
ذهبا       Past       2                    Male
ذهبتا      Past       2                    Female
ذهبوا      Past       3 or more            Male
ذهبن       Past       3 or more            Female
يذهب       Present    1                    Male
تذهب       Present    1                    Female
سيذهب      Future     1                    Male
ستذهب      Future     1                    Female
سيذهبوا    Future     3 or more            Male
سيذهبن     Future     3 or more            Female

Table 8: Different meanings of morphological variants of the same root in Arabic

Meaning                                  Root    Word
Class room                               فصل     الفصل الدراسي
Apartheid                                فصل     الفصل العنصري
Goes out of house                        خرج     يخرج من البيت
Graduate from university                 خرج     تخرج من الجامعة
The fisherman twists the cord            جدل     جدل الصياد الحبل
The student argued with the teacher      جدل     جادل الطالب المدرس
He aims the arrow                        صوب     انه يصوب السهم
The man lost his mind                    صوب     فقد الرجل صوابه

Encoding problem: Arabic text has display problems (encoding issues) because it is encoded differently depending on the machine platform. Figure 2 shows the problem of using an incorrect encoding, where all circled cells are displayed correctly while the other cells are displayed incorrectly. Text preprocessing, mining, and information retrieval with an incorrect encoding may lead to incorrect results. Table 9 presents the characteristics of two common Arabic encoding systems: Unicode and code page 1256 (CP-1256, Arabic Windows).

Fig. 2: Arabic Encoding Problem

Table 9: Unicode vs. CP-1256 Arabic Windows encoding

Unicode                                               CP-1256 Arabic Windows
Becoming the standard more and more                   Commonly used
2-byte characters                                     1-byte characters
Widely supported input/display                        Widely supported input/display
Supports extended Arabic characters                   Minimal support for extended Arabic characters
Multi-script representation                           Bi-script support (Roman/Arabic)
Supports presentation forms (shapes and ligatures)    Tri-lingual support: Arabic, French, English (ala ANSI)

3. ARABIC CORPORA

Corpus-based approaches to language have introduced new dimensions to linguistic description and various applications by permitting some degree of automatic analysis of text. The identification, counting and sorting of words, collocations and grammatical structures which occur in a corpus can be carried out quickly and accurately by computer, thus greatly reducing some of the human drudgery sometimes associated with linguistic description and vastly expanding the empirical basis [3, 4]. Linguistic research has become heavily reliant on text corpora over the past ten years. Text data mining is a multidisciplinary field involving information retrieval, text analysis, information extraction, clustering, categorization and linguistics. Text mining is becoming more significant, and efforts to fetch the increasingly available information efficiently have multiplied [3, 4].

Because of the increasing need for an Arabic corpus to represent the Arabic language, and because the attempts to build an Arabic corpus in the last few years were not enough to consider that the Arabic language has a real, representative and reliable corpus, it was necessary to build such an Arabic corpus to support various linguistic research on Arabic [3, 4]. Thus, one of the difficulties that Arabic language researchers encounter is the lack of a publicly available Arabic corpus [3, 4, 5, 6]. The Arabic corpus problem was posed by [3, 4, 5, 6]. A survey by [3, 4] confirms that existing corpora are too narrowly limited in source type and genre, and that there is a need for a freely accessible Corpus of Contemporary Arabic (CCA) covering a broad range of text types. Because the Arabic language lacks corpora, it is difficult to display textual content and quantitative data of Arabic. Al-Ansary et al. [3, 4] discussed three axes in their paper. The 1st axis is a survey of the importance of corpora in language studies, e.g. lexicography, grammar, semantics, Natural Language Processing and other areas. The 2nd axis demonstrates how the Arabic language lacks textual resources, such as corpora and tools for corpus analysis, and the effect of this lack on the quality of Arabic language applications. Successful attempts at compiling Arabic corpora are rare; therefore, the 3rd axis presents the technical design of the International Corpus of Arabic (ICA), a newly established representative corpus of Arabic that is intended to cover the Arabic language as being used all over the Arab world. The corpus is planned to
support various Arabic studies that depend on authentic (اصيلة) data, in addition to building Arabic Natural Language Processing applications.

The International Corpus of Arabic (ICA) is a big project initiated by Bibliotheca Alexandrina (BA). BA is one of the international Egyptian organizations that play a noticeable role in disseminating culture and knowledge, and in supporting scientific research. ICA is a real attempt to build a representative Arabic corpus as being used all over the Arab world to support research on Arabic [3, 4]. The ICA corpus has been analyzed by Al-Ansary et al. in [4]; they shed light on the levels of corpus analysis, e.g. morphological analysis, lexical analysis, syntactic analysis and semantic analysis. Al-Ansary also demonstrates different available tools for Arabic morphological analysis (Xerox, Tim Buckwalter, Sakhr and RDI). The morphological analysis of ICA includes selecting and describing the model of analysis, the pre-analysis stage, and the full text analysis stages. ICA is not publicly available now and it is expected to be released soon [3, 4].

3.1 Arabic Digital Content

Arabic is the fastest-growing language on the internet, with Arabic-speaking internet users increasing 2,298 per cent from 2000-2009, according to the Internet World Statistics Report (internetworldstats.com). The number of internet users in the Middle East and North Africa (Mena) region has leapt from 3.2 million users in 2000 to 60.25 million in 2009 and it is estimated that at least another 55 million new users will come online in the next five years. If mobile internet users are included, that figure soars even further to 150 million.

The content problem is of both quantity and quality. There is a lack of high-quality, well-structured websites managed by companies creating digital content for Arabic-speaking users. For example, if you search in English for a specific mobile phone model, you will land on a specialized portal with specifications, reviews and photos. In Arabic, you will probably end up in a forum where a question is being asked about that phone. It is unlikely in Arabic searches that the first page of results would not have a forum. There is a regional need for real local content and generally users in the region prefer Arabic today.

However, while Arabic content may have had a growth spurt in the past year, the content that has grown is still primarily user-generated and often machine translated. There is still a lack of original, localized, high-quality content.

3.2 Creating Arabic content online

There are many Arabic digital content enrichment initiatives. The United Nations Economic and Social Commission for Western Asia (ESCWA) released a project in 2007 to develop the industry of Arabic digital content. Wiki Arabi is a project initiated by King Abdulaziz City for Science and Technology (KACST) within the framework of King Abdullah's Initiative for Arabic Content. The project aims to improve the Arabic content on Wikipedia by promoting translation of high quality articles in different subject areas, including Nanotechnology, Biotechnology and Public Health. It aims to translate 2,000 articles within these areas in its first phase.

Major web players are looking to boost Arabic-language content online in a bid to meet demand from a rapidly growing Arab audience [11]. The Arab world has been facing a digital conundrum for the past few years: not enough users online creating content in Arabic, and not enough content in Arabic to push internet penetration [11]. Although there are more than 422 million Arabic speakers worldwide and Arabic is the seventh-most popular language on the web, less than one per cent of all online content is in Arabic and there is just a 17.5 per cent internet penetration across the region’s population.

Google has been working on several initiatives to help increase Arabic-language content. It tied up with Wikipedia after observing the Arabic portal of the online encyclopedia carried 120,000 pages compared with the 2 million pages of its Catalan equivalent. This is despite the disproportionate number of potential Arabic-speaking users, 422 million, compared with 6 million Catalan speakers [11]. About 10 million words have now been translated into Arabic from English on the site and 6 million from Arabic to English [11].

The search giant has also been educating small businesses to build their own websites using Google Sites, or to at least put their business directory information on Google Maps. It has built Ejabat, a user-generated question and answer system, which now has 600,000 questions and 2 million answers from 300,000 registered users [11]. With 20-25 per cent of Mena users in the past year being completely new to the web and a third of them under the age of 18, Google launched educational video site Ahlan (google.com/intl/ar/ahlanonline) to introduce users to the world of online learning. Within three months there were 1.2 million views of the Ahlan training videos.

US giant internet portal Yahoo, meanwhile, took a big leap into the Arabic content arena in 2009 when it acquired Maktoob, the region’s largest community site. Maktoob is currently the 157th biggest site on the internet, according to web information company Alexa’s listings (alexa.com). This makes it the 2nd most popular Arabic site behind Google Saudi Arabia at number 104 and way ahead of the third Arabic site in the world rankings, sports site Koora. Maktoob was founded in 2000 as the world’s 1st free Arabic/English email service, but discussion forums quickly became its biggest traffic and content driver, with the women’s forum one of the largest. Other popular areas include games, matrimonial, blogs and sports.

3.3 Building Arabic Corpora

Different corpora are available in English. The Reuters collections of news stories are a popular and typical example. The Linguistic Data Consortium (LDC) provides two non-free Arabic corpora, the Arabic NEWSWIRE and Arabic Gigaword corpus. Both corpora contain newswire stories.
There is a need for a freely accessible corpus of Arabic. There are no standard or benchmark corpora; thus, researchers conduct their research on their own compiled corpora. Arabic is a highly inflectional and derivational language, which makes text mining and information retrieval a complex task. In the Arabic text mining research field, there are some published experimental results, but these results came from different datasets. It is hard to compare classifiers because each study used different datasets for training and testing [15]. Sebastiani stated in [15]: "We have to bear in mind that comparisons are reliable only when based on experiments performed by the same author under carefully controlled conditions".

Fig. 3: Corpora building steps
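The corpora building steps of Fig. 3 (collect pages, convert them to UTF-8, strip the markup) can be sketched in a few lines of Python. This is a stand-in for illustration only: the paper reports using HTTrack, “Text Encoding Converter”, and a Java program [10] for these steps, whereas the function names below, the regular-expression tag removal, and the assumption that the downloaded pages were saved as CP-1256 are all hypothetical. The last two lines also illustrate the 2-byte versus 1-byte storage of Arabic letters listed in Table 9.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)


def to_utf8_text(raw: bytes, source_encoding: str = "cp1256") -> str:
    """Decode bytes saved in a legacy Arabic code page (assumed CP-1256)."""
    return raw.decode(source_encoding)


def strip_tags(html: str) -> str:
    """Drop script/style blocks and tags, unescape entities, collapse whitespace."""
    html = SCRIPT_STYLE.sub(" ", html)
    text = TAG.sub(" ", html)
    return " ".join(unescape(text).split())


# A toy page standing in for a downloaded news article.
page = "<html><body><p>اللغة العربية</p></body></html>".encode("cp1256")
text = strip_tags(to_utf8_text(page))

# Storage cost of the same 13 characters (12 Arabic letters and one space).
utf8_size = len(text.encode("utf-8"))
cp1256_size = len(text.encode("cp1256"))
```

A regular expression is enough for this toy page; real web pages with malformed markup would usually call for a proper HTML parser.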
One of the aims of this paper is to compile representative Arabic corpora that cover different text genres and can be used in the future as a benchmark. Therefore, three different datasets were compiled covering different genres and subject domains.
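The per-category document counts reported later in this section for the BBC Arabic and CNN Arabic corpora, and in Table 10 for the OSAC corpus, can be checked against the stated corpus totals. The Python sketch below simply transcribes those figures; the dictionary and function names are illustrative and are not part of the released corpora.

```python
# Document counts per category, transcribed from Section 3.3 and Table 10.
BBC = {
    "Middle East News": 2356, "World News": 1489, "Business & Economy": 296,
    "Sports": 219, "International Press": 49, "Science & Technology": 232,
    "Art & Culture": 122,
}
CNN = {
    "Business": 836, "Entertainments": 474, "Middle East News": 1462,
    "Science & Technology": 526, "Sports": 762, "World News": 1010,
}
OSAC = {
    "Economic": 3102, "History": 3233, "Education and family": 3608,
    "Religious and Fatwas": 3171, "Sport": 2419, "Health": 2296,
    "Astronomy": 557, "Law": 944, "Stories": 726, "Cooking Recipes": 2373,
}

# Totals as stated in the text.
STATED_TOTALS = {"BBC": 4763, "CNN": 5070, "OSAC": 22429}


def total(counts: dict) -> int:
    """Sum the per-category document counts of one corpus."""
    return sum(counts.values())


for name, counts in (("BBC", BBC), ("CNN", CNN), ("OSAC", OSAC)):
    status = "matches" if total(counts) == STATED_TOTALS[name] else "differs from"
    print(f"{name}: {total(counts)} documents, {status} the stated total")
```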
Corpus sizes for the same topics written in Arabic and in other languages are not the same. In fact, the size of the corpus extracted from the French newspaper “Le Monde” over a period of 4 years is 80 million words [1, 2]. Moreover, the size of the corpus extracted from almost 7 years of the Agence France Presse (AFP) Arabic Newswire, released in 2001 by LDC, is 76 million tokens [1, 2]. This gap between the two sizes is justified by the compact form of Arabic words. Formally speaking, the English word “write” is equivalent to one Arabic word “كتب”. But the group “He writes”, made up of two words, also corresponds to one Arabic word “يكتب”. And the Arabic equivalent of the sentence “He will write” is the single word “سيكتب”. Moreover, the word “سيكتبه” amounts to the group of words “He will write it”. Another example is the Arabic word (وبنفوذها) and its English equivalent (4 words) “and with her influences”. This makes segmentation of Arabic textual data different from, and more difficult than, that of Latin-script languages. This explains the gap between the sizes of the two corpora, if we take into consideration the difference in the data extraction periods [1, 2]. On the other hand, the amount of storage (disk or RAM) required for an Arabic corpus is twice that of an English corpus with the same number of characters, because Arabic characters require 2 bytes to be saved in Unicode format. This implies that feature/keyword reduction for Arabic text is necessary to respect storage limits.

Corpora building steps: Building a corpus involves compiling and labeling text documents into a corpus. We collected web documents from the internet using the open source offline explorer HTTrack. The process also includes converting the corpus html/xml files into UTF-8 encoding using “Text Encoding Converter” by WebKeySoft. The final step is to strip/remove html/xml tags, as shown in Figure 3. We developed a Java program that strips/removes html/xml tags. The program is publicly available at [10].

Fig. 4: Dictionary size (# of keywords) for each corpus in OSAC

Fig. 5: Number of text documents for each corpus in OSAC

BBC Arabic corpus: We collected the BBC Arabic corpus from the BBC Arabic website bbcarabic.com. The corpus includes 4,763 text documents. Each text document belongs to 1 of 7 categories (Middle East News 2356, World News 1489, Business & Economy 296, Sports 219, International Press 49, Science & Technology 232, Art & Culture 122). The corpus contains 1,860,786 (1.8M) words and 106,733 distinct keywords after stopword removal.

CNN Arabic corpus: We collected the CNN Arabic corpus from the CNN Arabic website cnnarabic.com. The corpus includes 5,070 text documents. Each text document belongs to 1 of 6 categories (Business 836, Entertainments 474, Middle East News 1462, Science & Technology 526, Sports 762, World News 1010). The corpus contains 2,241,348 (2.2M) words and 144,460 distinct keywords after stopword removal.

OSAC corpus: We collected the OSAC Arabic corpus from multiple websites, as presented in Table 10. The corpus includes 22,429 text documents. Each text document belongs to 1 of 10 categories (Economics,
History, Entertainments, Education & Family, Religious and Fatwas, Sports, Health, Astronomy, Law, Stories, Cooking Recipes). The corpus contains about 18,183,511 (18M) words and 449,600 distinct keywords after stopword removal.

All collected corpora were converted to UTF-8 encoding, and html tags were removed. The corpora are publicly available at [10]. OSAC was used by Saad [13] to address the impact of text preprocessing on Arabic text classification.

Table 10: OSAC corpus

Category                # of text docs    Sources
Economic                3102              bbcarabic.com - cnnarabic.com - aljazeera.net - khaleej.com - banquecentrale.gov.sy
History                 3233              تاريخ الحكام www.hukam.net - moqatel.com - التاريخ altareekh.com - تاريخ الاسلام islamichistory.net
Education and family    3608              صيد الفوائد saaid.net - نصائح للسعادة الاسرية naseh.net - المربي almurabbi.com
Religious and Fatwas    3171              CCA corpus - EASC corpus - moqatel.com - شبكة الفتاوى الشرعية islamic-fatwa.com - صيد الفوائد saaid.net
Sport                   2419              bbcarabic.com - cnnarabic.com - khaleej.com
Health                  2296              العيادة الالكترونية dr-ashraf.com - CCA corpus - EASC corpus - W corpus - صحة الطفل kids.jo - العلاج البديل العربي arabaltmed.com
Astronomy               557               الفلك العربي arabastronomy.com - الكون نت alkawn.net - بوابة الفلك المغربية bawabatalfalak.com - الفلك – موسوعة النابلسي nabulsi.com - www.alkoon.alnomrosi.net
Law                     944               القانون الليبي lawoflibya.com - قانون كوم qnoun.com
Stories                 726               CCA corpus - قصص الاطفال kids.jo - صيد الفوائد saaid.net
Cooking Recipes         2373              aklaat.com - fatafeat.com
TOTAL                   22,429

4. CONCLUSION

Linguistic research has become heavily reliant on text corpora over the past ten years. Because of the increasing need for an Arabic corpus to represent the Arabic language, and because the attempts to build an Arabic corpus in the last few years were not enough to consider that the Arabic language has a real, representative and reliable corpus, it was necessary to build OSAC to contribute to supporting various linguistic research on Arabic.

The Arabic language has a complex morphology. The lack of well-structured, high-quality Arabic digital content and the lack of freely accessible Arabic corpora were among the major obstacles to the Arabic linguistics research field. This paper is a step towards tackling these obstacles by collecting the largest freely accessible Arabic corpus, OSAC, which contains about 18M words and about 0.5M distinct keywords.

In future work, we shall work on extending and elaborating OSAC. Elaborations include performing extensive corpus analysis and tagging the corpora with part-of-speech tags. We also open the door for other researchers and contributors to elaborate the open source corpora.

REFERENCES

[1]. Abbas M., Smaili K., Berkani D.: Comparing TR-Classifier and KNN by using Reduced Sizes of Vocabularies. The 3rd Int. Conf. on Arabic Language Processing, CITALA2009, Mohammadia School of Engineers, Rabat, Morocco. 2009.
[2]. Abdelali, A., Cowie, J., Soliman, H.: Building a modern standard corpus, Workshop on Computational Modeling of Lexical Acquisition. The Split Meeting, Split, 2005.
[3]. Al-Ansary, S., Nagi, M., Adly N.: Building an International Corpus of Arabic (ICA): Progress of Compilation Stage. Bibliotheca Alexandrina. 2008.
[4]. Al-Ansary, S., Nagi, M., Adly N.: Towards analyzing the International Corpus of Arabic: Progress of Morphological Stage. Bibliotheca Alexandrina. 2008.
[5]. Al-Sulaiti L., Atwell E.: Designing and developing a corpus of contemporary Arabic. Int. Journal of Corpus Linguistics. pp. 1-36. 2006.
[6]. Al-Sulaiti L., Atwell E.: Designing and developing a corpus of contemporary Arabic. Proc. of the 6th TALC conference, 2004.
[7]. Arabic diacritics - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Arabic_diacritics
[8]. Arabic language - Wikipedia, the free encyclopedia, http://ar.wikipedia.org/wiki/لغة_عربية
[9]. Khoja S., Garside R.: Stemming Arabic text. Computer Science Department, Lancaster University, Lancaster, UK, 1999.
[10]. Motaz K. Saad: Open Source Arabic Language and Text Mining Tools. 2010. http://sourceforge.net/projects/ar-text-mining
[11]. Locke S., The push for Arabic content, http://www.meed.com/sectors/telecoms-and-it/telecoms/the-push-for-arabic-content-online/3007704.article Issue No 28 9-15 July 2010.
[12]. Saad M. K., Ashour W., Arabic Text Classification Using Decision Trees, Proceedings of the 12th international workshop on computer science and information technologies CSIT’2010, Moscow – Saint-Petersburg, Russia, 2010.
[13]. Saad M. K., The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification, MSc. Thesis Dissertation, Computer Engineering Dept., Islamic University of Gaza, Palestine, 2010.
[14]. Saad M. K., and Ashour W., Arabic Morphological Tools for Text Mining, 6th ArchEng Int. Symposiums, EEECS’10 the 6th Int. Symposium on Electrical and Electronics Engineering and Computer Science, European University of Lefke, Cyprus, 2010.
[15]. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. 2002.
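The two statistics reported above for each corpus, the total number of words and the number of distinct keywords after stopword removal, can be computed with a short script. The following Python sketch is only an illustration: the paper does not state how tokens were delimited or which stopword list was used, so the whitespace tokenization, the three-word stopword set, and the name corpus_stats are all assumptions.

```python
from collections import Counter

# Tiny illustrative stopword set; the list actually used for OSAC is not given.
STOPWORDS = {"في", "من", "على"}


def corpus_stats(documents, stopwords=STOPWORDS):
    """Return (total word count, number of distinct keywords after stopword removal)."""
    total_words = 0
    keywords = Counter()
    for text in documents:
        tokens = text.split()  # assumption: whitespace-delimited tokens
        total_words += len(tokens)
        keywords.update(t for t in tokens if t not in stopwords)
    return total_words, len(keywords)


# Two sentences from Table 2 used as a toy corpus.
docs = ["الكرة في قلب الملعب", "في قلب الاحداث"]
words, distinct = corpus_stats(docs)
print(f"{words} words, {distinct} distinct keywords")
```

Applying a light stemmer before counting would merge inflected forms of the same word and therefore lower the number of distinct keywords.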