This paper compares Arabic and English speech rhythms to increase awareness of this neglected and often misunderstood topic in foreign language acquisition. Unlike previous studies, we adopt a phonological view of speech rhythm rather than an isochrony-based phonetic view. We detail the components of speech rhythm at the word and utterance levels in Arabic and English, focusing on the rhythmical differences that would affect the learners’ rhythm of both languages negatively. Findings suggest that Modern Standard Arabic (MSA) and Jordanian-Ammani Arabic (JAA), unlike English, should be placed at the lower end of the rhythmic continuum. The study opens new directions for future research and concludes with pedagogical implications for learners of Arabic and English.
2021 International Conference on Asian Language Processing (IALP)
The distributed representation of words, as in Word2Vec, FastText, and GloVe, results in the production of a single vector for each word type regardless of the polysemy or homonymy that many words may have. Context-sensitive representation as implemented in deep learning neural networks, on the other hand, produces different vectors for the multiple senses of a word. Several contextualized word embeddings have been produced for the Arabic language (e.g., AraBERT, QARiB, and AraGPT). The majority of these were tested on a few NLP tasks, but there was no direct comparison between them. As a result, we do not know which of these is most efficient and for which tasks. This paper is a first step in an endeavor to establish evaluation criteria for them. It describes 24 such embeddings, then conducts exploratory intrinsic and extrinsic evaluation of them. Afterwards, it tests relational knowledge in them, covering four semantic relations: colors of fruits, capitals of countries, causation, and general information. It also evaluates the utility of these models in Named Entity Recognition and Sentiment Analysis tasks. AraBERT-v02 and MARBERT are shown to be the best on both types of evaluation; therefore, both are recommended for fine-tuning on Arabic NLP tasks. The ultimate conclusion is that it is feasible to test higher-order reasoning relations in these embeddings.
Using a mixed-method approach, this study examines the pragmatic functions of the discourse marker walak and its variants in Spoken Jordanian Arabic. It also explores the differences in the use of this discourse marker according to the speakers’ gender. The data was collected from a sample of 200 native speakers of Jordanian Arabic, using informal interviews and a validation questionnaire. The results showed that walak and its variants perform six language functions: warning, insulting, addressing/vocative, endearment, threatening, and denial. As far as gender differences are concerned, the findings indicated that there were statistically significant differences between males and females in the use of walak and its variants in favour of males. This indicates that males agreed more with the sentences expressing each pragmatic function in the validation questionnaire. The study concludes with some pedagogical implications for learners of Arabic as a second language, teachers and syll...
The ultimate purpose of teaching a foreign language is to enable learners to understand others and to make themselves understood. Foreign language teaching is primarily about enabling learners to be communicatively competent at both receptive and productive levels. To accomplish this, teachers should seek to teach all language components in natural settings. More often than not, foreign language teaching not only takes place in unnatural contexts, but it has also become compartmentalized, with each language skill taught separately. This article supports an integrative approach to language teaching where language arts can be taught while teaching literary works, including short stories, novels, poetry, and drama. This means that appreciating literature and developing language skills should go hand in hand. Therefore, this article advocates a content-based approach to language teaching where learners receive more attention, and literature is the content around which language activitie...
Modelling the distributional semantics of such a morphologically rich language as Arabic needs to take into account its introflexive, fusional, and inflectional nature, attributes that make up its combinatorial sequences and substitutional paradigms. To evaluate such word distributional models, the benchmarks that have been used thus far in Arabic have mimicked those in English. This paper reports on a benchmark that we designed to reflect linguistic patterns in both Contemporary Arabic and Classical Arabic, the first being a cover term for written and spoken Modern Standard Arabic, and the second for pre-modern Arabic. The analogy items we included in this benchmark are chosen in a transparent manner such that they would capture the major features of nouns and verbs; derivational and inflectional morphology; high-, middle-, and low-frequency patterns and lexical items; and morphosemantic, morphosyntactic, and semantic dimensions of the language. All categories included in this ben...
Several tools are developed to facilitate the quantitative analysis of interpretation style, a matter that has hitherto been discussed only in vague terms. These tools allow the investigation of questions such as: How does an interpreter divide up a source language input, to what extent does he mirror a source language speaker, and to what degree does he practise reformulation? Furthermore, an adaptive monitoring instrument is devised to facilitate the graphic representation of the linear developments of a source language discourse and its simultaneous interpretation equivalent. Not only does it allow the assessment of convergence and divergence between the two discourses, but it also permits commenting on an interpreter's tempo by characterising the narrow and broad periodicity within his discourse, and on his composure and tribulation by describing his consistency and fluency in the discourse.
It is becoming increasingly difficult to know who is working on what and how in computational studies of Dialectal Arabic. This study charts the field by conducting a systematic literature review that is intended to give insight into the most and least popular research areas, dialects, machine learning approaches, neural network input features, data types, datasets, system evaluation criteria, publication venues, and publication trends. The review is guided by the norms of systematic reviews. It has taken account of all the research that adopted a computational approach to dialectal Arabic identification and detection and that was published between 2000 and 2020. It collected, analyzed, and collated this research, discovered its trends, and identified research gaps. It revealed, inter alia, that our research effort has not been directed evenly between speech and text or between the vernaculars; there is some bias favoring text over speech, regional varieties over in...
This paper presents ongoing research that aims to construct a sizable and reliable text corpus along with a set of tools to experiment with natural language applications for Arabic. The corpus is used by graduate students at the University of Jordan (UJ) to conduct experiments on many useful applications. Earlier, we were not able to verify these experiments because of the lack of reliable data. We are working on annotating and tagging the corpus texts and making the corpus available for researchers in XML format.
This study reports on the construction of a one-million-word English-Arabic Political Parallel Corpus (EAPPC), which will be a useful resource for research in translation studies, language learning and teaching, bilingual lexicography, contrastive studies, political science studies and cross-language information retrieval. It describes the phases of corpus compilation and explores the corpus, by way of illustration, to discover the translation strategies used in rendering the Arabic and Islamic culture-specific terms takfīr and takfīrī from Arabic into English and from English into Arabic. The Corpus consists of 351 English and Arabic original documents and their translations. A total of 189 speeches, 80 interviews and 68 letters, translated by anonymous translators in the Royal Hashemite Court, were selected and culled from King Abdullah II's official website, in addition to the textual material of the English and Arabic versions of His Majesty's book, Our Last Best Chance:...
In this paper, we present a set of corpus linguistic tools for conducting historical semantic research in the Arabic language. We compiled a Historical Arabic Corpus (HAC) that spans more than 1500 years of continuous language use. With techniques from the field of Natural Language Processing (NLP), the tools presented here have been used to create the HAC and to explore lexical semantic change. The development of these tools is aimed at offering a catalyst to the ambitious goal of compiling an Arabic dictionary on historical principles. HAC and the tools can also be used for conducting research in a variety of areas of linguistics.
It is impossible to perform root-based searching, concordancing, and grammar checking in Arabic without a method to match words with roots and vice versa. A comprehensive word list is essential for incremental searching, predictive SMS messaging, and spell checking, but due to the derivational and inflectional nature of Arabic, a comprehensive word list is taxing on storage space and access speed. This paper describes a method for compactly storing and efficiently accessing an extensive dictionary of Arabic words by their morphological properties and roots. Compression of the dictionary is based on T-Code encoding, which follows the Huffman encoding model. The special characteristics inherent in the recursive augmentation method with which codes are created allow compact storage on disk and in memory. They also facilitate the efficient use of bandwidth for Arabic text transmission over intranets and the Internet.
Papers by Sane Yagi