Synthesis on ‘Tunisian Dialect Sentiment
Analysis: A Natural Language Processing-based
Approach’
In this paper, the authors improve on previous work on Tunisian dialect sentiment
analysis (SA) by introducing named entity tagging as a preprocessing task and
investigating its impact on sentiment classification performance when it is
combined with other preprocessing tasks.
These preprocessing tasks include removing non-sentimental content such as URLs,
usernames, dates, digits, hashtag symbols, and punctuation, as well as applying
both light stemming and the Farasa stemmer, the latter because it yields fewer
segmentation errors than other existing Arabic stemmers. Moreover, the authors used
a list of 1661 Modern Standard Arabic (MSA) stopwords provided by the NLP group at
KACST. Common emoji were taken into account by replacing each one with its
corresponding label. Negation was detected from negative words such as
مفملش، ماكمش، مانيش، ماهمش، ال، ليس, which were replaced by the tag ‘NegWord’.
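The cleaning and negation-tagging steps described above can be sketched as follows.
This is only a minimal illustration, not the authors' implementation: the function
name preprocess, the regular expressions, and the short negation list (a subset of
the words quoted above) are assumptions, and stemming, stopword removal, date
handling and emoji labelling are left out.

```python
import re

# Partial negation list copied from the summary above; not exhaustive.
NEGATION_WORDS = {"مانيش", "ماهمش", "ماكمش", "ليس"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
# Everything that is neither a word character nor whitespace.
# Note: this also matches emoji and Arabic diacritics.
PUNCT_RE = re.compile(r"[^\w\s]")


def preprocess(text: str) -> list[str]:
    """Strip non-sentimental content and tag negation words."""
    text = URL_RE.sub(" ", text)      # URLs
    text = USER_RE.sub(" ", text)     # @usernames
    text = text.replace("#", " ")     # keep the hashtag word, drop the symbol
    text = DIGIT_RE.sub(" ", text)    # digits
    text = PUNCT_RE.sub(" ", text)    # punctuation
    tokens = text.split()
    return ["NegWord" if tok in NEGATION_WORDS else tok for tok in tokens]


if __name__ == "__main__":
    print(preprocess("@user مانيش فرحان 2019 http://t.co/abc #تونس!"))
```

Replacing emoji with labels would have to happen before the punctuation step,
since the punctuation pattern used here would otherwise delete them.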
After preprocessing the data, three N-gram schemes (unigrams, bigrams and
trigrams) were adopted, as they can capture information about the local word order
and reduce the training time consumed by supervised methods. In addition, the term
frequency property was used to reduce the feature size according to predefined
frequency thresholds. For the lexicon-based model, unigrams and a combination of
unigrams and bigrams were used in order to cover the single and compound phrases
of the lexicon.
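To make the feature extraction concrete, the sketch below builds unigram, bigram
and trigram features and keeps only those that reach a frequency threshold. The
names ngrams and build_vocabulary and the threshold value min_freq=2 are
illustrative assumptions; the summary does not state the thresholds actually used.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, max_n=3, min_freq=2):
    """Collect all 1..max_n-grams and keep those seen at least min_freq times."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return {gram for gram, freq in counts.items() if freq >= min_freq}


if __name__ == "__main__":
    docs = [["a", "b", "c"], ["a", "b", "d"]]
    print(sorted(build_vocabulary(docs)))
```

Raising min_freq shrinks the feature set, which is the size reduction the
paragraph above refers to.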
Named entities (NEs) were extracted using the NER system from ‘Character-Aware
Neural Networks for Arabic Named Entity Recognition for Social Media’. The
extracted named entities were then classified as positive or negative so that they
could be tagged in the preprocessing step.
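One possible reading of this tagging step is sketched below. It assumes the NER
output and polarity classification are already available as a dictionary from
entity token to polarity, and that each entity is replaced by a tag. The tag names
PosNE and NegNE, the function name, and the restriction to single-token entities
are all hypothetical; the summary does not say how the tags are written.

```python
def tag_named_entities(tokens, entity_polarity):
    """Replace known named entities with a polarity tag (hypothetical tag names)."""
    tags = {"positive": "PosNE", "negative": "NegNE"}
    tagged = []
    for tok in tokens:
        polarity = entity_polarity.get(tok)
        # Tokens that are not known entities are kept unchanged.
        tagged.append(tags.get(polarity, tok))
    return tagged


if __name__ == "__main__":
    # Toy entity names, chosen only for illustration.
    polarity = {"EntityA": "positive", "EntityB": "negative"}
    print(tag_named_entities(["EntityA", "beat", "EntityB"], polarity))
```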
As for the supervised model, Naive Bayes (NB) and Support Vector Machines (SVM)
were used, the latter because it handles high-dimensional feature vectors
effectively. For the lexicon-based model, a straightforward sum method (‘Polarity
analysis of non figurative tweets: Tw-StAR participation on DEFT 2017’) was used to
determine the polarity of a tweet. In addition, a manually built Tunisian sentiment
lexicon of 5382 entries was used with non-stemmed data, while for stemmed or
light-stemmed input data this lexicon was extended to include the
stemmed/light-stemmed variations of its words and phrases, increasing the lexicon
size to 14345 single and compound entries.
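A sum-based lexicon classifier of the kind mentioned above might look like the
following sketch. It assumes a lexicon mapping unigrams and bigrams to numeric
polarity scores and labels a zero total as neutral. The scoring scheme, the
function name lexicon_polarity and the tie handling are assumptions, not details
taken from the cited paper.

```python
def lexicon_polarity(tokens, lexicon):
    """Sum the lexicon scores of all matching unigrams and bigrams."""
    score = 0
    for n in (1, 2):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            score += lexicon.get(phrase, 0)  # unknown phrases contribute 0
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    toy_lexicon = {"good": 1, "bad": -1, "not good": -2}
    print(lexicon_polarity(["not", "good"], toy_lexicon))
```

In this sketch a word inside a matching compound entry is counted twice, once on
its own and once as part of the bigram; a fuller implementation would need to
decide whether compound matches should override single-word matches.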
During the experiments, three publicly available datasets with content harvested
from Tunisian and MSA social media platforms were used in reduced form: the
Tunisian election corpus (3043), the Tunisian sentiment analysis corpus (TSAC,
7366) and the Tunisian Arabic corpus (746). The experiments were run with various
combinations of preprocessing tasks and compared against the systems of (‘Tunisian
dialect and MSA datasets for sentiment analysis’), (‘Sentiment analysis of Tunisian
dialect: Linguistic resources and experiments’) and (‘Tunisian Arabic customer’s
reviews processing and analysis for an internet supervision system’).
According to the results, SVM consistently outperforms NB on large datasets such
as TSAC, whereas NB performs better on medium and small datasets. The
lexicon-based results listed in Tables 3, 5 and 7 of the paper emphasize the role
of NEs in improving SA performance. In addition, combining NEs with negation and
stemming yielded the best performance.
Some references:
Bootstrapping Sentiment Labels For
Unannotated Documents With
Polarity PageRan
1) Arabic dialect identification using a parallel multi-dialectal corpus by
Malmasi
2) Tw-StAR at SemEval-2017 task 4: sentiment classification of Arabic tweets
by Mulki
3) Data and text mining techniques for classification Arabic tweet polarity by
Brahimi
4) Improving stemming for Arabic information retrieval: light stemming and co-
occurrence analysis by Larkey
5) A neural architecture for dialectal Arabic segmentation by Samih