Fake Reviews Detection Using Supervised Machine Learning
in the fake reviews detection research: textual and behavioral features. Textual features refer to the verbal characteristics of the review activity; in other words, textual features depend mainly on the content of the reviews. Behavioral features refer to the nonverbal characteristics of the reviews. They depend mainly on the behaviors of the reviewers, such as writing style, emotional expressions, and how frequently the reviewers write reviews. Although tackling textual features is challenging and crucial, behavioral features are also very important and cannot be ignored, as they have a high impact on the performance of the fake review detection process.

Textual features have been used extensively in several fake reviews detection research papers. In [7], the authors used supervised machine learning approaches for fake reviews detection. Five classifiers were used: SVM, Naive Bayes, KNN, K-star, and decision tree. Simulation experiments were run on three versions of a labeled movie reviews dataset [8] consisting of 1400, 2000, and 10662 movie reviews, respectively. Also, in [9], the authors used Naive Bayes, decision tree, SVM, random forest, and maximum entropy classifiers to detect fake reviews on a dataset that they collected themselves: around 10,000 negative tweets related to Samsung products and services. In [10], the authors used both SVM and Naive Bayes classifiers on a dataset consisting of 1600 reviews collected from 20 popular hotels in Chicago. In [11], the authors used neural and discrete models with Average, CNN, RNN, GRNN, Average GRNN, and bi-directional Average GRNN deep learning classifiers to detect deceptive opinion spamming. They used the dataset from [12], which contains truthful and deceptive reviews in three domains, namely hotels, restaurants, and doctors. All of the above research works have considered only textual features, with no effort towards behavioral features.

Other articles have considered behavioral features in the fake reviews detection process. In [13], some behavioral features of Amazon reviews were considered, such as the average rating and the ratio of the number of reviews written by the reviewer. In another work [14], the authors investigated the impact of both textual and behavioral features on the fake review detection process, focusing on the restaurant and hotel domains. Also, in [15], an iterative computation framework plus plus (ICF++) was proposed that integrates textual and behavioral features: it detects fake reviews by measuring the honesty value of a review, the trustiness value of the reviewers, and the reliability value of a product.

From the above discussion, and to the best of our knowledge, no approaches have dived deeply into extracting features that reflect the reviewers' behaviors. Such features highly influence the effectiveness of the fake reviews detection process. In this paper, a machine learning approach to identify fake reviews is presented. In addition to the feature extraction process applied to the reviews, the presented approach applies several feature engineering steps to extract various behaviors of the reviewers, and some new behavioral features are created. The created features are used as inputs to the proposed system, besides the textual features, for the fake reviews detection task.

III. BACKGROUND

Machine learning is one of the most important technological trends and lies behind many critical applications. The main power of machine learning is helping machines to automatically learn and improve themselves from previous experience [16]. There are several types of machine learning algorithms [17], namely supervised, semi-supervised, and unsupervised machine learning. In the supervised approach, both input and output data are provided, and the training data must be labeled and classified [18]. In the unsupervised learning approach, only the data is given, without any classification or labels, and the role of the approach is to find the best-fitting clustering or classification of the input data; thus, in unsupervised learning, all data are unlabeled and the role of the approach is to label them. Finally, in the semi-supervised approach, some data are labeled but most are unlabeled. In this part, we introduce a summary of the supervised learning algorithms, as they are the main focus of this paper.

Several classification algorithms have been developed for supervised machine learning. The main objective of these algorithms is to find a proper model that fits the training data. For example, Support Vector Machines (SVM) is a discriminative classifier that separates the given data into classes by finding the best separating hyperplane that categorizes the given training data [19]. Another common supervised learning algorithm is Naive Bayes (NB). The key idea of NB relies on Bayes' theorem: the probability of event A given event B, expressed as P(A|B) = P(B|A) * P(A) / P(B) [20]. NB calculates a set of probabilities by counting the frequencies and combinations of values in a given dataset. NB has been successfully applied in several application domains such as text classification, spam filtering, and recommendation systems.

The K-Nearest Neighbors (KNN) algorithm [21] is one of the simplest yet most powerful classification algorithms. KNN has been used mostly in statistical estimation and pattern recognition. The key idea behind KNN is to classify a query instance based on the votes of a group of similar, already-classified instances, where the similarity is usually calculated using a distance function [22].

Decision tree [23] is another machine learning classifier that relies on building a tree that represents decisions over the training instances. The algorithm constructs the tree iteratively based on the best possible split among the features. The selection of the best feature relies on a predefined function such as entropy, information gain, gain ratio, or the Gini index. Random Forest [24] is a successful method that handles the overfitting problems that occur in decision trees. The key essence of random forest is to construct a bag of trees from different samples of the dataset; instead of constructing each tree from all features, random forest selects a small random subset of features while constructing each tree in the forest. Logistic regression [25] is another simple supervised machine learning classifier. It relies on finding a hyperplane that classifies the data.
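As a concrete illustration of how these classifiers can be compared, the following sketch (not taken from the paper; scikit-learn and the placeholder data are assumptions) trains the algorithms discussed above and reports a cross-validated F1-score for each. Any vectorized representation of the reviews, such as the TF-IDF features described later, could serve as the feature matrix.

```python
# Illustrative sketch (not the authors' code): comparing the supervised
# classifiers discussed above with scikit-learn on placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.abs(np.random.randn(200, 50))   # stand-in for review feature vectors
y = np.random.randint(0, 2, size=200)  # stand-in for fake (1) / real (0) labels

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "KNN (K=7)": KNeighborsClassifier(n_neighbors=7),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```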
IV. PROPOSED APPROACH

This section explains the details of the proposed approach, shown in Fig. 1. The proposed approach consists of three basic phases in order to get the best model that will be used
for fake reviews detection. These phases are explained in the following:

A. Data Preprocessing

The first step in the proposed approach is data preprocessing [26], one of the essential steps in machine learning approaches. Data preprocessing is a critical activity, as real-world data is rarely in a form appropriate for direct use. A sequence of preprocessing steps has been used in this work to prepare the raw data of the Yelp dataset for computational activities. It can be summarized as follows:

1) Tokenization: Tokenization is one of the most common natural language processing techniques and a basic step before applying any other preprocessing technique. The text is divided into individual words called tokens. For example, the sentence ("wearing helmets is a must for pedal cyclists") is divided into the tokens ("wearing", "helmets", "is", "a", "must", "for", "pedal", "cyclists") [27].

2) Stop Words Cleaning: Stop words [28] are words that are used very frequently yet hold no value. Common examples of stop words are (an, a, the, this). In this paper, all data are cleaned from stop words before going forward in the fake reviews detection process.

3) Lemmatization: The lemmatization method is used to convert a word to its base form, for example converting the plural form to the singular one. It aims to remove inflectional endings only and to return the base or dictionary form of the word, for example converting the word ("plays") to ("play") [29].
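A minimal sketch of these three preprocessing steps follows, assuming NLTK as the implementation library (the paper does not name one); the example sentence is the one used above, and exact resource names may vary between NLTK versions.

```python
# Illustrative preprocessing sketch (library choice is an assumption):
# tokenization, stop-word removal, and lemmatization of a review.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)      # resource names may differ by version
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(review: str):
    tokens = word_tokenize(review.lower())                # 1) tokenization
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens
              if t.isalpha() and t not in stop_words]     # 2) stop-word cleaning
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]      # 3) lemmatization ("plays" -> "play")

print(preprocess("Wearing helmets is a must for pedal cyclists"))
# -> ['wearing', 'helmet', 'must', 'pedal', 'cyclist']
```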
Fig. 1. The Proposed Framework.

B. Feature Extraction

Feature extraction is a step that aims to increase the performance of a pattern recognition or machine learning system. Feature extraction reduces the data to its important features, which feeds machine and deep learning models with more valuable data. It is mainly a procedure of removing unneeded attributes from the data that may actually reduce the accuracy of the model [30].

Several approaches have been developed in the literature to extract features for fake reviews detection. Textual features are one popular approach [31]. They include sentiment classification [32], which depends on getting the percentage of positive and negative words in the review, e.g. "good", "weak". The cosine similarity is also considered: the cosine of the angle between two n-dimensional vectors in an n-dimensional space, computed as the dot product of the two vectors divided by the product of the two vectors' lengths (or magnitudes) [33].
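For illustration, a small sketch of the cosine similarity just described; the toy vectors are placeholders, not features from the paper.

```python
# Minimal sketch of cosine similarity: the dot product of two vectors
# divided by the product of their lengths (magnitudes).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. two small term-count vectors for two reviews
print(cosine_similarity(np.array([1, 2, 0]), np.array([2, 1, 1])))  # ~0.73
```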
TF-IDF is another textual feature method; it combines the term frequency (TF) and the inverse document frequency (IDF). Each word has a respective TF and IDF score, and the product of the TF and IDF scores of a term is called the TF-IDF weight of that term [34]. A confusion matrix is used to classify the predictions into four outcomes: True Negative (TN), a real review classified as real; True Positive (TP), a fake review classified as fake; False Positive (FP), a real review classified as fake; and False Negative (FN), a fake review classified as real.

Second, there are user personal profile and behavioral features. These features are used to identify spammers in two ways: by checking whether the time stamps of a user's comments are more frequent and distinctive than those of other normal users, or whether the user posts redundant reviews that have no relation to the target domain.

In this paper, we apply TF-IDF to extract features from the review contents using two language models, namely bi-gram and tri-gram. In both language models, we also use the extended dataset obtained after extracting the features that represent the users' behaviors.
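A hedged sketch of how the TF-IDF bi-gram and tri-gram language models could be built with scikit-learn's TfidfVectorizer follows; the interpretation of "bi-gram" as the n-gram range (1, 2), the vocabulary cap, and the example texts are assumptions, not the paper's configuration.

```python
# Illustrative TF-IDF extraction for the bi-gram and tri-gram language models.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great hotel clean rooms friendly staff",
    "worst stay ever dirty room rude staff",
]  # stand-in for the preprocessed Yelp review texts

# Bi-gram model: unigrams + bigrams (assumed reading of "bi-gram").
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_bigram = bigram_vectorizer.fit_transform(reviews)

# Tri-gram model: unigrams + bigrams + trigrams.
trigram_vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
X_trigram = trigram_vectorizer.fit_transform(reviews)

print(X_bigram.shape, X_trigram.shape)
```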
C. Feature Engineering

Fake reviews are known to have other descriptive features [35] related to the behaviors of the reviewers while writing their reviews. In this paper, we consider some of these features and their impact on the performance of the fake reviews detection process. We consider the caps-count, punct-count, and emojis behavioral features: caps-count is the total number of capital characters a reviewer uses when writing the review, punct-count is the total number of punctuation marks found in each review, and emojis counts the total number of emojis in each review. We have also applied statistical analysis to the reviewers' behaviors using a "groupby" function that counts the number of fake or real reviews written by each reviewer on a certain date and for each hotel. All these features are taken into consideration to see the effect of the users' behaviors on the performance of the classifiers.
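The behavioral features described above could be engineered along the following lines; this is an illustrative pandas sketch, and the column names (review_text, reviewer_id, date, hotel_id, label) are hypothetical, not the paper's schema.

```python
# Illustrative feature engineering: caps-count, punct-count, emojis, and
# per-reviewer/date/hotel review counts via groupby. Column names are assumed.
import re
import string
import pandas as pd

EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji range

df = pd.DataFrame({
    "reviewer_id": ["u1", "u1", "u2"],
    "hotel_id": ["h1", "h2", "h1"],
    "date": ["2021-01-05", "2021-01-05", "2021-01-06"],
    "review_text": ["GREAT stay!!! :)", "Nice pool.", "AWFUL, never again!! \U0001F620"],
    "label": [1, 0, 1],  # 1 = fake, 0 = real (stand-in labels)
})

df["caps_count"] = df["review_text"].apply(lambda t: sum(c.isupper() for c in t))
df["punct_count"] = df["review_text"].apply(lambda t: sum(c in string.punctuation for c in t))
df["emojis"] = df["review_text"].apply(lambda t: len(EMOJI_RE.findall(t)))

# Number of fake/real reviews written by each reviewer on a given date and hotel.
activity = (df.groupby(["reviewer_id", "date", "hotel_id", "label"])
              .size()
              .rename("reviews_count")
              .reset_index())

print(df[["caps_count", "punct_count", "emojis"]])
print(activity)
```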
V. EXPERIMENTAL RESULTS

We evaluated our proposed system on the Yelp dataset [5]. This dataset includes 5853 reviews of 201 hotels in Chicago written by 38,063 reviewers. The reviews are divided into 4,709 reviews labeled as real and 1,144 reviews labeled as fake; Yelp itself has classified the reviews into genuine and fake. Each review instance in the dataset contains the review date, review ID, reviewer ID, product ID, review label, and star rating. The statistics of the dataset are summarized in Table I.
The maximum review length in the data is 875 words, the minimum review length is 4 words, the average length of all the reviews is 439.5 words, the total number of tokens in the data is 103052, and the number of unique words is 102739.

TABLE I. SUMMARY OF THE DATASET

    Total number of reviews        5853 reviews
    Number of fake reviews         1144 reviews
    Number of real reviews         4709 reviews
    Number of distinct words       102739 words
    Total number of tokens         103052 tokens
    The maximum review length      875 words
    The minimum review length      4 words
    The average review length      439.5 words

TABLE II. ACCURACY OF BI-GRAM AND TRI-GRAM IN THE ABSENCE OF EXTRACTED BEHAVIORAL FEATURES

    Classification Algorithm    Accuracy (Bi-gram)    Accuracy (Tri-gram)    Average Accuracy
    Logistic Regression         87.87%                87.87%                 87.87%
    Naive Bayes                 86.76%                87.30%                 87.03%
    KNN (K=7)                   86.34%                87.87%                 87.82%
    SVM                         87.82%                87.82%                 87.82%
    Random Forest               87.82%                87.82%                 87.82%
indicators when the data is unbalanced. Similar to the previous tables, Table IV reports the recall, precision, and hence the F1-score in the absence of the extracted behavioral features of the users in the two language models. As a trade-off between recall and precision, the F1-score is taken as the evaluation criterion for each classifier. In bi-gram, KNN (K=7) outperforms all other classifiers with an F1-score of 82.40%, whereas in tri-gram, both logistic regression and KNN (K=7) outperform the other classifiers with an F1-score of 82.20%. To evaluate the overall performance of the classifiers in both language models, the average F1-score is calculated. It is found that KNN outperforms all other classifiers with an average F1-score of 82.30%. Fig. 4 depicts the overall performance of all classifiers.

TABLE V. RECALL, PRECISION, AND F1-SCORE IN PRESENCE OF EXTRACTED BEHAVIORAL FEATURES

    Classification Algorithm    Recall (Bi)   Precision (Bi)   F-score (Bi)   Recall (Tri)   Precision (Tri)   F-score (Tri)   Avg F-score
    Logistic Regression         86.90%        75.53%           82%            86.90%         75.53%            80.82%          81.41%
    Naive Bayes                 85.82%        76%              80.38%         86.34%         76.59%            80.64%          80.51%
    KNN (K=7)                   86.56%        80%              81.26%         85.30%         78.50%            86.20%          83.73%
    SVM                         86.90%        75.50%           80.82%         84.90%         75.53%            81.82%          81.32%
    Random Forest               86.85%        75.50%           80.79%         87.90%         74.53%            81.90%          81.34%
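For reference, the recall, precision, and F1-score values reported in these tables follow directly from the confusion-matrix counts (TP, FP, FN, TN) introduced in Section IV. A minimal sketch with placeholder labels, not the paper's predictions:

```python
# Illustrative computation of the reported metrics from a confusion matrix
# (the labels below are placeholders, not the paper's actual results).
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = fake, 0 = real (stand-in labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                        # of reviews flagged fake, how many are fake
recall = tp / (tp + fn)                           # of fake reviews, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# Same result via scikit-learn:
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, pos_label=1, average="binary")
print(p, r, f)
```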
[2] S. Tadelis, "The economics of reputation and feedback systems in e-commerce marketplaces," IEEE Internet Computing, vol. 20, no. 1, pp. 12–19, 2016.
[3] M. J. H. Mughal, "Data mining: Web data mining techniques, tools and algorithms: An overview," Information Retrieval, vol. 9, no. 6, 2018.
[4] C. C. Aggarwal, "Opinion mining and sentiment analysis," in Machine Learning for Text. Springer, 2018, pp. 413–434.
[5] A. Mukherjee, V. Venkataraman, B. Liu, and N. Glance, "What Yelp fake review filter might be doing?" in Seventh International AAAI Conference on Weblogs and Social Media, 2013.
[6] N. Jindal and B. Liu, "Review spam detection," in Proceedings of the 16th International Conference on World Wide Web, ser. WWW '07, 2007.
[7] E. Elmurngi and A. Gherbi, "Detecting fake reviews through sentiment analysis using machine learning techniques," IARIA/DATA ANALYTICS, 2017.
[8] V. Singh, R. Piryani, A. Uddin, and P. Waila, "Sentiment analysis of movie reviews and blog posts," in Advance Computing Conference (IACC), 2013, pp. 893–898.
[9] A. Molla, Y. Biadgie, and K.-A. Sohn, "Detecting negative deceptive opinion from tweets," in International Conference on Mobile and Wireless Technology. Singapore: Springer, 2017.
[10] S. Shojaee et al., "Detecting deceptive reviews using lexical and syntactic features," 2013.
[11] Y. Ren and D. Ji, "Neural networks for deceptive opinion spam detection: An empirical study," Information Sciences, vol. 385, pp. 213–224, 2017.
[12] H. Li et al., "Spotting fake reviews via collective positive-unlabeled learning," 2014.
[13] N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the 2008 International Conference on Web Search and Data Mining, ser. WSDM '08, 2008, pp. 219–230.
[14] D. Zhang, L. Zhou, J. L. Kehoe, and I. Y. Kilic, "What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews," Journal of Management Information Systems, vol. 33, no. 2, pp. 456–481, 2016.
[15] E. D. Wahyuni and A. Djunaidy, "Fake review detection from a product review using modified method of iterative computation framework," 2016.
[16] D. Michie, D. J. Spiegelhalter, C. Taylor et al., "Machine learning," Neural and Statistical Classification, vol. 13, 1994.
[17] T. O. Ayodele, "Types of machine learning algorithms," in New Advances in Machine Learning. InTech, 2010.
[18] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.
[19] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," 1998.
[20] T. R. Patil and S. S. Sherekar, "Performance analysis of Naive Bayes and J48 classification algorithm for data classification," pp. 256–261, 2013.
[21] M.-L. Zhang and Z.-H. Zhou, "ML-KNN: A lazy learning approach to multi-label learning," Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
[22] N. Suguna and K. Thanushkodi, "An improved k-nearest neighbor classification using genetic algorithm," International Journal of Computer Science Issues, vol. 7, no. 2, pp. 18–21, 2010.
[23] M. A. Friedl and C. E. Brodley, "Decision tree classification of land cover from remotely sensed data," Remote Sensing of Environment, vol. 61, no. 3, pp. 399–409, 1997.
[24] A. Liaw, M. Wiener et al., "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[25] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, and M. Klein, Logistic Regression. Springer, 2002.
[26] G. G. Chowdhury, "Natural language processing," Annual Review of Information Science and Technology, vol. 37, no. 1, pp. 51–89, 2003.
[27] J. J. Webster and C. Kit, "Tokenization as the initial phase in NLP," in Proceedings of the 14th Conference on Computational Linguistics, Volume 4. Association for Computational Linguistics, 1992, pp. 1106–1110.
[28] C. Silva and B. Ribeiro, "The importance of stop word removal on recall values in text categorization," in Proceedings of the International Joint Conference on Neural Networks, vol. 3. IEEE, 2003, pp. 1661–1666.
[29] J. Plisson, N. Lavrac, D. Mladenić et al., "A rule based approach to word lemmatization," 2004.
[30] C. Lee and D. A. Landgrebe, "Feature extraction based on decision boundaries," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 4, pp. 388–400, 1993.
[31] N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, 2008.
[32] M. Hu and B. Liu, "Mining and summarizing customer reviews," 2004.
[33] R. Mihalcea, C. Corley, C. Strapparava et al., "Corpus-based and knowledge-based measures of text semantic similarity," in AAAI, vol. 6, 2006, pp. 775–780.
[34] J. Ramos et al., "Using TF-IDF to determine word relevance in document queries," in Proceedings of the First Instructional Conference on Machine Learning, vol. 242, 2003, pp. 133–142.
[35] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Exploiting burstiness in reviews for review spammer detection," in Seventh International AAAI Conference on Weblogs and Social Media, 2013.