Published by: http://www.ijert.org
International Journal of Engineering Research & Technology (IJERT)
ISSN: 2278-0181
Vol. 9 Issue 01, January-2020
IJERTV9IS010183

Fake News Identification on Social Media

Moin Khan, Department of Computer Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India.
Amisha Jain, Department of Computer Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India.
Rishi Chouhan, Department of Computer Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India.
Sakeeb H. Sheikh, Department of Computer Engineering, Sinhgad College of Engineering, Savitribai Phule Pune University, Pune, India.

Abstract: Fake news has a great impact on society. It not only distorts people's perception but also undermines the traditional news ecosystem, which rests on the pillars of truth and reality. Considering that this problem affects the public worldwide, we propose an application that can identify false information circulated through social media. The goal of our system is to identify fake news by comparing it with the existing facts and data available in our datasets. The text given by the user as input to the system is classified as either fake or real, and the respective tag is attached in the output. The proposed model makes it possible to identify fake and misleading information and thus retain the trust of the public, protecting society from the negative impacts of fake news.

Keywords: Word embeddings, sentence semantics, fake message, word2vec, doc2vec.

I. INTRODUCTION

In the past decade, social media has become increasingly popular for news consumption due to its easy access, fast propagation, and low cost. However, social media also enables the wide spread of "fake news," i.e., news with purposely false information. Fake news on social media can have significant negative effects on society. Therefore, fake news detection on social media has recently become a promising research area that is attracting great attention.

The spread of fake news through message sharing applications and social media has been increasing on a tremendous scale. During natural calamities, news spreads without any verification, which erodes the trust of ordinary people, who tend to believe all the content forwarded to them. Social media has become one of the main channels for the spread of fake news. This extensive propagation of information may have a negative impact on individuals as well as on society, breaking the authenticity balance of the news ecosystem. It not only changes the way news is interpreted but also intentionally persuades consumers to accept biased or false beliefs. Some news is intentionally created to trigger people's distrust and confuse them, impeding their ability to differentiate what is true from what is false. The propagation of misleading data in everyday media outlets such as social media feeds, news blogs, and online newspapers has made it difficult to identify trustworthy news sources, increasing the need for computational tools that can provide insights into the reliability of online content.

In this paper, the focus is on the automatic identification of fake content in online news. First, we populate the database with data from various online sources, e.g., timesofindia.com and news.google.co.in, for the task of fake news detection, covering the political domain. If the relevant information is not found, the content is checked on a search engine and the top five to seven results are taken into consideration.
Secondly, we perform a set of learning experiments to build accurate fake news detectors, in which different NLP operations standardize the message and transform it into vector form. A score is generated by comparing the transformed message vector with the vectors stored in the database. Based on this score, a 'fake' or 'real' tag is attached to the message and shown along with it.

www.ijert.org (This work is licensed under a Creative Commons Attribution 4.0 International License.)

II. LITERATURE SURVEY

In this section, we discuss past research on fake message identification, along with its benefits, limitations, and the technologies used.

Table 1: Summary of Literature Survey

SR. No. 01
TITLE: Design Exploration of Fake News: A Transdisciplinary Methodological Approach to Understanding Content Sharing and Trust on Social Media [1]
PUBLISHER: Jaigris Hodson, Royal Roads University; Brian Traynor, Mount Royal University (2018 IEEE International Professional Communication Conference)
DESCRIPTION: The authors recommend a transdisciplinary shift in fake news research that takes into account algorithmic approaches, psychometric data, and qualitative explorations of user behaviour.
BENEFITS: 1) Provides fresh approaches to identifying fake news through both algorithmic and user experience research. 2) Outlines how diverse tools and methods may need to be employed together, in a transdisciplinary approach to understanding user habits, in order to fully address issues of trust, news sharing, and the spread of misinformation.
LIMITATIONS: The approach requires both labelled datasets and communities of experts to help train applications to recognize and categorize content, so work is still needed to understand how human judgement of fake news takes place.

SR. No. 02
TITLE: Fake and Spam Messages: Detecting Misinformation during Natural Disasters on Social Media [2]
PUBLISHER: Meet Rajdev and Kyumin Lee, Department of Computer Science, Utah State University (2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology)
DESCRIPTION: 1) The authors conducted a case study of the 2013 Moore Tornado and Hurricane Sandy. 2) Experimental results showed that the proposed approaches classify spam and fake messages with 96.43% accuracy and a 0.961 F-measure.
BENEFITS: Classifies spam and fake messages with 96.43% accuracy and a 0.961 F-measure on the 2013 Moore Tornado and Hurricane Sandy case study.
LIMITATIONS: One remaining question is when would be the right time to build a fake and spam tweet predictor while streaming data is arriving.

SR. No. 03
TITLE: Classifying Fake News Articles Using Natural Language Processing to Identify In-Article Attribution as a Supervised Learning Estimator [3]
PUBLISHER: Terry Traylor, U.S. Marine Corps, Fargo, ND; Jeremy Straub, Gurmeet, Nicholas Snell, Department of Computer Science, North Dakota State University, Fargo, ND (2019 IEEE 13th International Conference on Semantic Computing (ICSC))
DESCRIPTION: 1) The paper presents the research process, technical analysis, technical linguistics work, and classifier performance and results. 2) The paper concludes with a discussion of how the current system will evolve into an influence mining system.
BENEFITS: 1) Influence mining is presented as a technique that can be used to enable fake news detection and even advertising detection. 2) The paper presents the results of a study that produced a limited fake news detection system. 3) The initial performance of the system is encouraging: given that fake news is designed to deceive human targets, an initial classification tool with only one extraction feature seems to do well.
LIMITATIONS: 1) The attribution-based fake news detection tool uses the quote attribution classifier; however, like the attribution classifier, it did not perform well enough for production use. 2) Upon review, some of the missed labels were attributable to fake news documents with no quotations, fake news documents with attributed quotes of inaccurate statements, and fake news documents that quoted or cited other fake news documents. 3) The overall performance results for this system are not as robust as desired.

SR. No. 04
TITLE: Credibility Assessment of Textual Claims on the Web [4]
PUBLISHER: Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, Gerhard Weikum (Max Planck Institute for Informatics, Saarbrücken, Germany)
DESCRIPTION: 1) The paper proposes a method that leverages the joint interaction between the language of articles about the claim and the reliability of the underlying web sources. 2) Experiments with claims from the popular website snopes.com and from reported cases of Wikipedia hoaxes demonstrate the viability of the proposed methods and their superior accuracy over various baselines.
BENEFITS: 1) The paper addresses assessing the credibility of arbitrary claims made in natural-language text, in an open-domain setting without any assumptions about the structure of the claim or the community where it is made. 2) The solution is based on automatically finding sources in news and social media and feeding these into a distantly supervised classifier for assessing the credibility of a claim (i.e., true or fake). 3) The work aims to replace manual verification with an automated system.
LIMITATIONS: Does not examine the role of attribution or speaker information, refined linguistic aspects such as negation, or the article's stance on the claim.

SR. No. 05
TITLE: Message Authentication System for Mobile Messaging Applications [5]
PUBLISHER: Ankur Gupta, Purnendu Prabhat, Rishi Gupta, Sumant Pangotra, Suave Bajaj (Department of Computer Science and Engineering, Model Institute of Engineering and Technology, Jammu, India), 2017 International Conference on Next Generation Computing and Information Systems (ICNGCIS)
DESCRIPTION: The paper proposes a system that enables users to confirm the authenticity of messages received through message sharing applications.
BENEFITS: 1) The Message Authentication System (MAS) builds a hierarchical database of trusted information by mining a multitude of trusted sources on the internet and social media feeds. 2) The user can choose to forward the authenticated message to her contacts with the appended legitimacy index, thereby helping prevent the spread of misinformation.
LIMITATIONS: The proposed system works as a third-party data authentication service capable of working with a wide variety of message sharing applications or social network platforms which involve large-scale data sharing and forwarding.

III. GAP ANALYSIS

In [1], the focus is on user experience, specifically how users engage with fake news and how they come to recognize it, a concept that only a few studies address. The paper proposes a comparatively different approach based on analyzing user behaviour. It uses factors such as trust, loyalty, appearance, and usability in order to identify fake news. It also provides the results of an analysis performed on news sites.

In [3], natural language processing is presented as a novel method for detecting fake news. In 2019, Terry Traylor et al. proposed methods for detecting fake news using natural language processing. The paper benefits future researchers with a detailed technical analysis. It uses the TextBlob and SciPy toolkits to develop a classifier that can detect fake and misleading information. It also reports results on the performance of the classifier as well as the precision of the proposed system.

In [4], the credibility of messages is treated as a crucial problem. The paper, published in 2016 by Kashyap Popat et al., concentrates on the credibility of messages that are being spread. It proposes methods in which news sources are fed to a supervised classifier in order to check credibility. The paper supports its approach with results from experiments on sites such as snopes.com and Wikipedia.

In [5], Rajdev et al. (2015) proposed a system that focuses on detecting spam messages during natural calamities, using classification as well as feature detection on information tweeted during such events. The classification approach covers both flat and hierarchical classification of information as either legitimate or fake, while feature detection identifies a piece of information based on specific features. The proposed system is reported to achieve an accuracy of 96.43 percent.

IV. PROPOSED SYSTEM

We propose a system in which the user is provided with a messaging application that has an extra feature: it identifies the legitimacy of a received message by verifying it on our server. Our server hosts a machine learning model based on word embeddings, trained on data gathered from sources such as news sites and fact-checking sites, with a search engine used when data relevant to the user's message is not available in our database. Our model
transforms the message sent by the user into a numerical vector using the word2vec and doc2vec embedding techniques. Next, it determines the legitimacy of the message using the score generator module, which relies on vector similarity measures such as cosine similarity. The system then attaches a 'fake' or 'real' tag to the user's message and sends it back to the user. The proposed mechanism does not intend to stop the spread of false news; instead, it makes the user aware of the authenticity of the information by providing an application that can identify fake news.

Fig. 1. Workflow Model

V. ARCHITECTURE

The proposed system has three main modules, shown in Fig. 2: the user application, the web server, and the database server. The modules interact with each other. The flow is initiated by the actor (user), who interacts with the user application (an Android application). The application sends a message for verification over HTTP to the web server, where different NLP operations standardize the message and transform it into vector form. This vector is compared with the vectors stored in the database to generate a score. Based on the score, a 'fake' or 'real' tag is attached to the message and sent back to the actor.

Fig. 2. System Architecture

A. User application
This module is a messaging application developed for Android. It provides basic chat functionality, such as sending and receiving messages, with the extra functionality of message verification.

B. System Server
This module is responsible for identifying a received message as either 'real' or 'fake' before returning it to the user. Before being sent back, the message goes through the stages described in modules C to E.

C. Tokenization and Standardization
In this module, the received text message is first broken into tokens and then converted into a standardized form by applying different NLP techniques.

D. Vectorization
The standardized message is transformed into a numerical vector using a pre-trained machine learning model based on word embedding techniques, namely word2vec and USE (Universal Sentence Encoder).

E. Score Generator and Tag Attacher
The vector of the given message is compared with the vectors of the data stored in the database. This module uses vector similarity algorithms such as cosine similarity, which gives results from 0 to 1. The scores are averaged; if the average is greater than 0.5, the message is identified as real, otherwise as fake. The respective 'real' or 'fake' tag is attached to the message, which is sent back to the actor (user).

F. Database Server
This module is responsible for storing the data in vector form. The data gathered from different sources is converted into vectors before being stored in the database.

G. Data Collection
The data is collected from different sources: fact-checking sites such as snopes.com, news sites (by web scraping), and a search engine.

H. Data Pre-processing
This module deals with pre-processing of the data. Operations such as tokenization and standardization are performed to convert all the data into a standard form.

I. Data Vectors
Word embedding techniques such as word2vec and USE are used to convert the text data into numerical vector form. The transformed data is then stored in the database.

VI. METHODOLOGY

Experts have proposed different approaches to identify the authenticity of a message. We found that the semantics of a message play an important role in finding its context. The context gives the intent of the overall message, which is essential for identifying the message as real or fake.

It is difficult to compare two sentences in their original form to find the semantic difference between them, so we use a method called word embedding, which is based on finding the relations between neighbouring words and transforms the given text into numerical vectors. Data is easier to compare in vector form than in text form. We therefore adopted a methodology that uses word-embedding-based techniques to capture the semantics of the given data.

A. Word Embeddings
Word embedding [6] is a form of word representation that bridges the gap between human language and machines. It represents words or phrases as points in an n-dimensional space, called vectors or encoded vectors. Word embedding is an approach that learns from a corpus of data and finds the encoded vector for each word in such a manner that words with similar meaning or context have similar vector representations. This contrasts with the traditional bag-of-words model, where each unique word has a distinct representation regardless of how it is used. Word embedding rests on a deeper linguistic theory, the "distributional hypothesis" of Zellig Harris [10], which can be summarized as: words that have similar contexts will have similar meanings.

B. Word2Vec
Word2Vec is a method for constructing word embeddings. It was developed by Tomas Mikolov et al. [7] at Google in 2013 to make neural-network-based training of embeddings more efficient. It aims to give each word a numeric representation that captures the relations between words. These relations can be of different kinds, such as synonyms, antonyms, or analogies like the one shown in Fig. 3.
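The analogy relation that Word2Vec embeddings are meant to capture, and the cosine similarity measure named for the score generator in module E, can both be illustrated with a few hand-made vectors. The following is a minimal sketch, not the authors' implementation: the three-dimensional vectors in `TOY_VECTORS` and the function names `cosine_similarity`, `solve_analogy`, and `tag_message` are hypothetical, and returning 'fake' when there are no stored vectors is an assumption the paper does not state. Cosine similarity lies between -1 and 1 in general; the example vectors passed to `tag_message` here are non-negative, so the scores fall between 0 and 1 as described in module E.

```python
import math

# Hypothetical 3-dimensional toy embeddings, hand-picked for illustration.
# Real word2vec vectors are learned from a corpus and are much longer.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a different word.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


def tag_message(message_vector, stored_vectors, threshold=0.5):
    """Average the cosine scores against the stored vectors and attach a tag."""
    if not stored_vectors:
        # Assumption: with nothing to compare against, treat the message as unverified.
        return "fake", 0.0
    scores = [cosine_similarity(message_vector, s) for s in stored_vectors]
    average = sum(scores) / len(scores)
    return ("real" if average > threshold else "fake"), average


if __name__ == "__main__":
    # man - king + queen should land nearest to "woman" in this toy space.
    print(solve_analogy("man", "king", "queen", TOY_VECTORS))
    # A message vector close to the stored vectors is tagged "real".
    print(tag_message([0.9, 0.8, 0.1], [[0.9, 0.8, 0.1], [0.1, 0.9, 0.1]]))
```

In a real deployment the message and stored vectors would come from a trained embedding model rather than a hand-written table, and the 0.5 threshold would need to be validated against labelled data.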
Fig. 3. Word Embedding

Here, woman = man - king + queen.

Word2Vec builds embeddings using two models: Skip-gram [8] and Continuous Bag of Words (CBOW).

• Continuous Bag of Words (CBOW)
CBOW learns the embedding of the current word by sliding a window around it and predicting the current word from its surrounding context words. Consider this example: "Word2Vec is a word embedding technique." Let the target word be Word2Vec. Here we try to predict the target word (Word2Vec) from the context input words {word, embedding, technique}. More specifically, we use the one-hot encodings of the input words and measure the output error against the one-hot encoding of the target word (Word2Vec). In the process of predicting the target word, we learn the vector representation of the target word.

Fig. 4. CBOW Algorithm Sketch

• Skip-gram
This approach is the opposite of CBOW: instead of predicting the current word from the surrounding words, it predicts the surrounding words from the current word. In the example above, we predict the vectors for {word, embedding, technique} from the input word {Word2Vec}.

Fig. 5. Skip Gram

Both models have their own advantages and disadvantages. According to Mikolov, Skip-gram works well with a small amount of data and represents rare words well. On the other hand, CBOW is faster and has better representations for more frequent words.

C. Doc2Vec
Doc2vec [9] can be considered an extension of word2vec; here we find the encoded vector for each document, paragraph, or sentence. Compared to words, documents are difficult to represent in numerical form because they do not always come in a structured manner, so a method was needed to handle such unstructured documents. The concept that Mikolov and Le used was simple: they took the word2vec model and added another vector (the Paragraph ID shown in Fig. 6). Like word2vec, doc2vec also has two methods.

• Distributed Memory version of Paragraph Vector (PV-DM)
PV-DM is similar to CBOW, but instead of using only words to predict the next word, it adds another feature vector that is unique to the document. So, when the word vectors W are trained, the document vector D is trained as well, and at the end of training it holds a numeric representation of the document.

Fig. 6. PV-DM

• Distributed Bag of Words version of Paragraph Vector (PV-DBOW)
PV-DBOW is similar to the skip-gram model of word2vec. This algorithm is actually faster than word2vec and consumes less memory, since there is no need to save the word vectors. The doc2vec models may be used in the following way: for training, a set of documents is required. A word vector W is generated for each word, and a document vector D is generated for each document.

Fig. 7. PV-DBOW

VII. CONCLUSION

In this paper, we surveyed previously proposed fake news detection systems. The gaps observed in existing studies led to the revised system architecture, which is explained in detail. A conceptual framework is proposed based on the gaps in fake news detection for the public. The framework compares the existing methods for detecting fake political news or any other fake news on social media. Building on these results, a fake news identification application is proposed and its practical implications are discussed.

REFERENCES

[1] Jaigris Hodson, Brian Traynor, "Design Exploration of Fake News: A Transdisciplinary Methodological Approach to Understanding Content Sharing and Trust on Social Media," 2018 IEEE International Professional Communication Conference.
[2] Terry Traylor, Jeremy Straub, Gurmeet and Nicholas Snell, "Classifying Fake News Articles Using Natural Language Processing to Identify In-Article Attribution as a Supervised Learning Estimator," in IEEE (ICSC), 2019.
[3] Terry Traylor, Jeremy Straub, Gurmeet and Nicholas Snell, "Classifying Fake News Articles Using Natural Language Processing to Identify In-Article Attribution as a Supervised Learning Estimator," in IEEE (ICSC), 2019.
[4] Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen and Gerhard Weikum, "Credibility Assessment of Textual Claims on the Web," in ACM, 2016.
[5] Meet Rajdev and Kyumin Lee, "Fake and Spam Messages: Detecting Misinformation during Natural Disasters on Social Media," in IEEE/WIC/ACM International Conference, 2015.
[6] Ye Zhang, Md Mustafizur Rahman, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Nguyen, Dan Xu, Byron Wallace and Matthew Lease, "Neural Information Retrieval: A Literature Review," 2016.
[7] T. Mikolov, K. Chen, G. Corrado, J. Dean, "Efficient Estimation of Word Representations in Vector Space," Proc. Workshop at ICLR, 2013.
[8] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," Proc. NIPS, 2013.
[9] Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents," Proc. ICML, 2014.
[10] Zellig Harris, "Distributional structure," Word, 1954.
[11] Tomas Mikolov, "Statistical Language Models based on Neural Networks," PhD thesis, Brno University of Technology, 2012.
[12] Tomas Mikolov, Quoc V. Le and Ilya Sutskever, "Exploiting similarities among languages for machine translation," CoRR, abs/1309.4168, 2013.
[13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean, "Distributed representations of phrases and their compositionality," in Advances on Neural Information Processing Systems, 2013.