

FakeNewsNet: A Data Repository with News Content, Social Context, and
Spatiotemporal Information for Studying Fake News on Social Media

Article in Big Data · June 2020
DOI: 10.1089/big.2020.0062



FakeNewsNet: A Data Repository with News Content, Social
Context and Spatiotemporal Information for Studying Fake News
on Social Media
Kai Shu¹, Deepak Mahudeswaran¹, Suhang Wang², Dongwon Lee² and Huan Liu¹
¹ Arizona State University, Tempe, 85281, USA
² Pennsylvania State University, University Park, PA, 16802, USA
{kai.shu, dmahudes, huan.liu}@asu.edu, {szw494, dongwon}@psu.edu

Abstract
Social media has become a popular means for people to consume and share news. At the same
time, however, it has also enabled the wide dissemination of fake news, i.e., news with intentionally false
information, causing significant confusion and disruption in society. To mitigate this problem,
research on (computational) fake news detection has recently received a lot of attention. Despite several
existing computational solutions for detecting fake news, the lack of comprehensive and
community-driven fake news benchmark datasets has become one of the major roadblocks. Not only are existing
datasets scarce, but they also lack many of the features often required in such studies, such as news
content, social context, and spatiotemporal information. Therefore, in this paper, to facilitate fake news
related research, we present a fake news benchmark data repository, named FakeNewsNet, which
contains two comprehensive datasets with diverse features in news content, social context, and spatiotemporal
information. We present a detailed description of FakeNewsNet, demonstrate an exploratory
analysis of the two datasets from varying perspectives, and discuss the benefits of FakeNewsNet for potential
applications in the study of fake news on social media. The latest version of FakeNewsNet is available at:
https://github.com/KaiDMML/FakeNewsNet

1 Introduction
Social media has become a primary source of news consumption. Social media is cost-free, easy
to access, and disseminates posts quickly. Hence, it is an excellent way for individuals to post and/or
consume information. For example, the time individuals spend on social media is continually increasing¹. As
another example, studies from the Pew Research Center show that around 68% of Americans got some of their
news on social media in 2018², and this share has increased steadily since 2016. Since there is no regulatory
authority on social media, the quality of news pieces spread on social media is often lower than that of
traditional news sources. In other words, social media also enables the wide spread of fake news. Fake news [18, 31]
is false information that is spread deliberately to deceive people. Fake news affects individuals
as well as society as a whole. First, fake news can disturb the authenticity balance of the news ecosystem.
Second, fake news persuades consumers to accept false or biased stories. For example, some individuals and
organizations spread fake news on social media for financial and political gain [1, 2]. It has also been reported
that fake news influenced the 2016 US presidential election³. Finally, fake news may have significant
effects on real-world events. For example, “Pizzagate”, a piece of fake news from Reddit, led to a real
shooting⁴. Thus, fake news detection is a critical issue that needs to be addressed.
1 https://www.socialmediatoday.com/marketing/how-much-time-do-people-spend-social-media-infographic
2 http://www.journalism.org/2018/09/10/news-use-across-social-media-platforms-2018/
3 https://www.independent.co.uk/life-style/gadgets-and-tech/news/tumblr-russian-hacking-us-presidential-election-fake-news-internet-research-agency-propaganda-bots-a8274321.html
4 https://www.rollingstone.com/politics/politics-news/anatomy-of-a-fake-news-scandal-125877/

Detecting fake news on social media presents unique challenges. First, fake news pieces are intentionally
written to mislead consumers, which makes it insufficient to detect fake news from news content alone.
Thus, we need to explore information in addition to news content, such as user engagements and the social
behaviors of users on social media. For example, a credible user’s comment that “This is fake news” is
a strong signal that the news may be fake. Second, the research community lacks datasets that contain
spatiotemporal information to understand how fake news propagates over time in different regions, how users
react to fake news, and how we can extract useful temporal patterns for (early) fake news detection and
intervention. Thus, it is necessary to have comprehensive datasets that contain news content, social context,
and spatiotemporal information to facilitate fake news research. However, to the best of our knowledge,
existing datasets only cover one or two of these aspects.
Therefore, in this paper, we construct and publicize a multi-dimensional data repository, FakeNewsNet⁵,
which currently contains two datasets with news content, social context, and spatiotemporal information.
The datasets are constructed using an end-to-end system, FakeNewsTracker⁶ [27]. The FakeNewsNet
repository has the potential to boost the study of various open research problems related to fake news.
First, the rich set of features in the datasets provides an opportunity to experiment with different
approaches for fake news detection, to understand the diffusion of fake news in social networks, and to
intervene in it. Second, the temporal information enables the study of early fake news detection by generating
synthetic user engagements from historical temporal user engagement patterns in the dataset [15]. Third, we can
investigate the fake news diffusion process by identifying provenances and persuaders, and develop better fake
news intervention strategies [21]. Our data repository can serve as a starting point for many exploratory
studies of fake news, and provide a better, shared insight into disinformation tactics. We aim to continuously
update this data repository, expand it with new sources and features, and maintain its completeness.
The main contributions of the paper are:

• We construct and publicize a multi-dimensional data repository to facilitate various lines of fake news
research, such as fake news detection, evolution, and mitigation;
• We conduct an exploratory analysis of the datasets from different perspectives to demonstrate the
quality of the datasets, understand their characteristics, and provide baselines for future fake news
detection; and
• We discuss the benefits of FakeNewsNet and provide insights for potential fake news studies on social
media.

2 Background and Related Work


Fake news detection in social media aims to extract useful features and build effective models from existing
social media datasets for detecting fake news in the future. Thus, a comprehensive and large-scale dataset
with multi-dimensional information on the online fake news ecosystem is important. The multi-dimensional
information not only provides more signals for detecting fake news but can also be used for research such as
understanding fake news propagation and fake news intervention. Though there exist several datasets for fake
news detection, the majority of them contain only linguistic features. Few of them contain both linguistic
and social context features. To facilitate research on fake news, we provide a data repository that includes
not only news content and social context, but also spatiotemporal information. For a better comparison
of the differences, we list existing popular fake news detection datasets below and compare them with the
FakeNewsNet repository in Table 1.

• BuzzFeedNews⁷: This dataset comprises a complete sample of news published on Facebook by 9
news agencies over a week close to the 2016 U.S. election, from September 19 to 23 and September 26
and 27. Every post and the linked article were fact-checked claim-by-claim by 5 BuzzFeed journalists.
It contains 1,627 articles: 826 mainstream, 256 left-wing, and 545 right-wing articles.
5 https://github.com/KaiDMML/FakeNewsNet
6 http://blogtrackers.fulton.asu.edu:3000/#/about
7 https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data

• LIAR⁸: This dataset [26] is collected from the fact-checking website PolitiFact. It has 12.8K
human-labeled short statements, which are labeled into six categories
ranging from completely false to completely true: pants on fire, false, barely-true, half-true, mostly
true, and true.
• BS Detector⁹: This dataset is collected from a browser extension called BS Detector, developed for
checking news veracity. It searches all links on a given web page for references to unreliable sources by
checking against a manually compiled list of domains. The labels are the outputs of BS Detector,
rather than of human annotators.
• CREDBANK¹⁰: This is a large-scale crowd-sourced dataset [13] of around 60 million tweets that
cover 96 days starting from Oct. 2015. The tweets are related to over 1,000 news events. Each event
is assessed for credibility by 30 annotators from Amazon Mechanical Turk.
• BuzzFace¹¹: This dataset [17] is collected by extending the BuzzFeed dataset with comments related
to news articles on Facebook. The dataset contains 2,263 news articles and 1.6 million comments.
• FacebookHoax¹²: This dataset [23] comprises information related to posts from Facebook pages
related to scientific news (non-hoax) and conspiracy pages (hoax), collected using the Facebook Graph
API. The dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific) with more than
2,300,000 likes.

We provide a comparison in Table 1 to show that no existing public dataset provides all features of
news content, social context, and spatiotemporal information. Existing datasets have limitations that
FakeNewsNet addresses. For example, BuzzFeedNews only contains headlines and text for each news piece
and covers news articles from very few news agencies. The LIAR dataset contains mostly short statements
instead of entire news articles with meta attributes. The BS Detector data is collected and annotated by
a news veracity checking tool, rather than by human expert annotators. The CREDBANK dataset
was originally collected for evaluating tweet credibility; the tweets in the dataset are not related to
fake news articles and hence cannot be effectively used for fake news detection. The BuzzFace dataset has
basic news content and social context information but does not capture temporal information. The
FacebookHoax dataset contains very few instances about conspiracy theories and scientific news.
To address the disadvantages of existing fake news detection datasets, the proposed FakeNewsNet
repository collects multi-dimensional information on news content, social context, and spatiotemporal
information from different types of news domains, such as political and entertainment sources.

Table 1: Comparison with existing fake news detection datasets

Features News Content Social Context Spatiotemporal Information
Dataset Linguistic Visual User Post Response Network Spatial Temporal
BuzzFeedNews ✓
LIAR ✓
BS Detector ✓
CREDBANK ✓ ✓ ✓ ✓ ✓
BuzzFace ✓ ✓ ✓ ✓
FacebookHoax ✓ ✓ ✓ ✓
FakeNewsNet ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

8 https://www.cs.ucsb.edu/~william/software.html
9 https://github.com/bs-detector/bs-detector
10 http://compsocial.github.io/CREDBANK-data/
11 https://github.com/gsantia/BuzzFace
12 https://github.com/gabll/some-like-it-hoax

Figure 1: The flowchart of dataset integration process for FakeNewsNet. It mainly describes the collection
of news content, social context and spatiotemporal information.

3 Dataset Integration
In this section, we introduce the process that integrates datasets to construct the FakeNewsNet repository.
We demonstrate (see Figure 1) how we collect news contents with reliable ground truth labels, and how we
obtain additional social context and spatiotemporal information.

3.1 News Content


To collect reliable ground truth labels for fake news, we use fact-checking websites such as PolitiFact¹³
and GossipCop¹⁴ to obtain news contents for fake news and true news. In PolitiFact, journalists and domain
experts review political news and provide fact-checking evaluation results that label news articles as fake¹⁵
or real¹⁶. We use these labels as ground truth for fake and real news pieces. PolitiFact’s fact-checking
evaluation results provide the source URLs of the web pages that published the news articles, which can
be used to fetch the corresponding news contents. In some cases, the web pages of source news
articles have been removed and are no longer available. To tackle this problem, we i) check whether the removed
page was archived and automatically retrieve its content from the Wayback Machine¹⁷; and ii) use Google web
search in an automated fashion to identify the news article that is most related to the actual news. GossipCop
is a website for fact-checking entertainment stories aggregated from various media outlets. GossipCop rates
each news story on a scale of 0 to 10, from fake to real. From our
observation, almost 90% of the stories from GossipCop have scores less than 5, mainly because the
purpose of GossipCop is to showcase more fake stories. In order to collect true entertainment news pieces,
we crawl news articles from E! Online¹⁸, which is a well-known trusted media website for publishing
entertainment news pieces. We consider all the articles from E! Online as real news. We collect all
the news stories from GossipCop with rating scores less than 5 as the fake news stories.
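
The archive fallback in step i) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' crawler: it assumes the Wayback Machine's public availability endpoint, and the helper names `availability_url` and `closest_snapshot` are hypothetical. The network request is deliberately left out; the second helper only parses a response that has already been fetched.

```python
import json
from urllib.parse import urlencode

# Public availability endpoint of the Wayback Machine. Using this endpoint is
# an assumption; the paper does not say which interface its crawler calls.
WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the lookup URL for an archived copy of `page_url`."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp  # format: YYYYMMDD[hhmmss]
    return WAYBACK_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the URL of the closest archived snapshot, or None if absent."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

When `closest_snapshot` returns `None`, a pipeline of this kind would fall through to step ii), the web search.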
Since GossipCop does not explicitly provide the URL of the source news article, we similarly search for
the news headline in Google or the Wayback Machine archive to obtain the news source information. The
headlines of GossipCop news articles are generally written to state the fact and so may not be usable
directly. For example, one of
the headlines, “Jennifer Aniston NOT Wearing Brad Pitts Engagement Ring, Despite Report”, states the
fact instead of the original news article’s title. We use some heuristics to extract proper headlines, such
as i) using the text in quoted strings; ii) removing negative sentiment words. For example, some headlines
include quoted strings that are exact text from the original news source. In this case, we extract the
named entities from the headline using the CoreNLP tool [12] and combine them with the quoted strings to
form the search query. For
example, from the headline Jennifer Aniston, Brad Pitt NOT “Just Married” Despite Report, we extract the
named entities Jennifer Aniston and Brad Pitt and the quoted string Just Married, and form the search
query “Jennifer Aniston Brad Pitt Just Married”, because the quoted text together with the named entities
mostly provides the context of the original news. As another example, some headlines are written in a
negative sense to correct the false information, e.g., “Jennifer Aniston NOT Wearing Brad Pitts Engagement
Ring, Despite Report”. So we remove negative sentiment words retrieved from SentiWordNet [3], as well as some
hand-picked words, from the headline to form the search query, e.g., “Jennifer Aniston Wearing Brad Pitts
Engagement Ring”.

13 https://www.politifact.com/
14 https://www.gossipcop.com/
15 https://www.politifact.com/subjects/fake-news/
16 https://www.politifact.com/truth-o-meter/rulings/true/
17 https://archive.org/web/
18 https://www.eonline.com/
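
The two headline heuristics can be sketched roughly as below. This is an illustrative approximation only: the word set `NEGATIVE_WORDS` is a hand-picked stand-in for the SentiWordNet lookup, the named entities are assumed to be supplied by the caller (for instance from a tool such as CoreNLP), and both function names are hypothetical.

```python
import re

# Stand-in for the negative words taken from SentiWordNet plus the hand-picked
# list mentioned in the text; this particular set is an assumption.
NEGATIVE_WORDS = {"not", "despite", "report"}

# Matches text enclosed in straight or curly double quotes.
QUOTED = re.compile(r'["“]([^"”]+)["”]')


def query_from_quotes(entities, headline):
    """Heuristic i): named entities followed by the quoted strings."""
    quoted = QUOTED.findall(headline)
    return " ".join(list(entities) + quoted)


def query_without_negatives(headline):
    """Heuristic ii): drop negative words (and punctuation) from the headline."""
    tokens = re.findall(r"[A-Za-z0-9']+", headline)
    return " ".join(t for t in tokens if t.lower() not in NEGATIVE_WORDS)
```

Applied to the two example headlines in the text, these helpers reproduce the queries “Jennifer Aniston Brad Pitt Just Married” and “Jennifer Aniston Wearing Brad Pitts Engagement Ring”.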

Table 2: Statistics of the FakeNewsNet repository

Category                    Type        Feature                         PolitiFact Fake  PolitiFact Real  GossipCop Fake  GossipCop Real
News Content                Linguistic  # News articles                 432              624              5,323           16,817
News Content                Linguistic  # News articles with text       420              528              4,947           16,694
News Content                Visual      # News articles with images     336              447              1,650           16,767
Social Context              User        # Users posting tweets          95,553           249,887          265,155         80,137
Social Context              User        # Users involved in likes       113,473          401,363          348,852         145,078
Social Context              User        # Users involved in retweets    106,195          346,459          239,483         118,894
Social Context              User        # Users involved in replies     40,585           186,675          106,325         50,799
Social Context              Post        # Tweets posting news           164,892          399,237          519,581         876,967
Social Context              Response    # Tweets with replies           11,975           41,852           39,717          11,912
Social Context              Response    # Tweets with likes             31,692           93,839           96,906          41,889
Social Context              Response    # Tweets with retweets          23,489           67,035           56,552          24,955
Social Context              Network     # Followers                     405,509,460      1,012,218,640    630,231,413     293,001,487
Social Context              Network     # Followees                     449,463,557      1,071,492,603    619,207,586     308,428,225
Social Context              Network     Average # followers             1299.98          982.67           1020.99         933.64
Social Context              Network     Average # followees             1440.89          1040.21          1003.14         982.80
Spatiotemporal Information  Spatial     # User profiles with locations  217,379          719,331          429,547         220,264
Spatiotemporal Information  Spatial     # Tweets with locations         3,337            12,692           12,286          2,451
Spatiotemporal Information  Temporal    # Timestamps for news           296              167              3,558           9,119
Spatiotemporal Information  Temporal    # Timestamps for response       171,301          669,641          381,600         200,531

3.2 Social Context


The user engagements related to the fake and real news pieces from fact-checking websites are collected using
the search APIs provided by social media platforms, such as Twitter’s Advanced Search API¹⁹. The search
queries for collecting user engagements are formed from the headlines of news articles, with special characters
removed to filter out noise. We search for tweets using queries containing all the
words in the headline to ensure the relevance of the resulting tweets. In addition, the URLs mentioned in
the collected tweets are used as search queries to collect additional tweets, which reduces
the bias of collecting data using keywords only. After we obtain the social media posts that directly spread
news pieces, we further fetch the user responses to these posts, such as replies, likes, and reposts. In
addition, once we have all the users engaged in the news dissemination process, we collect all the metadata
for user profiles, user posts, and the social network information.
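
The query construction described above can be illustrated with a short sketch. It is only an approximation under stated assumptions: “special characters” is taken to mean anything other than ASCII letters, digits, and whitespace, the all-words check is done locally rather than by the search API, and the function names are hypothetical.

```python
import re


def headline_to_query(headline):
    """Remove special characters from a headline and collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", headline)
    return " ".join(cleaned.split())


def mentions_all_words(tweet_text, query):
    """True if every word of the query occurs in the tweet (case-insensitive)."""
    tweet_words = set(headline_to_query(tweet_text).lower().split())
    return all(word in tweet_words for word in query.lower().split())
```

For example, the headline “President Trump in Moon Township, Pennsylvania!” would yield the query `President Trump in Moon Township Pennsylvania`, and only tweets containing all six words would be kept.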

3.3 Spatiotemporal Information


The spatiotemporal information includes spatial and temporal information. For spatial information, we
obtain the locations explicitly provided in user profiles. For temporal information, we record
the timestamps of user engagements, which can be used to study how fake news pieces propagate on social
media, and how the topics of fake news change over time. Since fact-checking websites periodically
publish newly checked news articles, we dynamically collect these newly added news pieces and update
the FakeNewsNet repository accordingly. In addition, we periodically collect the user engagements for all
the news pieces in the FakeNewsNet repository, such as recent social media posts and second-order
user behaviors such as replies, likes, and retweets. For example, we run the news content crawler and update
the tweet collection every day. The spatiotemporal information provides useful and comprehensive information
for studying the fake news problem from a temporal perspective.

19 https://twitter.com/search-advanced?lang=en

Figure 2: The word clouds of the news body text for fake and real news on PolitiFact and GossipCop. Panels: (a) PolitiFact fake news; (b) PolitiFact real news; (c) GossipCop fake news; (d) GossipCop real news.

4 Data Analysis
FakeNewsNet has multi-dimensional information related to news content, social context, and spatiotemporal
information. In this section, we first provide a preliminary quantitative analysis to illustrate the features
of FakeNewsNet. We then perform fake news detection using several state-of-the-art models to evaluate the
quality of the FakeNewsNet repository. The detailed statistics of the FakeNewsNet repository are shown in
Table 2.

4.1 Assessing News Content


Since fake news attempts to spread false claims in news content, the most straightforward means of detecting
it is to find clues in a news article. First, we analyze the topic distribution of fake and real news articles.
From Figures 2(a) and 2(b), we can observe that the fake and real news in the PolitiFact dataset is mostly
related to political campaigns. For the GossipCop dataset, from Figures 2(c) and 2(d), we observe that
the fake and real news is mostly related to gossip about relationships among celebrities. In addition, we
can see that the topics of fake news and real news are slightly different in general. However, for a specific
news piece, it is difficult to detect fake news using only the topics in the content [18], which calls for
other auxiliary information such as social context.
We also explore the distribution of publishers who publish fake news in both datasets. For PolitiFact, we find
that there are in total 301 publishers publishing 432 fake news pieces, among which 191 publishers
publish only 1 piece of fake news, and 40 publishers publish at least 2 pieces of fake news, such as
theglobalheadlines.net and worldnewsdailyreport.com. For GossipCop, there are in total 209 publishers publishing 6,048
fake news pieces, among which 114 publishers publish only 1 piece of fake news, and 95 publishers
publish at least 2 pieces of fake news, such as hollywoodlife.com and celebrityinsider.org. The reason may be
that these fact-checking websites try to identify check-worthy breaking news events regardless of the
publishers, and that fake news publishers can be shut down after they are reported to publish fake news pieces.

4.2 Comparing Social Contexts of Fake and Real News
Social context represents the news proliferation process over time, which provides useful auxiliary information
to infer the veracity of news articles. Generally, there are three major aspects of the social media context
that we want to represent: user profiles, user posts, and network structures. Next, we perform an exploratory
study of these aspects on FakeNewsNet and introduce the potential usage of these features to help fake news
detection.

4.2.1 User Profiles

Figure 3: The distribution of user profile creation dates on the PolitiFact and GossipCop datasets. Panels: (a) PolitiFact dataset; (b) GossipCop dataset.

User profiles on social media have been shown to be informative for fake news detection [22]. Research
has also shown that fake news pieces are likely to be created and spread by non-human accounts, such as
social bots or cyborgs [18, 20]. We illustrate some user profile features in the FakeNewsNet repository.
First, we explore whether the creation times of user accounts differ between fake news and true news.
We compute the time elapsed between account registration and the current date, and the results are shown in
Figure 3. We can see that the account creation time distribution of users posting fake news is significantly
different from that of users who post real news, with a p-value < 0.05 under a t-test. We also notice that
neither older nor newer accounts consistently post fake or real news more often.
For example, the mean time since account creation for users posting fake news (2214.09) is larger than that
for real news (2166.84) in PolitiFact, while we see the opposite in the GossipCop dataset.
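
A comparison of this kind can be sketched with the standard library alone. The text does not say which t-test variant was used, so the sketch below assumes Welch's unequal-variance form, and it approximates the p-value with a normal distribution, which is only reasonable for large samples such as these. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t, p) for a two-sided Welch's t-test.

    The p-value uses a normal approximation to the t distribution, so it
    should only be trusted when both samples are large.
    """
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

Here `sample_a` and `sample_b` would hold the account ages of users who posted fake and real news respectively; a returned `p` below 0.05 corresponds to the significance criterion used in the text.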

Figure 4: A comparison of bot scores on users related to fake and real news on PolitiFact dataset.

Next, we take a deeper look into the user profiles and assess the effects of social bots. We randomly selected
10,000 users who posted fake and real news and performed bot detection using Botometer [5], one of the
state-of-the-art bot detection algorithms. Botometer²⁰ takes a Twitter username as input, uses various
features extracted from metadata, and outputs a probability score in [0, 1] indicating how likely the user
is to be a bot. We set a threshold of 0.5 on the bot score returned by Botometer to determine
bot accounts. Figure 4 shows the ratio of bot and human users involved in tweets related to fake and
real news. We can see that bots make up a larger share of the users who tweet about fake news than of those
who tweet about real news. For
example, almost 22% of users involved in fake news are bots, while only around 9% of users are predicted to be
bots for real news. Similar results were observed with different thresholds on bot scores on both
datasets. This indicates that there are bots on Twitter spreading fake news, which is consistent with the
observation in [20]. In addition, most users that spread fake news (around 78%) are still more likely to be
humans than bots (around 22%), which is also consistent with the findings in [24].

Figure 5: Ternary plots of the ratio of the positive, neutral and negative sentiment replies for fake and real news. Panels: (a) PolitiFact dataset; (b) GossipCop dataset.
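
The bot share reported above amounts to thresholding the per-user scores. A minimal sketch follows, assuming the scores have already been obtained from a bot detector and that a score equal to the threshold counts as a bot (the text does not specify this boundary case); `bot_ratio` is a hypothetical helper name.

```python
def bot_ratio(bot_scores, threshold=0.5):
    """Fraction of users whose bot score is at or above the threshold."""
    if not bot_scores:
        raise ValueError("bot_scores must not be empty")
    bots = sum(1 for score in bot_scores if score >= threshold)
    return bots / len(bot_scores)
```

Calling the function with several values of `threshold` gives a quick way to check whether the fake-versus-real gap is sensitive to the cutoff.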

4.2.2 Post and Response


People express their emotions or opinions towards fake news through social media posts, such as skeptical
opinions, sensational reactions, etc. These features are important signals for studying fake news and
disinformation in general [9, 14].
We perform sentiment analysis on the replies to user posts that spread fake news and real news using
one of the state-of-the-art unsupervised sentiment prediction tools, VADER²¹ [8]. Figure 5 shows the
relationship between positive, neutral and negative replies for all news articles. For each news piece, we
obtain all of its replies and predict the sentiment of each as positive, negative, or neutral. Then
we calculate the ratio of positive, negative, and neutral replies for the news piece. For example, if a news
piece has a sentiment distribution of replies of [1/3, 1/3, 1/3], it lies at the very center of the
triangle in Figure 5(a). We can also see that real news has more neutral replies than positive
or negative replies, whereas fake news has a larger ratio of negative replies. For the sentiment
of the replies in the GossipCop dataset, shown in Figure 5(b), we cannot observe any significant differences
between fake and real news. This could be because it is difficult for ordinary people to distinguish fake
from real entertainment news.
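
The per-article sentiment ratios plotted in the ternary diagrams can be computed as sketched below. The sketch assumes each reply has already been scored with a VADER-style compound score in [-1, 1], and it uses the conventional ±0.05 cutoffs for labeling; whether the authors used these exact cutoffs is not stated. The function names are hypothetical.

```python
from collections import Counter


def label_reply(compound, cutoff=0.05):
    """Map a compound sentiment score to positive, negative, or neutral."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


def sentiment_ratios(compound_scores):
    """Return the (positive, neutral, negative) shares, which sum to 1."""
    if not compound_scores:
        raise ValueError("compound_scores must not be empty")
    counts = Counter(label_reply(c) for c in compound_scores)
    total = len(compound_scores)
    return tuple(counts[k] / total for k in ("positive", "neutral", "negative"))
```

Because the three shares always sum to 1, each news piece maps to a single point inside the triangle.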
We analyze the distribution of likes, retweets, and replies of tweets, which can help gain insights into user
interaction related to fake and real news. Social science studies have theorized the relationship between user
behaviors and their perceived beliefs about information on social media [10]. For example, the behaviors of
liking and retweeting are more emotional, while replying is more rational.
We plot ternary triangles that illustrate the ratio of replies, retweets, and likes from the second-order
engagements towards the posts that spread fake news or real news pieces. From Figure 6, we observe
that: i) fake news pieces tend to have fewer replies and more retweets; ii) real news pieces have a higher
ratio of likes than fake news pieces, which may indicate that users are more likely to agree with real news. The
differences in the distribution of user behaviors between fake news and real news have the potential to help
study the characteristics of users’ beliefs. FakeNewsNet provides real-world datasets for understanding the
social factors of user engagements and the underlying social science as well.

20 https://botometer.iuni.iu.edu/
21 https://github.com/cjhutto/vaderSentiment

Figure 6: Ternary plots of the ratio of likes, retweets, and replies of tweets related to fake and real news. Panels: (a) PolitiFact dataset; (b) GossipCop dataset.

Figure 7: The distribution of the count of followers and followees related to fake and real news. Panels: (a) follower count of users in the PolitiFact dataset; (b) followee count of users in the PolitiFact dataset; (c) follower count of users in the GossipCop dataset; (d) followee count of users in the GossipCop dataset.

4.2.3 Networks
Users tend to form different networks on social media in terms of interests, topics, and relations, which serve
as the fundamental paths for information diffusion [18]. Fake news dissemination processes tend to form an
echo chamber cycle, highlighting the value of extracting network-based features to represent these types of
network patterns for fake news detection [6].
We look at the social network statistics of all the users that spread fake news or real news. Social
network features such as follower count and followee count can be used to estimate how far
fake news can spread on social media. We plot the distribution of follower count and followee count of users
in Figure 7. We can see that: i) the follower and followee counts of the users generally follow a power law
distribution, which is commonly observed in social network structures; ii) there is a spike in the followee
count distribution of both groups of users, which is due to the restriction imposed by Twitter²² that users
can have at most 5,000 followees when their number of followers is less than 5,000.
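
One way to check a power-law claim of this kind numerically is to estimate the exponent from the counts. The sketch below is not taken from the paper: it uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)), as a stand-in technique, treats the integer counts as continuous values, and leaves the choice of `x_min` to the caller. Both function names are hypothetical, and the sampler exists only to generate synthetic data.

```python
import math
import random


def power_law_alpha(values, x_min):
    """Continuous maximum-likelihood estimate of the power-law exponent."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, alpha, x_min, seed=0):
    """Draw n synthetic values by inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]
```

On real follower counts, the estimate depends strongly on `x_min`, so it is worth repeating the fit for several cutoffs before drawing conclusions.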

4.3 Characterizing Spatiotemporal Information

Figure 8: The comparison of temporal user engagements of fake and real news. Panels: (a) temporal user engagements of fake news; (b) temporal user engagements of real news.

Recent research has shown users’ temporal responses can be modeled using deep neural networks to
help detection fake news [16], and deep generative models can generate synthetic user engagements to help
early fake news detection [11]. The spatiotemporal information in FakeNewsNet depicts the temporal user
engagements for news articles, which provides the necessary information to further study the utility of using
spatiotemporal information to detect fake news.
First, we investigate whether the temporal user engagements, such as posts, replies, and retweets, are different for
fake news and real news with similar topics, e.g., the fake news “TRUMP APPROVAL RATING Better than
Obama and Reagan at Same Point in their Presidencies” from June 9, 2018 to June 13, 2018 and the real news
“President Trump in Moon Township Pennsylvania” from March 10, 2018 to March 20, 2018. As shown in
Figure 8, we can observe that: i) for fake news, there is a sudden increase in the number of retweets, which
then remains constant after a short time, whereas for real news there is a steady increase in
the number of retweets; ii) fake news pieces tend to receive fewer replies than real news. Table 2 shows a
similar pattern: replies account for 5.76% of all tweets for fake news and 7.93% for real
news. The differences in diffusion patterns for temporal user engagements have the potential to determine
the threshold time for early fake news detection. For example, if we can predict the sudden increase of user
engagements, we should use the user engagements before that time point to detect fake news accurately and
limit the extent of fake news spreading [21].
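
The idea of locating a sudden increase in engagements can be illustrated with a simple heuristic. The sketch below buckets engagement timestamps by hour and flags the first hour whose count clearly exceeds the average of the preceding hours; the function names, the `factor` and `min_count` thresholds, and the sample timestamps are hypothetical and are not part of FakeNewsNet.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(timestamps):
    """Count engagements per hour, measured from the first engagement."""
    start = min(timestamps)
    buckets = Counter(int((t - start).total_seconds() // 3600) for t in timestamps)
    return [buckets.get(h, 0) for h in range(max(buckets) + 1)]


def first_burst_hour(counts, factor=3.0, min_count=5):
    """Return the first hour with at least `min_count` engagements and at least
    `factor` times the mean of all earlier hours (mean floored at 1), else None."""
    for h in range(1, len(counts)):
        baseline = sum(counts[:h]) / h
        if counts[h] >= min_count and counts[h] >= factor * max(baseline, 1.0):
            return h
    return None


# Hypothetical retweet times: a quiet start followed by a burst in hour 3.
t0 = datetime(2018, 6, 9, 8, 0)
hour_offsets = [0, 1, 2, 2] + [3] * 12 + [4, 4]
retweet_times = [t0 + timedelta(hours=h) for h in hour_offsets]

counts = hourly_counts(retweet_times)
burst = first_burst_hour(counts)
print(counts, burst)
```

Under this heuristic, only engagements observed before the returned hour would be available to an early detection model; a learned predictor of the burst would replace the fixed thresholds in practice.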
Next, we demonstrate the geo-location distribution of users engaging in fake and real news (see Figure 9
for the PolitiFact dataset). We show the locations explicitly provided by users in their profiles, and we can see
that users in the PolitiFact dataset who post fake news have a different distribution than those posting real
news. Since locations explicitly provided by users are usually sparse, we can further consider the location
information attached to tweets, and even utilize existing approaches for inferring the locations [28]. It
would be interesting to explore how users are geographically distributed using the FakeNewsNet repository from
different perspectives.

4.4 Fake News Detection Performance


In this subsection, we utilize the PolitiFact and GossipCop datasets from the FakeNewsNet repository to perform
fake news detection. We use 80% of the data for training and 20% for testing. For evaluation, we use accuracy
and F1 score.
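
The evaluation protocol can be made concrete with a short sketch. The code below performs a shuffled 80/20 split and computes accuracy and F1 from predicted labels; the random seed, the choice of the fake class (label 1) as the positive class for F1, and the toy labels are assumptions made for illustration, since the text does not specify them.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = fake, 0 = real (hypothetical, not FakeNewsNet results).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

train_ids, test_ids = train_test_split(range(10))
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(len(train_ids), len(test_ids), acc, f1)
```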

22 https://help.twitter.com/en/using-twitter/twitter-follow-limit

(a) Spatial distribution for fake news (b) Spatial distribution for real news

Figure 9: Spatial distribution of users posting tweets related to fake and real news in PolitiFact dataset.

Table 3: Fake news detection performance on FakeNewsNet

Model    PolitiFact          GossipCop
         Acc.     F1         Acc.     F1
SVM      0.580    0.659      0.497    0.595
LR       0.642    0.633      0.648    0.646
NB       0.617    0.651      0.624    0.649
CNN      0.629    0.583      0.723    0.725
SAF/S    0.654    0.681      0.689    0.703
SAF/A    0.667    0.619      0.635    0.706
SAF      0.691    0.706      0.689    0.717

• News content: To evaluate the news content, the text of each source news article is represented
as a one-hot encoded vector, and we then apply standard machine learning models including
support vector machines (SVM), logistic regression (LR), Naive Bayes (NB), and CNN. For SVM, LR,
and NB, we use the default settings provided in scikit-learn and do not tune parameters. For
CNN, we use the standard implementation with default settings23. We also evaluate the classification
of news articles using the Social Article Fusion (SAF/S) [27] model, which utilizes an autoencoder to learn
features from news articles and classify new articles as fake or real.
• Social context: In order to evaluate the social context, we utilize a variant of the SAF model [27], i.e.,
SAF/A, which utilizes the temporal pattern of the user engagements to detect fake news.

• News content and social context: The Social Article Fusion (SAF) model, which combines SAF/S and
SAF/A, is used. This model uses an autoencoder with two layers of LSTM cells for both the encoder and the
decoder, and the temporal pattern of the user engagements is captured using another network
of LSTM cells with two layers.

The experimental results are shown in Table 3. We can see that: i) among news content-based methods,
SAF/S performs better in terms of accuracy and F1 score in most cases. SAF/A provides a similar result
to SAF/S, with around 66.7% accuracy on PolitiFact. The compared baseline models provide reasonably good
performance for fake news detection, with accuracy mostly between 60% and 65% on PolitiFact; ii) we observe that
SAF achieves accuracy at least as high as both SAF/S and SAF/A on both datasets. For example, on PolitiFact SAF
has around 5.65% and 3.60% relative improvement in accuracy over SAF/S and SAF/A, respectively.
This indicates that user engagements can help fake news detection in addition to news articles on
the PolitiFact dataset.
In summary, FakeNewsNet provides multiple dimensions of information that have the potential to help
researchers develop novel algorithms for fake news detection.
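
The improvement figures quoted for PolitiFact are relative gains rather than differences in percentage points. As a small worked check, assuming the accuracy values reported in Table 3 (the function and variable names are illustrative only):

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, expressed in percent."""
    return (new - old) / old * 100.0


# PolitiFact accuracy values copied from Table 3.
politifact_acc = {"SAF/S": 0.654, "SAF/A": 0.667, "SAF": 0.691}

gain_over_s = relative_improvement(politifact_acc["SAF"], politifact_acc["SAF/S"])
gain_over_a = relative_improvement(politifact_acc["SAF"], politifact_acc["SAF/A"])
print(f"SAF vs SAF/S: {gain_over_s:.2f}%  SAF vs SAF/A: {gain_over_a:.2f}%")
```

The two values come out at roughly 5.66% and 3.60%, which agrees with the figures in the text up to rounding.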

5 Data Structure
In this section, we describe in detail the structure of FakeNewsNet. We introduce the data format
and provide API interfaces that allow for efficient downloading of the dataset under the policies of social media
platforms.
23 https://github.com/dennybritz/cnn-text-classification-tf

5.1 API Interfaces
The full dataset is massive, and the actual content cannot be directly distributed because of Twitter’s sharing
policy24. The dataset25 is referenced using a DOI26 and adheres to the FAIR Data Principles27. The APIs are
provided in the form of multiple well-documented Python scripts, and CSV files with news content
URLs and associated tweet IDs are provided as well. To initiate the download, the user simply needs to
run the main.py file with the required configuration. The APIs make use of Twitter access tokens
to fetch information related to tweets. These APIs can help to download specific subsets of the dataset, such
as linguistic content, tweet information, retweet information, user information, and social network. Since
Twitter does not provide APIs to download replies and likes of tweets, web scraping tools can be used.
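
To illustrate how a CSV file of news URLs and tweet IDs might be consumed before any download starts, the sketch below parses such a file into a dictionary. The column names (`id`, `news_url`, `title`, `tweet_ids`), the tab separator inside the `tweet_ids` field, and the two sample rows are assumptions made for this example; the files shipped with the repository should be checked for their actual layout.

```python
import csv
import io

# Hypothetical two-row sample; real column names and separators may differ.
SAMPLE_CSV = (
    "id,news_url,title,tweet_ids\n"
    "politifact0001,example.com/story-a,Story A,111\t222\t333\n"
    "politifact0002,example.com/story-b,Story B,\n"
)


def load_news_index(csv_text):
    """Map each news ID to its URL, title, and list of integer tweet IDs."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # An empty field yields an empty list rather than [''].
        tweet_ids = [int(t) for t in row["tweet_ids"].split("\t") if t]
        index[row["id"]] = {
            "url": row["news_url"],
            "title": row["title"],
            "tweet_ids": tweet_ids,
        }
    return index


index = load_news_index(SAMPLE_CSV)
for news_id, entry in index.items():
    print(news_id, entry["url"], len(entry["tweet_ids"]))
```

Only identifiers are stored in such a file; the tweet objects themselves still have to be fetched through the Twitter API with the user's own access tokens, in line with the sharing policy mentioned above.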

5.2 Data Format


The news pieces from different platforms/domains are stored in different directories. For example, the
gossipcop/fake directory contains fake news samples from the GossipCop dataset. Each news piece has its own
directory, named with the associated auto-generated news ID, which contains the following structure: news_article.json file,
tweets folder, retweets folder, replies folder, and likes folder.

• news_article.json includes all the meta information of the news articles collected using the provided
news source URLs. This is a JSON object with attributes including:

– text is the text of the body of the news article.
– images is a list of the URLs of all the images in the news article web page.
– publish_date indicates the publishing date of that article.

• tweets folder contains the metadata of the list of tweets associated with the news article. Each file in
this folder contains the tweet objects returned by the Twitter API.
• retweets folder includes a list of files containing the retweets of tweets posting the news articles. Each
file is named <tweet_id>.json and has a list of retweet objects collected using the Twitter API.
• replies folder contains files with the replies and conversation threads of tweets sharing the news,
including reply text, user details, and reply timestamps.
• likes folder comprises files containing a list of IDs for users who have liked each of the tweets sharing
the news article.

In addition, we store the metadata of all users, including profiles, historical tweets, followers, and followees,
in the following folders. Each of these folders contains files named <user_id>.json, each holding the
details of a particular user. Note that we only provide the metadata of 5000 users at the provided link due to
space limitations.
• user_profiles folder includes files containing all the metadata of the users in the dataset. Each file
in this directory is a JSON object collected from the Twitter API containing information about the user,
including profile creation time, geolocation of the user, profile image URL, followers count, followees
count, number of tweets posted, and number of tweets favorited.
• user_timeline_tweets folder includes JSON files containing the list of at most 200 recent tweets posted
by the user. This includes the complete tweet object with all information related to the tweet.
• user_followers folder includes JSON files containing a list of user IDs of users following a particular
user.
• user_following folder includes JSON files containing a list of user IDs a particular user follows.
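
The per-news directory layout described in this subsection can be read with a short loader. The sketch below builds a tiny example directory in a temporary location and loads the article metadata together with the retweet files; the underscore spelling of the file name (`news_article.json`), the sample news ID, and the JSON contents are assumptions for illustration and may not match the downloaded data exactly.

```python
import json
import tempfile
from pathlib import Path


def load_news_item(news_dir):
    """Load the article JSON and a {tweet_id: retweet list} map for one news piece."""
    news_dir = Path(news_dir)
    with open(news_dir / "news_article.json", encoding="utf-8") as fh:
        article = json.load(fh)
    retweets = {}
    retweet_dir = news_dir / "retweets"
    if retweet_dir.is_dir():
        for path in sorted(retweet_dir.glob("*.json")):
            with open(path, encoding="utf-8") as fh:
                # The file stem is taken to be the ID of the retweeted tweet.
                retweets[path.stem] = json.load(fh)
    return article, retweets


# Build a hypothetical news directory so the example runs without the dataset.
with tempfile.TemporaryDirectory() as root:
    item = Path(root) / "gossipcop" / "fake" / "gossipcop-0001"
    (item / "retweets").mkdir(parents=True)
    sample_article = {"text": "Body text", "images": [], "publish_date": None}
    (item / "news_article.json").write_text(json.dumps(sample_article), encoding="utf-8")
    (item / "retweets" / "111.json").write_text(json.dumps([{"id": 999}]), encoding="utf-8")
    article, retweets = load_news_item(item)

print(article["text"], retweets)
```

The same pattern extends to the tweets, replies, and likes folders, and to the user-level folders keyed by user ID.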
24 https://developer.twitter.com/en/developer-terms/agreement-and-policy
25 To access the dataset, we have published a code implementation, available at https://github.com/KaiDMML/FakeNewsNet,
that allows users to download specific subsets of the data.
26 https://doi.org/10.7910/DVN/UEMMHS
27 https://www.force11.org/group/fairgroup/fairprinciples

6 Potential Applications
FakeNewsNet contains multi-dimensional information which could be useful for many applications. We
believe FakeNewsNet would benefit the research community in studying various topics such as (early) fake
news detection, fake news evolution, fake news mitigation, and malicious account detection.

6.1 Fake News Detection


One of the challenges for fake news detection is the lack of a labeled benchmark dataset with reliable ground
truth labels and a comprehensive information space, based on which we can capture effective features and
build models. FakeNewsNet can help the fake news detection task because it has reliable labels annotated
by journalists and domain experts, and multi-dimensional information from news content, social context, and
spatiotemporal information.
First, news contents are the fundamental sources to find clues to differentiate fake news pieces. For
example, studies have shown that news contents can be modeled with tensor embedding in a semi-supervised
or unsupervised manner to detect fake news [30, 32]. In addition, news representation can be obtained with
deep neural networks to improve fake news detection [33, 35]. In FakeNewsNet, we provide various attributes
of news articles such as publishers, headlines, body texts, and images/videos. This information can be used
to extract different linguistic features and visual features to further build detection models for clickbait
or fake news. Since we directly collect news articles from fact-checking websites such as PolitiFact and
GossipCop, we provide detailed explanations from the fact-checkers, which are useful for learning common and
specific perspectives on how fake news pieces are formed.
Second, user engagements represent the news proliferation process over time, which provides useful
auxiliary information to infer the veracity of news articles [34]. Generally, there are three major aspects of the
social context: users, generated posts, and networks. Since fake news pieces are likely to be created and
spread by non-human accounts, such as bots [20], capturing users’ profiles and characteristics can
provide useful information for fake news detection. Also, people express their emotions or opinions towards
fake news through social media posts, and thus we collect all the user posts for news pieces, as well as
engagements such as reposts, comments, and likes, which can be utilized to extract abundant features to capture
fake news patterns. Moreover, fake news dissemination processes tend to form an echo chamber cycle,
highlighting the value of extracting network-based features to represent these types of network patterns for fake
news detection. We provide a large-scale social network of all the users involved in the news dissemination
process.
Third, early fake news detection aims to give early alerts of fake news during the dissemination process
before it reaches a broad audience [11]. Therefore, early fake news detection methods are highly desirable
and socially beneficial. For example, capturing the pattern of user engagements in the early phases could
be helpful to achieve the goal of unsupervised detection. Recent approaches utilize advanced deep
generative models to generate synthetic user comments to help improve fake news detection performance [15].
FakeNewsNet contains all these types of information, which provides the potential to further explore early fake
news detection models. In addition, FakeNewsNet contains two datasets from different domains, i.e., political
and entertainment, which can help to study common and different patterns for fake news under different
topics. Moreover, being able to explain prediction results is important for decision makers to mitigate fake
news, and FakeNewsNet has multiple sources of signals which can be exploited as explainable factors (e.g., user
comments) [25].

6.2 Fake News Evolution


The fake news diffusion process also has different stages in terms of people’s attention and reactions as
time goes by, resulting in a unique life cycle. For example, breaking news and in-depth news demonstrate
different life cycles in social media [4], and social media reactions can help predict future visitation patterns
of news pieces accurately even at an early stage. We can gain a deeper understanding of how particular
stories “go viral” from normal public discourse by studying the fake news evolution process. First, tracking
the life cycle of fake news on social media requires recording essential trajectories of fake news diffusion in
general [19]. Thus, FakeNewsNet has collected the related temporal user engagements, which can keep track
of these trajectories. Second, for a specific news event, the related topics may keep changing over time and
be diverse for fake news and real news. FakeNewsNet dynamically collects associated user engagements
and allows us to perform comparison analysis (e.g., see Figure 8), and further investigate distinct temporal
patterns to detect fake news [16]. Moreover, statistical time series models such as temporal point processes
can be used to characterize different stages of user activities of news engagements [7]. FakeNewsNet enables
temporal modeling from real-world datasets, which is otherwise impossible with synthetic datasets.
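
As one concrete instance of a temporal point process, the sketch below evaluates the conditional intensity of a univariate Hawkes process with an exponential kernel, in which every past engagement temporarily raises the rate of future engagements. The Hawkes model is chosen here purely for illustration and is not claimed to be the model of [7]; the parameter values `mu`, `alpha`, and `beta` and the event times are hypothetical.

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Intensity at time t: base rate `mu` plus an exponentially decaying
    contribution `alpha * exp(-beta * (t - ti))` from each earlier event ti."""
    return mu + sum(
        alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )


# Hypothetical engagement times (in hours) forming a short burst.
events = [1.0, 1.2, 1.3]

before_burst = hawkes_intensity(0.5, events)
during_burst = hawkes_intensity(1.4, events)
long_after = hawkes_intensity(5.0, events)
print(before_burst, during_burst, long_after)
```

The intensity equals the base rate before any engagement, peaks right after the burst, and decays back towards the base rate, which is the kind of stage structure that fitted point process parameters could summarize for fake and real news.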

6.3 Fake News Mitigation


Fake news mitigation aims to reduce the negative effects brought by fake news. During the spreading process
of fake news, users play different roles, such as provenances, the sources or originators who publish fake news
pieces; persuaders, who spread fake news with supporting opinions; and clarifiers, who propose skeptical
and opposing viewpoints towards fake news and try to clarify them. Identifying key users on social media
is important to mitigate the effect of fake news [29]. For example, provenances can help answer questions
such as whether the piece of news has been modified during its propagation. In addition, it is necessary to
identify influential persuaders to limit the spread scope of fake news by blocking the information flow from
them to their followers [21]. FakeNewsNet provides rich information about users who post, like, and comment
on fake and real news pieces (see Figure 6), which enables the identification of different types of
users.
To mitigate the effect of fake news, network intervention aims to develop strategies to control the
widespread dissemination of fake news before it goes viral. Two major strategies of network intervention
are: i) Influence Minimization: minimizing the spread scope of fake news during the dissemination process; ii)
Mitigation Campaign: limiting the impact of fake news by maximizing the spread of true news.
FakeNewsNet allows researchers to build a diffusion network with spatiotemporal information and can facilitate a
deeper understanding of how to minimize the influence scope. Furthermore, we may be able to identify the fake news
and real news pieces for a specific event from FakeNewsNet and study the effect of mitigation campaigns in
real-world datasets.

6.4 Malicious Account Detection


Studies have shown that malicious accounts that can amplify the spread of fake news include social bots,
trolls, and cyborg users. Social bots can give a false impression that information is highly popular and
endorsed by many people, which enables the echo chamber effect for the propagation of fake news. We can
study the nature of users who spread fake news and identify the characteristics of bot accounts used in the fake
news diffusion process through FakeNewsNet. Using features such as user profile metadata and historical tweets
of users who spread fake news, along with the social network, one could analyze the differences in characteristics
of users to cluster them as malicious or not. Through a preliminary study in Figure 4, we have shown that
bot users are more likely to exist in the fake news spreading process. Although existing works have studied
bot detection in general, few studies investigate the influence of social bots on fake news spreading.
FakeNewsNet could potentially facilitate the study of the relationship between fake news and
social bots, and further explore the mutual benefits of studying fake news detection and bot detection.

7 Conclusion and Future Work


In this paper, we provide a comprehensive repository, FakeNewsNet, which contains news content, social
context, and spatiotemporal information. We propose a principled strategy to collect relevant data from different
sources. Moreover, we perform a preliminary exploration study of various features on FakeNewsNet and
demonstrate its utility through a fake news detection task over several state-of-the-art baselines. FakeNewsNet
has the potential to facilitate many promising research directions such as fake news detection, mitigation,
evolution, and malicious account detection.
There are several interesting options for future work. First, the FakeNewsNet repository can be extended
to other reliable news sources such as other fact-checking websites or curated data collections. Second, the
selection strategy can be used for web search results to reduce noise in the data collection process. Third,
the FakeNewsNet repository can be integrated with front-end software to build an end-to-end system for fake
news study.

Acknowledgments
This material is in part supported by the NSF awards #1909555, #1614576, #1742702, #1820609, and
#1915801.

References
[1] Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. arXiv
preprint arXiv:1804.08559 (2018).
[2] Srijan Kumar, Robert West, and Jure Leskovec. 2016. Disinformation on the web: Impact, characteristics,
and detection of Wikipedia hoaxes. In WWW’16.
[3] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical
resource for sentiment analysis and opinion mining. In LREC, Vol. 10. 2200–2204.
[4] Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer, and Matt Stempeck. Characterizing the life cycle
of online news stories using social media reactions. In CHI’14.
[5] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. BotOrNot:
A system to evaluate social bots. In WWW’16.
[6] Michela Del Vicario, Gianna Vivaldo, Alessandro Bessi, Fabiana Zollo, Antonio Scala, Guido Caldarelli,
and Walter Quattrociocchi. 2016. Echo chambers: Emotional contagion and group polarization on
Facebook. Scientific Reports 6 (2016), 37825.
[7] Mehrdad Farajtabar, Jiachen Yang, Xiaojing Ye, Huan Xu, Rakshit Trivedi, Elias Khalil, Shuang Li, Le
Song, and Hongyuan Zha. 2017. Fake news mitigation via point process based intervention. arXiv preprint
arXiv:1703.07823 (2017).
[8] CJ Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media
text. In ICWSM’14.
[9] Zhiwei Jin, Juan Cao, Yongdong Zhang, and Jiebo Luo. News Verification by Exploiting Conflicting
Social Viewpoints in Microblogs. In AAAI’16.
[10] Antino Kim and Alan R Dennis. 2017. Says Who?: How News Presentation Format Influences Perceived
Believability and the Engagement Level of Social Media Users. (2017).
[11] Yang Liu and Yi-fang Brook Wu. Early Detection of Fake News on Social Media Through Propagation
Path Classification with Recurrent and Convolutional Networks. In AAAI’18.
[12] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky.
2014. The Stanford CoreNLP natural language processing toolkit. In ACL’14. 55–60.
[13] Tanushree Mitra and Eric Gilbert. CREDBANK: A Large-Scale Social Media Corpus With Associated
Credibility Annotations. In ICWSM’15.
[14] Vahed Qazvinian, Emily Rosengren, Dragomir R Radev, and Qiaozhu Mei. Rumor has it:
Identifying misinformation in microblogs. In EMNLP’11.
[15] Feng Qian, ChengYue Gong, Karishma Sharma, and Yan Liu. Neural User Response Generator: Fake
News Detection with Collective User Intelligence. In IJCAI’18.
[16] Natali Ruchansky, Sungyong Seo, and Yan Liu. CSI: A hybrid deep model for fake news detection. In
CIKM’17.
[17] Giovanni C Santia and Jake Ryland Williams. BuzzFace: A News Veracity Dataset with Facebook User
Commentary and Egos. In ICWSM’18.
[18] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social
media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19, 1 (2017), 22–36.
[19] Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. Hoaxy: A
platform for tracking online misinformation. In WWW’16.
[20] Chengcheng Shao, Giovanni Luca Ciampaglia, Onur Varol, Alessandro Flammini, and Filippo Menczer.
2017. The spread of fake news by social bots. arXiv preprint arXiv:1707.07592 (2017).
[21] Kai Shu, H. Russell Bernard, and Huan Liu. 2018. Studying Fake News via Network Analysis: Detection
and Mitigation. CoRR abs/1804.10233 (2018).
[22] Kai Shu, Suhang Wang, and Huan Liu. 2018. Understanding user profiles on social media for fake news
detection. In 2018 IEEE MIPR. IEEE, 430–435.
[23] Eugenio Tacchini, Gabriele Ballarin, Marco L Della Vedova, Stefano Moret, and Luca de Alfaro.
2017. Some like it hoax: Automated fake news detection in social networks. arXiv preprint
arXiv:1704.07506 (2017).
[24] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359,
6380 (2018), 1146–1151.
[25] Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. dEFEND: Explainable fake news
detection. In KDD 2019.
[26] William Yang Wang. 2017. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection.
arXiv preprint arXiv:1705.00648 (2017).
[27] Kai Shu, Deepak Mahudeswaran, and Huan Liu. FakeNewsTracker: a tool for fake news collection,
detection, and visualization. In CMOT’18.
[28] Arkaitz Zubiaga, Alex Voss, Rob Procter, Maria Liakata, Bo Wang, and Adam Tsakalidis. 2017. Towards
real-time, country-level location classification of worldwide tweets. IEEE Transactions on Knowledge and
Data Engineering 29, 9 (2017), 2053–2066.
[29] Mustafa Alassad, Muhammad Nihal Hussain, and Nitin Agarwal. 2019. Finding Fake News Key
Spreaders in Complex Social Networks by Using Bi-Level Decomposition Optimization Method. In International
Conference on Modelling and Simulation of Social-Behavioural Phenomena in Creative Societies.
[30] Gisel Bastidas Guacho, Sara Abdali, Neil Shah, and Evangelos E Papalexakis. 2018. Semi-supervised
Content-based Detection of Misinformation via Tensor Embeddings. In ASONAM.
[31] Kai Shu and Huan Liu. Detecting fake news on social media. In Synthesis Lectures on Data Mining
and Knowledge Discovery, 2019.
[32] Seyedmehdi Hosseinimotlagh and Evangelos E Papalexakis. 2018. Unsupervised Content-Based
Identification of Fake News Articles with Tensor Decomposition Ensembles. (2018).
[33] Hamid Karimi, Proteek Roy, Sari Saba-Sadiya, and Jiliang Tang. 2018. Multi-Source Multi-Class Fake
News Detection. In COLING.
[34] Kai Shu, Suhang Wang, and Huan Liu. Beyond News Contents: The Role of Social Context for Fake
News Detection. In WSDM’19.
[35] Hamid Karimi and Jiliang Tang. 2019. Learning Hierarchical Discourse-level Structure for Fake News
Detection. arXiv preprint arXiv:1903.07389 (2019).
