Misinformation Detection and Fact Checking
on Social Media
Dissertation submitted to the
University College of Engineering (A), Osmania University.
In partial fulfilment for the award of the degree
of
BACHELOR OF ENGINEERING
In
COMPUTER SCIENCE AND ENGINEERING
Submitted by
MANEESHA Y (100521733033)
MADIHA FIRDOUS (100521733032)
N SRI CHANDANA (100521733040)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY COLLEGE OF ENGINEERING (A),
OSMANIA UNIVERSITY, HYDERABAD, TS-500007
MAY – 2025
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY COLLEGE OF ENGINEERING,
OSMANIA UNIVERSITY
CERTIFICATE
This is to certify that this is the bonafide work of Maneesha Y, Madiha Firdous, and
N Sri Chandana, bearing roll numbers 100521733033, 100521733032, and 100521733040
respectively, for their major project, in partial fulfilment of the Bachelor of Engineering
degree offered by the Department of Computer Science and Engineering, University College
of Engineering, Osmania University.
We declare that the work reported in the project report entitled “Misinformation Detection and
Fact Checking on Social Media” submitted by Maneesha Y, Madiha Firdous, N Sri Chandana
is a record of the work done by us in the DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING, UNIVERSITY COLLEGE OF ENGINEERING. No part of the report is
copied from books/journals/internet and wherever referred, the same has been duly
acknowledged in the text. The reported data is based on the work done entirely by us and not
copied from any other source or submitted to any other Institute or University for the award
of a degree or diploma.
SIGNATURES
Maneesha Y:
Madiha Firdous:
N Sri Chandana:
ACKNOWLEDGEMENT
It is our privilege and pleasure to express our profound sense of respect, gratitude,
and indebtedness to our guide E. Pragnavi, Department of Computer Science and
Engineering, UCEOU, for their inspiration, guidance, cogent discussion, constructive
criticisms, and encouragement throughout this dissertation work.
ABSTRACT
In response to the challenge of misinformation spreading rapidly on social media, our
project presents a multimodal fake news detection system
that analyses both text and images associated with online posts. Using the Fakeddit dataset,
which contains labelled social media entries with corresponding images and text, we built a
deep learning model that leverages BERT for extracting contextual text embeddings and
ResNet50 for capturing visual features. These features are fused and passed through a neural
network for final classification.
Our experimental setup demonstrated that the multimodal system significantly outperforms
unimodal baselines. The model achieved an accuracy of 86.2%, along with strong precision
and recall scores, indicating its reliability across varied misinformation types. This
improvement highlights the advantage of jointly analysing textual and visual cues, which
often reveal inconsistencies undetectable by single-modality systems.
TABLE OF CONTENTS
S.NO CONTENT
1. Chapter 1: Introduction
1.1. Introduction
1.2. Aim
1.3. Problem Definition
1.4. Motivation
1.5. Deep Learning
1.6. Convolutional Neural Network
1.7. ResNet50 Architecture
1.8. BERT Architecture
2. Chapter 2: Literature Survey
3. Chapter 3: Fundamentals of Fake News & Misinformation
4. Chapter 4: Existing Methods
5. Chapter 5: Proposed Method
6. Chapter 6: Dataset Exploration
7. Chapter 7: Implementation
8. Chapter 8: Results and Analysis
9. Chapter 9: Future Scope
10. References
11. Appendix
CHAPTER I
1. INTRODUCTION
1.1 Introduction
In today’s digital-first society, social media platforms like Twitter, Facebook, and Reddit have
become primary sources of news and information for millions of people. These platforms
enable users to share information instantly and globally, creating a fast-paced environment
for communication. However, this speed and openness also make it easy for false or
misleading information—commonly known as misinformation or fake news—to spread
uncontrollably. The impact of such content can be severe, influencing elections, undermining
public health, and creating unnecessary panic.
Fake news is no longer limited to just manipulated headlines or deceptive articles. Many
posts combine misleading textual information with compelling but unrelated or altered
images, making them more convincing and harder to detect. This multimodal nature of
misinformation presents new challenges for detection systems that analyze content based
solely on one type of input—either text or images. As a result, unimodal systems often fail to
recognize the complete context or underlying deception in such posts.
Recognizing this gap, our project proposes a multimodal deep learning system for detecting
misinformation. The system utilizes BERT (Bidirectional Encoder Representations from
Transformers) to understand the contextual nuances of text, and ResNet50, a powerful
convolutional neural network, to extract meaningful features from images. These two
modalities are fused into a single feature vector, which is then classified using a neural
network to determine whether a post is real or fake.
By combining both visual and textual cues, our approach provides a more accurate and
intelligent method for misinformation detection. The system not only identifies
inconsistencies between modalities but also learns patterns that could indicate manipulation.
This project aims to contribute a scalable, AI-based solution that can support online platforms
and fact-checking tools in mitigating the spread of false information.
1.2 AIM
The aim of this project is to design and develop an intelligent, scalable, and efficient deep
learning system for the detection of misinformation and fake news on social media platforms.
As online platforms become increasingly influential in shaping public opinion, the threat
posed by fake news has grown substantially. Our goal is to harness the power of multimodal
deep learning techniques to accurately assess the credibility of social media content by
analyzing both textual and visual elements.
By leveraging cutting-edge AI models such as BERT for natural language understanding and
ResNet50 for visual interpretation, we intend to create a system capable of identifying
deceptive or manipulated content that may otherwise bypass traditional detection
mechanisms. This approach aims not only to improve detection accuracy but also to provide a
strong foundation for real-time misinformation monitoring tools in large-scale, real-world
applications.
Through these efforts, our project aims to contribute a reliable, data-driven solution to the
growing problem of fake news. By leveraging the synergy of multimodal deep learning, we
strive to set a precedent for future AI systems capable of enhancing digital content credibility
and ensuring safer information environments.
1.3 PROBLEM DEFINITION
Despite numerous advances in machine learning and natural language processing, the
detection of fake news and misinformation remains an unsolved challenge. Traditional fake
news detection models tend to focus on either textual content or visual elements, treating
them in isolation. However, in real-world scenarios, especially on social media,
misinformation often manifests in multimodal forms—combining deceptive text with
unrelated, altered, or emotionally charged images. This disconnect between text and image
can mislead readers, bypass unimodal detection systems, and amplify the spread of harmful
content.
The problem addressed in this project is therefore to design and implement a multimodal
deep learning framework that can process both the text and image components of a post and
classify it as fake or real, with improved accuracy over unimodal baselines.
1.4 MOTIVATION
The motivation for this project stems from the increasingly sophisticated ways in which
misinformation is being propagated. Social media users often consume and share content
without verifying its credibility. Manual fact-checking by journalists and watchdog
organizations, while accurate, is time-consuming, limited in scale, and reactive rather than
preventive.
Several factors motivated our choice to use a multimodal deep learning model:
Text-Image Discrepancies: Many fake posts pair legitimate-sounding text with false
or unrelated images. Analyzing these discrepancies helps in uncovering manipulative
intent.
Higher Classification Accuracy: Combining features from both text and image
domains has been shown to outperform unimodal systems in benchmark studies.
Advanced Architectures: The availability of powerful pretrained models like BERT
for language and ResNet50 for vision tasks makes it feasible to develop sophisticated
solutions using transfer learning.
Societal Impact: The COVID-19 pandemic, political disinformation campaigns, and
environmental hoaxes have demonstrated the urgency of building reliable
misinformation detection tools.
Our work is therefore not only technically ambitious but socially relevant. Through this
project, we aim to contribute a practical solution that could aid researchers, social media
platforms, and policy makers in curbing the spread of harmful fake news content.
1.5 DEEP LEARNING
In the context of our project, deep learning serves as the foundation for building an
intelligent, multimodal fake news detection system. Specifically, we utilize two powerful
deep learning architectures tailored for different data modalities:
Convolutional Neural Networks (CNNs), such as ResNet50, are used to analyse and
extract spatial features from images. These networks can identify textures, objects,
and subtle visual cues that may indicate manipulation or deception in media content.
Transformer-based models, such as BERT, are used to analyse and extract contextual
features from text, capturing the semantic nuances, tone, and intent that may signal
deceptive language.
By combining both CNNs for visual analysis and transformers for textual analysis, our
system takes a dual-modality approach. This allows for a deeper and more nuanced
understanding of the content, enabling the model to detect inconsistencies between images
and text—a common trait of misinformation.
While deep learning offers immense capabilities, it also comes with challenges. Training deep
models requires large labelled datasets, significant computational resources, and careful
hyperparameter tuning. In this project, deep learning not only enables more accurate fake
news detection but also demonstrates its broader applicability in tackling real-world
challenges.
1.6 CONVOLUTIONAL NEURAL NETWORK
Convolutional Neural Networks (CNNs) are a class of deep learning models specifically
designed to process and analyse visual data. They have become the foundation of many state-
of-the-art applications in computer vision, including image classification, object detection,
and facial recognition. CNNs are capable of learning hierarchical features from images, such
as edges, textures, and complex shapes, making them highly effective for understanding and
interpreting visual content.
The architecture of a Convolutional Neural Network (CNN) is structured into three primary
components: the input layer, the hidden layers, and the output layer.
1. Input Layer: The input layer serves as the entry point for the network. It receives the input
image data, which is typically in the form of a matrix of pixel values representing the image's
features.
2. Hidden Layers: The hidden layers are where the core processing of the CNN occurs. They
consist of multiple convolutional and pooling layers.
Convolutional Layers: These layers apply a set of learnable filters (also called
kernels) to the input image. Each filter detects specific patterns or features, such as
edges or textures, by performing convolution operations. This helps the network to
extract hierarchical representations of the input data.
Pooling Layers: After each convolutional layer, a pooling layer is often applied to
downsample the feature maps generated by the convolutional layers. Pooling helps to
reduce the spatial dimensions of the feature maps while retaining important
information, thereby decreasing computational complexity and controlling overfitting.
3. Output Layer: The output layer provides the final results of the network's processing. It
typically consists of one or more fully connected layers that perform classification or
regression tasks based on the features extracted by the hidden layers. For classification tasks,
the output layer produces the predicted class label or probability scores for each class.
The effectiveness of a CNN largely depends on the configuration of its hidden layers—such
as the number of layers, filter size, stride, padding, and pooling strategy. By carefully tuning
these hyperparameters, CNNs can be optimized for both accuracy and efficiency in a wide
range of image-based applications.
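To make these components concrete, the following is a minimal sketch of such a network in PyTorch (the framework and the layer sizes here are illustrative assumptions, not the architecture used in this project):

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Toy CNN mirroring the three components described above."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Hidden layers: convolutions extract local patterns from the
        # input image; pooling downsamples the resulting feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # takes the 3-channel RGB input
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 224x224 -> 112x112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112x112 -> 56x56
        )
        # Output layer: a fully connected classifier over the flattened features.
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)          # hidden layers
        x = torch.flatten(x, 1)       # flatten feature maps to a vector
        return self.classifier(x)     # class scores

# Example: one 224x224 RGB image (the input layer) yields scores for 2 classes.
scores = SimpleCNN()(torch.randn(1, 3, 224, 224))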
In this project, we use ResNet50, a deep CNN with 50 layers and residual connections, which
enables efficient training of very deep networks and improves feature extraction for image-
based misinformation detection.
1.7 RESNET50 ARCHITECTURE
ResNet50 is a deep convolutional neural network architecture that extends the standard CNN
design through the innovative use of residual connections. It is a 50-layer variant of the
ResNet (Residual Network) family and has become a standard in image classification tasks
due to its ability to train very deep networks efficiently. The core innovation of ResNet50 lies
in its ability to overcome the vanishing gradient problem, which often hampers the training
of deep neural networks. This is achieved through skip connections, allowing the model to
learn residual functions rather than direct mappings.
The architecture of ResNet50 can be broken down into several key components:
Convolutional Layers:
These are the foundational layers responsible for extracting features from input
images. The layers apply filters to detect low-level patterns such as edges and
textures. Each convolutional layer is followed by Batch Normalization and ReLU
activation to improve stability and introduce non-linearity. A max pooling operation
is applied early in the architecture to reduce spatial dimensions while preserving
essential features.
Identity and Convolutional Blocks:
These are the central building blocks of ResNet50. An identity block allows the input
to be added directly to the output of the convolutional layers within the block. This
addition is made possible through skip connections, enabling the network to learn
changes (or “residuals”) from the input rather than starting from scratch. The
convolutional block is a slightly modified version that includes a 1×1 convolution to
match dimensions when the input and output sizes differ. These blocks work together
to enable efficient deep learning without performance degradation.
Skip Connections:
The ResNet50 architecture incorporates skip connections, also referred to as residual
connections, as a fundamental aspect. These connections are pivotal in mitigating the issue of
vanishing gradients by facilitating the direct flow of information from the input to the output
of the network, circumventing one or more layers. Consequently, the network is empowered
to learn residual functions that effectively map the input to the desired output, thus alleviating
the need to learn the entire mapping from scratch.
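The residual computation can be sketched as follows. This is a simplified two-convolution identity block in PyTorch; ResNet50's actual blocks use a 1×1, 3×3, 1×1 bottleneck design, so the sketch only illustrates the skip-connection idea:

import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    """Simplified residual block: the skip connection adds input to output."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x                              # saved for the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + residual                      # learn the residual F(x), not the full mapping
        return self.relu(out)

# The block preserves the input shape, so such blocks can be stacked deeply.
y = IdentityBlock()(torch.randn(1, 64, 56, 56))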
In our project, ResNet50 serves as the backbone for visual feature extraction. It processes
each image and outputs a 2048-dimensional feature vector representing the most relevant
and abstract characteristics of the visual content. This vector is then fused with textual
features from BERT to enable multimodal fake news classification. The use of ResNet50
ensures that the system captures subtle visual cues and patterns—such as manipulated
images, misleading graphics, or visual inconsistencies—that may contribute to
misinformation.
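A sketch of this feature-extraction step is shown below, using the pretrained ResNet50 available in torchvision (a recent torchvision version is assumed; our exact weight and preprocessing configuration may differ):

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Pretrained ResNet50 with the final classification layer removed, leaving
# the 2048-dimensional global-average-pooled feature vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def image_features(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)               # shape (1, 3, 224, 224)
    with torch.no_grad():
        feats = extractor(batch)                       # shape (1, 2048, 1, 1)
    return feats.flatten(1)                            # shape (1, 2048)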
1.8 BERT Architecture
At the core of BERT is the Transformer architecture, which relies heavily on self-attention
mechanisms. These attention layers allow the model to assign varying degrees of importance
to different words in a sentence, depending on the context in which they appear. This is
especially useful for interpreting complex or ambiguous language, as it allows BERT to
understand the true meaning of words based on their surroundings. For instance, the word
"bank" could refer to a financial institution or the side of a river—BERT can disambiguate
this based on the sentence.
The architecture of the BERT-base model, which we use in our project, consists of:
12 Transformer encoder layers
12 self-attention heads
768-dimensional hidden size
110 million parameters
In our project, BERT is used to process the textual content of social media posts, extracting
contextual embeddings that represent the semantic meaning of the text. Each post is
tokenized and passed through BERT, and the output from the special [CLS] token (which
stands for "classification") is taken as a 768-dimensional vector representing the entire input
sentence or paragraph. This vector is rich in contextual information and captures the
underlying intent and tone of the post, which is critical for detecting misleading or deceptive
text.
By integrating BERT with image features from ResNet50, our system benefits from a
comprehensive, multimodal representation of each social media post. BERT enhances the
model's ability to understand subtle textual manipulations such as sarcasm, misinformation,
emotional bias, or clickbait-style phrasing—all of which are common tactics used in fake
news. Its pretraining on massive corpora, including Wikipedia and BooksCorpus, gives it a
strong foundational understanding of language, which is fine-tuned in our model for the
specific task of misinformation detection.
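The [CLS]-embedding extraction described above can be sketched with the Hugging Face transformers library as follows (the 128-token limit matches the preprocessing described in Chapter VI; other settings are illustrative):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def text_features(text: str) -> torch.Tensor:
    # Tokenize with padding/truncation to a fixed 128-token sequence.
    inputs = tokenizer(text, return_tensors="pt",
                       padding="max_length", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    # The [CLS] token occupies the first position of the last hidden state.
    return outputs.last_hidden_state[:, 0, :]          # shape (1, 768)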
CHAPTER II
2. LITERATURE SURVEY
The detection of misinformation and fake news using artificial intelligence has become a
critical area of research, especially with the growing influence of social media. Researchers
have explored a wide range of approaches—from natural language processing to multimodal
deep learning—to automate and improve the accuracy of misinformation detection systems.
Shu et al. (2020) – FakeNewsNet: A Data Repository for Fake News Research
Shu and colleagues introduced FakeNewsNet, a benchmark data repository that integrates
textual content, social context, and spatiotemporal signals associated with fake and real news.
Their research demonstrated that fake news often spreads through unique social patterns and
user interactions, suggesting that content alone is insufficient for accurate classification. This
dataset supports the development of models that analyze multiple information streams for
better misinformation detection.
Ahmed et al. (2021) – Transformer Models vs RNNs for Fake News Detection
Ahmed et al. conducted a comparative evaluation of BiLSTM, GRU, and transformer-based
models. The study confirmed that transformer models, particularly BERT, outperformed
traditional RNN-based architectures in accuracy and contextual understanding. This is
attributed to BERT’s bidirectional attention and pretraining on large corpora. Their research
reinforced BERT’s dominance in modern NLP tasks including misinformation detection.
Talwar and Arora (2023) – Real-Time Detection Systems for Crisis Misinformation
This study addressed the importance of real-time misinformation detection, especially
during public health or political crises. The authors emphasized the need for lightweight,
scalable systems capable of handling large volumes of social media data. They also proposed
the integration of browser plugins and content moderation APIs, setting the direction for
practical deployments of fake news detectors.
Summary and Insights
The reviewed literature highlights a clear transition from unimodal, content-only systems to
multimodal architectures that consider both textual and visual modalities. Research
consensus shows that combining text and image improves accuracy, particularly when posts
include conflicting or manipulated media. Our project builds upon these findings by
developing a BERT–ResNet50 based multimodal system, trained and tested on the Fakeddit
dataset, aiming to contribute a scalable, AI-powered solution to misinformation detection.
CHAPTER III
3. FUNDAMENTALS OF FAKE NEWS & MISINFORMATION
Understanding the different forms of false information is crucial for building effective
detection systems:
Misinformation can be broadly classified based on the intent to deceive and the context in
which it is shared:
1. Satire or Parody: Uses humor, irony, or exaggeration to twist the meaning of genuine
content. While often intended to entertain or provide social commentary, it can be
misinterpreted as factual if the context is not clear. For example, satirical news
websites like The Onion create exaggerated, fictional stories that may be taken
seriously if shared without proper context.
7. Fabricated Content: The most extreme form of misinformation, where entirely false
content is created with the intent to deceive and cause harm. Unlike other types,
fabricated content lacks any basis in fact and is often designed to provoke strong
emotional reactions or generate clicks for financial gain.
3.4 Why Multimodal Detection?
The vast majority of misinformation spreads through content that mixes text and images. For
instance, a real image might be paired with deceptive text to create false impressions. A
robust detection system must therefore examine not only the text or the image in isolation but
also how they relate to one another. Key benefits of multimodal detection include:
Health Impact: The spread of false medical information can have dire consequences,
as seen during the COVID-19 pandemic. Misinformation about vaccines, treatments,
and preventive measures can increase public health risks and reduce trust in scientific
institutions.
Detecting misinformation is a complex and evolving challenge, driven by the rapid growth of
digital content and the sophisticated tactics used by malicious actors. Key challenges include:
Rapid Spread: Misinformation can spread quickly on social media, making real-time
detection critical but difficult to achieve.
Language Complexity: Sarcasm, satire, and local language nuances can make
detection challenging, as AI systems may struggle to interpret context accurately.
Scalability and Speed: Effective detection requires scalable systems that can process
large volumes of data without significant delays.
CHAPTER IV
4. EXISTING METHODS
Detecting misinformation and fake news has become a critical area of research in the era of
rapidly growing online content. As the influence of social media and digital platforms
continues to rise, a variety of methods have been explored to identify misleading information.
These approaches range from traditional machine learning models to more advanced deep
learning techniques. Below, we provide an overview of existing methods employed for
misinformation and fake news detection, along with their respective limitations:
4.1 Text-Based Approaches
Traditional methods for fake news detection predominantly focus on analyzing the text
content itself. These models rely on feature extraction techniques such as TF-IDF, Bag of
Words, and n-grams to capture key terms and patterns in the text. Classification algorithms
like Logistic Regression, Naive Bayes, and Support Vector Machines (SVM) are then applied
to the extracted features.
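As a brief illustration of this classical pipeline, the following scikit-learn sketch combines TF-IDF features with Logistic Regression (the two example texts and labels are invented solely for demonstration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples only; a real system would train on a labeled news corpus.
texts = ["scientists confirm new vaccine is safe",
         "SHOCKING: miracle cure the government is hiding"]
labels = [1, 0]  # 1 = real, 0 = fake

# TF-IDF over unigrams and bigrams, then a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["miracle cure confirmed by scientists"]))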
Limitations:
Shallow Analysis: These methods often fail to capture deeper semantic meaning and
complex relationships within the text. For example, they struggle with understanding
sarcasm, irony, or subtle contextual cues that may indicate misinformation.
Vulnerability to Stylistic Manipulation: Text-based models are susceptible to
manipulation through stylistic writing techniques like satire, which may deceive these
models into classifying misleading content as legitimate.
Contextual Challenges: These approaches do not consider the broader context in
which the content is published, such as the source's credibility or historical patterns of
misinformation, making them less effective in some cases.
4.2 Image-Based Approaches
As misinformation increasingly involves visuals, the role of image analysis in fake news
detection has become more significant. Convolutional Neural Networks (CNNs), particularly
architectures such as VGG and ResNet, are commonly used for detecting manipulations and
inconsistencies in images. These models analyze image patterns, including visual artifacts,
inconsistencies, and unusual features, to identify whether an image has been altered to
mislead viewers.
Limitations:
Lack of Text Context: Image-based approaches cannot understand the context
provided by the accompanying text. In cases where images are used with misleading
or false narratives, these models may fail to detect the misinformation without
considering the full multimodal content.
Complexity of Image Manipulations: Advanced image manipulation techniques
(e.g., deepfakes or subtle editing) can be difficult for CNN-based models to detect,
especially when the alterations are subtle or sophisticated.
Generalization Issues: CNNs trained on specific datasets may struggle to generalize
to images from diverse sources or domains, reducing their effectiveness in real-world
applications.
4.3 Ensemble and Hybrid Models
To overcome the limitations of text-based and image-based models, some studies have turned
to ensemble methods that combine multiple classifiers. Ensemble techniques such as Random
Forests and Gradient Boosting Machines are employed to process manually engineered
features. These features may include user metadata, post structure, linguistic cues, and other
indicators that provide contextual information about the content.
Limitations:
Dependence on Manual Feature Engineering: While ensemble models improve
accuracy, they still rely on handcrafted features, which require expert knowledge and
may not fully capture all aspects of misinformation, especially when it involves
nuanced or less obvious patterns.
Complexity and Computational Cost: These models can be computationally
expensive due to the need for processing multiple classifiers and handling large
amounts of manually extracted features. This can limit their scalability and real-time
application.
Performance on Multimodal Misinformation: Although ensemble models combine
various classifiers, they still struggle when it comes to handling complex, multimodal
misinformation—such as content where text and image interplay significantly—
because they do not integrate both modalities effectively.
4.4 Multimodal Deep Learning Approaches
The most recent advancements in fake news detection focus on multimodal deep learning
techniques that process both text and image data simultaneously. These models combine the
strengths of natural language processing (NLP) and computer vision to capture both the
textual and visual components of misleading content. BERT, a powerful model for contextual
text embeddings, is commonly used to extract rich semantic information from text.
Meanwhile, CNNs such as ResNet50 are employed to capture visual features from images.
Limitations:
Data and Computational Requirements: Multimodal models require large, labeled
datasets containing both text and image pairs, which can be difficult to obtain.
Additionally, these models are computationally intensive and may require significant
hardware resources for training and inference.
Model Complexity: The integration of multiple modalities (text and image) increases
the complexity of the model, making it more difficult to train and tune. Fine-tuning
these models requires careful balancing of both modalities to ensure optimal
performance.
Interpretability Issues: While these models outperform unimodal approaches in
terms of accuracy, they may lack interpretability. Understanding how the model
arrives at a decision can be challenging, which can be problematic in domains like
journalism or healthcare where transparency is crucial.
Contextual Fusion Challenges: In some cases, the fusion of text and image data may
not be effective enough to capture all the subtleties of misinformation, particularly
when the misleading nature of the content is not immediately apparent from either
modality in isolation.
CHAPTER V
5. PROPOSED METHOD
Our proposed method for detecting misinformation and fake news leverages deep learning-
based multimodal analysis that integrates textual and visual content. The core objective is to
develop an intelligent system that not only detects misinformation based on individual
modalities (text or image) but also understands the semantic alignment between them,
improving accuracy in real-world social media scenarios.
The system is designed to analyze posts that include both text and images—common in fake
news—and classify them as real or fake based on the coherence of the combined information.
This approach addresses the shortcomings of unimodal systems and moves toward more
context-aware and content-robust detection mechanisms.
We utilize the Fakeddit dataset, a widely recognized benchmark dataset that includes
multimodal posts labelled for fake/real classification. It includes a diverse collection of posts
with text, images, and metadata.
Text Preprocessing:
Removal of special characters and irrelevant tokens.
Tokenization and sequence padding.
Text is processed using the BERT tokenizer to match the input requirements of the
BERT model.
Image Preprocessing:
All images are resized to 224×224 pixels.
Normalization is applied using ImageNet standards, ensuring compatibility with the
ResNet50 model.
Balancing:
To prevent model bias, we ensure the dataset is balanced with an equal number of
fake and real instances.
Our system follows a dual-stream architecture—processing text and image data separately
and then fusing the features for final classification. This modular architecture enhances both
performance and interpretability.
A. Text Feature Extraction – BERT
We employ the bert-base-uncased model to extract rich contextual embeddings.
The [CLS] token output, a 768-dimensional vector, represents the entire input
sentence.
This enables the model to understand nuanced language features, sarcasm, or
misleading phrasing.
B. Image Feature Extraction – ResNet50
We use a pre-trained ResNet50 model with the classification head removed.
The global average pooling layer output is extracted as a 2048-dimensional feature
vector.
This captures spatial and visual patterns that might indicate manipulated or
misleading imagery.
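A minimal sketch of the fusion-and-classification stage is given below (in PyTorch; the hidden-layer width of 512 and the dropout rate are illustrative choices rather than our exact configuration):

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates BERT (768-d) and ResNet50 (2048-d) features, then classifies."""
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),   # logits for real vs. fake
        )

    def forward(self, text_feats, image_feats):
        fused = torch.cat([text_feats, image_feats], dim=1)   # (batch, 2816)
        return self.net(fused)

# Example: one post's text and image features yield two class logits.
logits = FusionClassifier()(torch.randn(1, 768), torch.randn(1, 2048))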
Multimodal Contextual Understanding: The fusion of textual and visual cues enables
the model to detect semantic inconsistencies—e.g., an image depicting one event
accompanied by text describing another.
High Accuracy and Robustness: The use of pretrained models (BERT and ResNet50)
and a carefully designed fusion mechanism significantly improves classification
performance compared to unimodal baselines.
CHAPTER VI
6. DATASET EXPLORATION
For this project, we employed the Fakeddit dataset, a widely recognized benchmark
specifically tailored for multimodal misinformation detection. The dataset was curated
from Reddit, offering a rich and diverse set of social media posts that include both textual
content and corresponding visual media (images).
Multimodal Data Support: Each data sample includes text (post title) and image,
allowing for simultaneous learning from both modalities.
Multi-level Annotations: It provides three levels of labeling granularity: binary
(real/fake), three-way, and six-way fine-grained categories.
Large-Scale: With over 1 million Reddit posts, the dataset offers a high volume of
labeled data, which is critical for training deep learning models without risking
overfitting.
This makes Fakeddit particularly well-suited for deep neural architectures like BERT and
ResNet, which require large, diverse datasets for optimal generalization.
Each data point in the dataset is organized into the following key attributes:
Post ID: A unique alphanumeric identifier used to track each Reddit post.
Title / Text Content: The main user-submitted content, often in the form of a
headline or short paragraph.
Image URL: A direct link to the image attached to the Reddit post.
Label: Ground-truth annotation of the post’s credibility:
1 = Real
0 = Fake
To make the dataset compatible with our deep learning architecture, we performed multiple
preprocessing steps for both text and image components.
Text Preprocessing:
Cleaning: Removed HTML tags, special characters, hyperlinks, and non-ASCII text.
Lowercasing: All words were converted to lowercase to maintain consistency in
tokenization.
Tokenization: Applied the BERT tokenizer from Hugging Face Transformers, which
preserves sub-word representations and uses WordPiece encoding.
Padding & Truncation: Sentences were truncated or padded to a maximum
sequence length of 128 tokens to ensure uniform input size for the BERT model.
Stopword Retention: Unlike traditional NLP pipelines, stopwords were retained,
since BERT benefits from full sentence context.
Image Preprocessing:
Download & Validation: Image URLs were validated. Corrupted or broken URLs
were discarded.
Resizing: All images were resized to 224×224 pixels, the input requirement for
ResNet50.
Normalization: Pixel values were normalized to the ImageNet mean and standard
deviation to align with pre-trained ResNet expectations.
Color Format Conversion: Ensured all images were in RGB format, converting
from grayscale or CMYK if necessary.
Data Filtering: Entries missing either text or image content were removed.
Class Balancing: The original dataset had class imbalance. We applied
undersampling to the majority class, resulting in a balanced dataset with 50% fake
and 50% real posts.
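The filtering and balancing steps can be sketched with pandas as follows (the file name and column names are assumptions about the processed dataset layout, not the actual Fakeddit schema):

import pandas as pd

# One row per post with at least 'title', 'image_path', and 'label' columns.
df = pd.read_csv("fakeddit_subset.csv")

# Data filtering: drop entries missing either modality.
df = df.dropna(subset=["title", "image_path"])

# Class balancing: undersample every class down to the minority-class count.
n = int(df["label"].value_counts().min())
balanced = df.groupby("label").sample(n=n, random_state=42)
print(balanced["label"].value_counts())   # equal counts of real (1) and fake (0)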
Metric | Value
--------------------------|--------
Total Posts Used | 40,000
Real Posts | 20,000
Fake Posts | 20,000
Image-Text Paired Samples | 100%
Image Size (Input to CNN) | 224×224
Max Tokens (BERT Input) | 128
This balance ensures fair training and testing, minimizing bias towards either class during
evaluation.
1. Broken Image Links: Some image URLs no longer existed or led to 404 errors. We
resolved this by filtering such samples.
2. Text-Image Mismatch: In certain samples, the image and text appeared semantically
unrelated — a known trait in fake news used to create false context. This made it
essential to use multimodal learning rather than relying on a single modality.
3. Informal and Noisy Text: Social media language includes slang, emojis, sarcasm,
abbreviations, and grammatical errors. This necessitated the use of context-aware
models like BERT, which can handle informal language.
4. Dataset Size vs. Compute Limits: Although Fakeddit offers over 1 million posts,
resource constraints restricted us to a subset of 40,000 carefully filtered and balanced
samples.
CHAPTER VII
7. IMPLEMENTATION
The implementation realizes the pipeline described in Chapter V: Fakeddit posts are
preprocessed (text cleaned and tokenized for BERT, images resized to 224×224 and
normalized with ImageNet statistics for ResNet50, and the classes balanced), features are
extracted from each modality, and the fused representation is classified as real or fake.
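A condensed training-loop sketch in PyTorch is given below; the fusion head mirrors the Chapter V sketch, and the learning rate and batch contents are illustrative assumptions:

import torch
import torch.nn as nn

# Stand-in fusion head (mirrors the Chapter V sketch): 768-d text and 2048-d
# image features are concatenated and mapped to two logits (real vs. fake).
model = nn.Sequential(
    nn.Linear(768 + 2048, 512), nn.ReLU(), nn.Dropout(0.3), nn.Linear(512, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative rate

def train_epoch(loader):
    """One pass over batches of (text_feats, image_feats, labels) tensors."""
    model.train()
    for text_feats, image_feats, labels in loader:
        optimizer.zero_grad()
        fused = torch.cat([text_feats, image_feats], dim=1)   # (batch, 2816)
        loss = criterion(model(fused), labels)
        loss.backward()                                       # backpropagate
        optimizer.step()

# Example with a single synthetic batch of 4 posts:
batch = [(torch.randn(4, 768), torch.randn(4, 2048), torch.randint(0, 2, (4,)))]
train_epoch(batch)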
CHAPTER VIII
8. RESULTS AND ANALYSIS
1. Accuracy
Formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Explanation:
Accuracy is the most commonly used metric for classification tasks. It measures the
overall proportion of correctly classified instances (both real and fake) out of the total
instances. While useful as a high-level measure, accuracy can be misleading if the
dataset is imbalanced — for example, if there are significantly more real news
samples than fake ones.
2. Precision
Formula:
Precision = TP / (TP + FP)
Explanation:
Precision quantifies how many of the samples that the model predicted as fake are
actually fake. It is particularly important in applications where false positives carry a
higher cost, such as falsely flagging legitimate news as misinformation. High
precision ensures that the model maintains credibility and reduces wrongful
misclassifications.
3. Recall (Sensitivity)
Formula:
Recall = TP / (TP + FN)
Explanation:
Recall quantifies how many of the truly fake samples the model correctly identifies
as fake. High recall is important because a fake post that goes undetected (a false
negative) continues to spread unchecked.
4. F1 Score
Formula:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Explanation:
The F1 Score is the harmonic mean of precision and recall, providing a single metric
that balances both. It is especially useful when there is an uneven class distribution, or
when both false positives and false negatives are equally critical. A high F1 Score
indicates that the model achieves both high precision and high recall, which is ideal
for robust misinformation detection.
5. Confusion Matrix
Explanation:
The confusion matrix is a 2x2 table that provides a complete breakdown of how the
classification model performs with respect to each class. It allows visual inspection of
where the model makes mistakes and helps identify potential biases.
Usage:
The matrix enables a deeper understanding of the model's strengths and weaknesses.
For instance, a model that misclassifies many real news items as fake would have a
high number of false positives, which may indicate an over-aggressive detection
policy.
Metric | Value
-----------------|-------
Accuracy | 93.8%
Precision | 93.4%
Recall | 94.2%
F1 Score | 93.8%
AUC Score | 0.95
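These metrics can be computed from the model's test-set outputs as sketched below with scikit-learn (the y_true and y_score arrays are toy stand-ins for actual ground-truth labels and predicted probabilities):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Toy stand-ins for test-set outputs: ground truth and predicted P(real).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])

y_pred = (y_score >= 0.5).astype(int)        # threshold probabilities at 0.5
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))  # 2x2 TP/FP/FN/TN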
CHAPTER IX
9. FUTURE SCOPE
The proposed model lays a solid foundation for fake news detection. While the current
system performs well in detecting fake news using multimodal data, several promising
directions remain for future enhancement, including real-time deployment at scale and
improved interpretability of the model's decisions.
REFERENCES
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," CVPR, 2016.
[5] S. Tahmasebi et al., "Multimodal Misinformation Detection," ACM CIKM, 2024.