Uploaded by

arun cral

Fake News Detection in Tamil Social Media using Multilingual Transformers

Abstract

The proliferation of fake news on social media platforms presents a significant
societal challenge, particularly in regional languages like Tamil where resources
for automated detection are scarce [1]. The prevalent use of code-mixed
"Tanglish" (Tamil-English) text further complicates automated detection for
existing systems [2]. This project aims to develop a robust system to identify fake
news in Tamil social media content. We will curate a novel, labeled dataset for
this task, addressing the scarcity of high-quality data. The core of this work
involves the implementation and rigorous benchmarking of advanced
multilingual transformer models, specifically comparing the performance of the
general-purpose mBERT against the Indic language-specialized IndicBERT. A key
innovation of this project is the integration of Explainable AI (XAI) techniques,
namely LIME and SHAP, to address the "black box" nature of deep learning models
and provide transparent, interpretable results, a critical research gap in this
domain [3]. The expected outcome is a high-accuracy, trustworthy, and
context-aware model that serves as a new benchmark for fake news detection in
the Tamil language.

1. Problem Statement

The spread of intentionally fabricated information on social media has profound
consequences, ranging from influencing public opinion to eroding trust in
institutions [4]. While fake news detection is a well-researched area for English,
low-resource languages like Tamil have been largely overlooked [5]. This
project addresses four key challenges:

1. The Code-Mixing Challenge: Social media users in the Tamil-speaking
world frequently use a combination of Tamil and English words, often
written in Roman script ("Tanglish") [6]. This linguistic complexity makes it
difficult for traditional, monolingual NLP models to comprehend the text's
semantics and context accurately [7].

2. Scarcity of Labeled Data: There is a significant lack of large-scale,
publicly available, and properly annotated datasets for Tamil fake news
detection [8]. This data scarcity is a major bottleneck for training robust
deep learning models.

3. Model Generalizability: General multilingual models like mBERT are
trained on over 100 languages simultaneously [9]. This can dilute their
effectiveness for a specific language like Tamil, because the model's
parameters and vocabulary are shared across all of those languages and the
pre-training text is largely formal (for example, Wikipedia) rather than
informal social media content [10].

4. Lack of Transparency: Advanced models like transformers are often
"black boxes," making it difficult to understand their decision-making
process [11]. For a sensitive task like flagging content as "fake," this lack
of interpretability is a major barrier to user trust and practical
deployment [12].
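
To make the code-mixing challenge concrete, the following is a minimal sketch of how a post could be profiled by script before any model sees it. It counts code points in the Tamil Unicode block (U+0B80 to U+0BFF) against ASCII letters. The function names and the 0.1/0.9 thresholds are illustrative assumptions, not part of the proposed system, and script alone cannot separate English from Romanized Tamil: both come out as "roman_script".

```python
def script_profile(text: str) -> dict:
    """Count Tamil-script code points and ASCII letters in a post."""
    # The Tamil Unicode block spans U+0B80..U+0BFF (letters, vowel signs, digits).
    tamil = sum(1 for ch in text if "\u0b80" <= ch <= "\u0bff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = tamil + latin
    return {
        "tamil": tamil,
        "latin": latin,
        "tamil_ratio": tamil / total if total else 0.0,
    }


def label_script(text: str, low: float = 0.1, high: float = 0.9) -> str:
    """Coarse script label; `low` and `high` are assumed, untuned thresholds."""
    profile = script_profile(text)
    if profile["tamil"] + profile["latin"] == 0:
        return "other"  # no letters at all, e.g. only digits or punctuation
    if profile["tamil_ratio"] >= high:
        return "tamil_script"
    if profile["tamil_ratio"] <= low:
        return "roman_script"  # English or Romanized Tamil ("Tanglish")
    return "mixed_script"


if __name__ == "__main__":
    for post in ["idhu fake news da", "இது fake news"]:
        print(post, "->", label_script(post))
```

A profile like this could help report how much of the curated dataset is Roman-script or mixed-script, which is the portion monolingual models are expected to handle worst.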

2. Existing System & Methods

The field of fake news detection has evolved through several stages, with each
approach having distinct capabilities and limitations.

• Traditional Machine Learning: Early systems relied on algorithms like
Support Vector Machines (SVM), Naïve Bayes, and Random Forest
combined with feature extraction techniques like TF-IDF [13]. While
interpretable, these methods are largely unable to capture the deep
contextual and semantic meaning of text, performing poorly on nuanced
and code-mixed content.

• Early Deep Learning Models: The introduction of Recurrent Neural
Networks (RNNs), LSTMs, and Convolutional Neural Networks (CNNs)
marked an improvement [14]. These models could learn from
sequential data and extract more complex patterns than traditional ML.
However, they can still struggle with the long-range dependencies and
linguistic complexities found in social media text.

• State-of-the-Art Transformers: Transformer-based models like BERT
revolutionized NLP [15]. The availability of mBERT (multilingual BERT)
provided a powerful tool for many languages [16]. However, as shown in
research on other low-resource languages like Bangla, mBERT's performance
can be limited due to its mixed-language vocabulary and training on
structured data, which is different from informal social media posts [17].
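
As an illustration of the TF-IDF features used by the traditional approaches above, here is a minimal pure-Python sketch. It assumes whitespace tokenization and a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1; both are simplifying assumptions, and a real baseline would more likely rely on a library vectorizer. Note that each word is weighted independently of its neighbours, which is the contextual blind spot described above.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return per-document TF-IDF weights and the IDF table.

    Tokenization is a plain lowercase whitespace split, which is a
    simplification for Tamil and code-mixed text.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency (assumed variant).
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in df.items()
    }

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return vectors, idf


if __name__ == "__main__":
    # Made-up toy posts, for illustration only.
    posts = ["breaking news fake", "breaking news real", "fake fake alert"]
    vectors, idf = tfidf(posts)
    print(vectors[2])
```

Terms that occur in fewer documents receive a larger IDF, so rare words dominate the vector regardless of the sentence they appear in.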

3. Proposed Solution

This project proposes a systematic, multi-stage approach to build and validate a
high-performance, interpretable model for Tamil fake news detection.

1. Dataset Curation: A new, high-quality Tamil fake news dataset will be
created.

   o Fake News Collection: We will gather articles and social media
   posts that have been debunked by credible Tamil fact-checking
   organizations.

   o Real News Collection: We will gather posts from the official social
   media accounts of reputable Tamil news outlets.

2. Model Benchmarking: We will implement and rigorously compare three
levels of models to demonstrate a clear performance improvement.

   o Baseline Model: An SVM classifier with TF-IDF features will be
   implemented as a baseline to represent traditional methods [18].

   o Generalist Transformer: The mBERT model will be fine-tuned on
   our dataset to test the performance of a general multilingual
   approach [19].

   o Specialist Transformer: The IndicBERT model, which is pre-trained
   specifically on 12 Indian languages including Tamil, will be
   fine-tuned [20]. We hypothesize that IndicBERT will outperform mBERT
   due to its specialized training, a concept supported by parallel
   research like the Bangla-BERT paper [21].

3. Explainable AI (XAI) Integration: To address the "black box" problem,
we will integrate XAI techniques on the best-performing model from the
benchmarking phase.

   o LIME: Will be used to identify which words in a specific tweet
   contributed most to its "fake" or "real" classification [22].

   o SHAP: Will be used to provide a more comprehensive
   understanding of global feature importance, revealing which words
   are generally the strongest indicators of fake news across the
   dataset [23].

4. Evaluation: All models will be evaluated using a standardized set of
metrics: Accuracy, Precision, Recall, and F1-Score [24]. The F1-Score will
be the primary metric for comparison, as it provides a balanced measure for
potentially imbalanced datasets.
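
The evaluation metrics in step 4 can be sketched as follows. This is a minimal illustration that treats "fake" as the positive class; the function name and the label strings are assumptions made for the example, and the label lists are made-up toy data, not project results.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = correct / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy imbalanced case: 9 real posts, 1 fake, and a model that
    # always answers "real".
    y_true = ["real"] * 9 + ["fake"]
    y_pred = ["real"] * 10
    print(binary_metrics(y_true, y_pred))
```

In the toy case the always-"real" predictor reaches an accuracy of 0.9 while its F1-Score for the "fake" class is 0.0, which is why F1 is preferred as the primary metric when classes may be imbalanced.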

4. System Architecture

The project will follow a structured pipeline:

1. Data Collection & Preprocessing: Gather real and fake news from Tamil
sources and clean the text data.

2. Benchmarking:

o Train and evaluate the SVM (Baseline) model using TF-IDF vectors.

o Fine-tune and evaluate the mBERT model.

o Fine-tune and evaluate the IndicBERT model.

3. Model Selection: The model with the highest F1-Score on the test set will
be selected.

4. XAI Integration: Apply LIME and SHAP to the selected model to generate
explanations.

5. Final Output: The system will output both a classification (Real/Fake) and
a visual explanation for the decision.
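
Steps 3 to 5 of the pipeline can be illustrated with a small sketch. It is not LIME or SHAP: it uses a simpler leave-one-word-out (occlusion) score to show the kind of per-word attribution the final output is meant to carry. All names are hypothetical, and `toy_fake_probability` is a keyword-based stand-in for the fine-tuned model, included only so the example runs on its own.

```python
def select_best_model(f1_scores):
    """Return the name of the model with the highest test-set F1-Score."""
    return max(f1_scores, key=f1_scores.get)


def occlusion_importance(text, predict_fake_prob):
    """Rank words by how much removing each one changes P(fake).

    A simplified stand-in for LIME/SHAP style word attributions.
    """
    tokens = text.split()
    base = predict_fake_prob(text)
    scores = []
    for i, token in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        # Positive score: the word pushed the prediction towards "fake".
        scores.append((token, base - predict_fake_prob(reduced)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


def toy_fake_probability(text):
    """Hypothetical classifier: a few hand-picked cue words, not a real model."""
    tokens = set(text.lower().split())
    cues = {"shocking": 0.3, "forward": 0.2, "urgent": 0.2}
    return min(1.0, 0.1 + sum(w for cue, w in cues.items() if cue in tokens))


if __name__ == "__main__":
    post = "shocking news please forward"
    label = "Fake" if toy_fake_probability(post) >= 0.5 else "Real"
    print("classification:", label)
    for word, score in occlusion_importance(post, toy_fake_probability):
        print(f"{word:>10s}  {score:+.2f}")
```

In the actual system the scoring function would be the selected transformer's predicted probability, and the ranked words would be rendered as the visual explanation alongside the Real/Fake label.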

5. Software and Hardware Requirements

• Software Requirements:

   o Programming Language: Python [25]

   o Core Libraries:

      ▪ Pandas, NumPy: For data manipulation and processing [26].

      ▪ Scikit-learn: For implementing the baseline SVM model.

      ▪ PyTorch or TensorFlow: The deep learning framework for
      running transformers.

      ▪ Hugging Face Transformers: To access pre-trained mBERT and
      IndicBERT models [27].

      ▪ LIME, SHAP: For implementing the XAI component.

   o Development Environment: Jupyter Notebooks or Google
   Colaboratory.

• Hardware Requirements:

   o CPU/RAM: A standard computer with a minimum of 4GB of RAM for
   initial data handling [28].

   o GPU: Fine-tuning transformer models is computationally expensive
   and requires a GPU. Google Colab's free tier typically offers NVIDIA
   T4 GPUs, subject to availability, which should be sufficient for this
   project. Training on a CPU-only machine is impractical.
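
Before running the pipeline, the software requirements above can be checked with a short script. This is a minimal sketch using only the standard library; the mapping from package names to import names is an assumption based on the usual import names of these libraries, and no versions are checked.

```python
from importlib.util import find_spec

# Display name -> top-level import name (assumed standard import names).
CORE_LIBRARIES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "transformers": "transformers",
    "lime": "lime",
    "shap": "shap",
}

# Only one deep learning framework is needed.
FRAMEWORKS = {"PyTorch": "torch", "TensorFlow": "tensorflow"}


def missing_packages(modules):
    """Return the display names whose import name cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(CORE_LIBRARIES)
    print("missing core libraries:", missing or "none")
    if len(missing_packages(FRAMEWORKS)) == len(FRAMEWORKS):
        print("install PyTorch or TensorFlow before fine-tuning")
```

`find_spec` only looks the module up and does not import it, so the check stays fast even when heavy libraries are installed.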

6. Conclusion

This project will address a critical gap in the fight against misinformation by
developing a robust and transparent fake news detection system for the Tamil
language. By creating a new dataset, benchmarking state-of-the-art models, and
integrating explainability, this work will provide a significant contribution to NLP
research for low-resource Indian languages and offer a practical tool for
promoting a healthier information ecosystem.

7. References

1. Koru, G. K., & Uluyol, Ç. (2024). Detection of Turkish Fake News from
Tweets with BERT Models. IEEE Access. (Based on
Detection_of_Turkish_Fake_News_from_Tweets_with_BE.pdf)

2. Sunitha, D., et al. (2022). Fake News Detection. International Research
Journal of Modernization in Engineering Technology and Science. (Based
on fin_irjmets1652279101.pdf)

3. Alnabhan, M. Q., & Branco, P. (2024). Fake News Detection Using Deep
Learning: A Systematic Literature Review. IEEE Access. (Based on
Fake_News_Detection_Using_Deep_Learning_A_Systemat.pdf)

4. Bashaddadh, O., et al. (2025). Machine Learning and Deep Learning
Approaches for Fake News Detection: A Systematic Review... IEEE Access.
(Based on
Machine_Learning_and_Deep_Learning_Approaches_for_Fake_News_Detection...pdf)

5. Sharif, O., Hossain, E., & Hoque, M. M. (2021).
NLP-CUET@DravidianLangTech-EACL2021: Offensive Language Detection from
Multilingual Code-Mixed Text using Transformers. arXiv. (Based on
2103.00455v1.pdf)

6. Kowsher, M., et al. (2022). Bangla-BERT: Transformer-Based Efficient Model
for Transfer Learning and Language Understanding. IEEE Access. (Based
on Bangla-BERT...pdf)

7. Kannan, R. R., Rajalakshmi, R., & Kumar, L. (2021). IndicBERT based
approach for Sentiment Analysis on Code-Mixed Tamil Tweets. CEUR
Workshop Proceedings. (Based on T3-16.pdf)

8. Rahman, M. M., et al. (2025). MSM_CUET@DravidianLangTech 2025: XLM-BERT
and MuRIL Based Transformer Models for Detection of Abusive Tamil
and Malayalam Text... Proceedings of the Fifth Workshop on Speech,
Vision, and Language Technologies for Dravidian Languages. (Based on
2025.dravidianlangtech-1.42.pdf)
