Fake News Detection in Tamil Social Media using Multilingual
Transformers
Abstract
The proliferation of fake news on social media platforms presents a significant
societal challenge, particularly in regional languages like Tamil where resources
for automated detection are scarce. The prevalent use of code-mixed
"Tanglish" (Tamil-English) text further complicates automated detection for
existing systems. This project aims to develop a robust system to identify fake
news in Tamil social media content. We will curate a novel, labelled dataset for
this task, addressing the scarcity of high-quality data. The core of this work
involves the implementation and rigorous benchmarking of advanced
multilingual transformer models, specifically comparing the performance of the
general-purpose mBERT against the Indic language-specialized IndicBERT. A key
innovation of this project is the integration of Explainable AI (XAI) techniques,
namely LIME and SHAP, to address the "black box" nature of deep learning models
and provide transparent, interpretable results, a critical research gap in this
domain. The
expected outcome is a high-accuracy, trustworthy, and context-aware model that
serves as a new benchmark for fake news detection in the Tamil language.
1. Problem Statement
The spread of intentionally fabricated information on social media has profound
consequences, ranging from influencing public opinion to eroding trust in
institutions. While fake news detection is a well-researched area for English,
low-resource languages like Tamil have been largely overlooked. This
project addresses four key challenges:
1. The Code-Mixing Challenge: Social media users in the Tamil-speaking
world frequently use a combination of Tamil and English words, often
written in Roman script ("Tanglish"). This linguistic complexity makes it
difficult for traditional, monolingual NLP models to comprehend the text's
semantics and context accurately.
2. Scarcity of Labeled Data: There is a significant lack of large-scale,
publicly available, and properly annotated datasets for Tamil fake news
detection. This data scarcity is a major bottleneck for training robust
deep learning models.
3. Model Generalizability: General multilingual models like mBERT are
trained on over 100 languages simultaneously. This can dilute their
effectiveness for a specific language like Tamil, because the vocabulary and
parameters are shared across all of those languages and because pre-training
draws on formal text such as Wikipedia rather than informal social media
content.
4. Lack of Transparency: Advanced models like transformers are often
"black boxes," making it difficult to understand their decision-making
process. For a sensitive task like flagging content as "fake," this lack
of interpretability is a major barrier to user trust and practical
deployment.
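To make the code-mixing challenge concrete, the following is a minimal sketch of how a post could be profiled by script before any modelling. It counts characters in the Unicode Tamil block (U+0B80 to U+0BFF) against ASCII letters. The function names and the 0.2 threshold are illustrative assumptions, not part of the proposed system. Script alone cannot separate romanized Tamil from plain English, so this is only a first-pass filter.

```python
# Minimal sketch: profile a post by script (Tamil block vs. ASCII letters).
# Function names and the threshold value are illustrative assumptions.

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF  # Unicode Tamil block


def script_profile(text: str) -> dict:
    """Count Tamil-block characters, ASCII letters, and other letters."""
    counts = {"tamil": 0, "latin": 0, "other": 0}
    for ch in text:
        if TAMIL_START <= ord(ch) <= TAMIL_END:
            # Includes Tamil vowel signs, which str.isalpha() would reject.
            counts["tamil"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif ch.isalpha():
            counts["other"] += 1
    return counts


def label_script(text: str, threshold: float = 0.2) -> str:
    """Label a post as tamil_script, roman_script, mixed_script, or unknown."""
    counts = script_profile(text)
    total = counts["tamil"] + counts["latin"]
    if total == 0:
        return "unknown"
    tamil_share = counts["tamil"] / total
    if tamil_share >= 1 - threshold:
        return "tamil_script"
    if tamil_share <= threshold:
        # Could be English or romanized Tamil; script alone cannot tell.
        return "roman_script"
    return "mixed_script"


if __name__ == "__main__":
    for post in ["தமிழ்", "தமிழ் news", "hello world"]:
        print(post, "->", label_script(post))
```

A profile like this could help report what share of the curated dataset is in Tamil script, Roman script, or a mix of both, which in turn indicates how much romanized text the models must handle.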
2. Existing System & Methods
The field of fake news detection has evolved through several stages, with each
approach having distinct capabilities and limitations.
Traditional Machine Learning: Early systems relied on algorithms like
Support Vector Machines (SVM), Naïve Bayes, and Random Forest
combined with feature extraction techniques like TF-IDF. While
interpretable, these methods are largely unable to capture the deep
contextual and semantic meaning of text, performing poorly on nuanced
and code-mixed content.
Early Deep Learning Models: The introduction of Recurrent Neural
Networks (RNNs), LSTMs, and Convolutional Neural Networks (CNNs)
marked an improvement. These models could learn from
sequential data and extract more complex patterns than traditional ML.
However, they can still struggle with the long-range dependencies and
linguistic complexities found in social media text.
State-of-the-Art Transformers: Transformer-based models like BERT
revolutionized NLP. The availability of mBERT (multilingual BERT) provided
a powerful tool for many languages. However, as shown in research on other
low-resource languages like Bangla, mBERT's performance can be limited due
to its mixed-language vocabulary and training on structured data, which is
different from informal social media posts.
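The traditional approach described at the start of this section (TF-IDF features fed to an SVM) can be sketched in a few lines with scikit-learn. This is a minimal illustration under stated assumptions, not a tuned baseline. The four training texts are invented placeholders rather than real data, and the character n-gram setting is one possible choice, made on the assumption that it tolerates the inconsistent spellings common in romanized Tamil.

```python
# Minimal sketch of a TF-IDF + linear SVM text classifier with scikit-learn.
# The example texts are invented placeholders, not real posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "shocking news share pannunga forward to all groups",
    "miracle cure found doctors hate this forward now",
    "chennai la mazhai district collector announces school holiday",
    "state budget presented in assembly today",
]
train_labels = [1, 1, 0, 0]  # 1 = fake, 0 = real (toy labels)

baseline = make_pipeline(
    # Character n-grams within word boundaries (an assumption, see lead-in).
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LinearSVC(),
)
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(
    ["forward this shocking news", "assembly session held today"]
)
print(predictions)
```

With only four toy examples the predictions carry no meaning; the point is the shape of the pipeline, which stays the same once a real labelled dataset is substituted.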
3. Proposed Solution
This project proposes a systematic, multi-stage approach to build and validate a
high-performance, interpretable model for Tamil fake news detection.
1. Dataset Curation: A new, high-quality Tamil fake news dataset will be
created.
o Fake News Collection: We will gather articles and social media
posts that have been debunked by credible Tamil fact-checking
organizations.
o Real News Collection: We will gather posts from the official social
media accounts of reputable Tamil news outlets.
2. Model Benchmarking: We will implement and rigorously compare three
levels of models to demonstrate a clear performance improvement.
o Baseline Model: An SVM classifier with TF-IDF features will be
implemented as a baseline to represent traditional methods.
o Generalist Transformer: The mBERT model will be fine-tuned on
our dataset to test the performance of a general multilingual
approach.
o Specialist Transformer: The IndicBERT model, which is pre-trained
specifically on 12 Indian languages including Tamil, will be
fine-tuned. We hypothesize that IndicBERT will outperform mBERT
due to its specialized training, a hypothesis consistent with parallel
research such as the Bangla-BERT paper.
3. Explainable AI (XAI) Integration: To address the "black box" problem,
we will integrate XAI techniques on the best-performing model from the
benchmarking phase.
o LIME: Will be used to identify which words in a specific tweet
contributed most to its "fake" or "real" classification.
o SHAP: Will be used to provide a more comprehensive
understanding of global feature importance, revealing which words
are generally the strongest indicators of fake news across the
dataset.
4. Evaluation: All models will be evaluated using a standardized set of
metrics: Accuracy, Precision, Recall, and F1-Score. The F1-Score will be the
primary metric for comparison because it balances precision and recall, which
makes it more informative than accuracy when the classes are imbalanced.
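The four evaluation metrics can be computed directly from the confusion counts. The sketch below is a plain-Python illustration; the function name and the toy label vectors are assumptions chosen for the example. The labels are constructed so that accuracy looks acceptable while F1 exposes the missed fake items.

```python
# Minimal sketch: accuracy, precision, recall, and F1 for binary labels.
# Function name and toy labels are illustrative assumptions.
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute the four metrics, treating `positive` as the 'fake' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced case: 3 fake items out of 10, only one of them detected.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.8, precision 1.0, recall about 0.33, f1 0.5
```

In this toy case the classifier misses two of three fake items, yet accuracy is still 0.8. F1 drops to 0.5, which is why it is the more informative headline metric here.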
4. System Architecture
The project will follow a structured pipeline:
1. Data Collection & Preprocessing: Gather real and fake news from Tamil
sources and clean the text data.
2. Benchmarking:
o Train and evaluate the SVM (Baseline) model using TF-IDF vectors.
o Fine-tune and evaluate the mBERT model.
o Fine-tune and evaluate the IndicBERT model.
3. Model Selection: The model with the highest F1-Score on the test set will
be selected.
4. XAI Integration: Apply LIME and SHAP to the selected model to generate
explanations.
5. Final Output: The system will output both a classification (Real/Fake) and
a visual explanation for the decision.
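To illustrate what the explanation step produces, the sketch below uses leave-one-word-out occlusion, a deliberately simpler technique than LIME or SHAP. It removes each word in turn and records how much the "fake" probability changes. LIME instead fits a local surrogate model over many random perturbations, and SHAP estimates Shapley values, so this is a stand-in for the idea rather than either method. The `toy_fake_prob` scorer is a hypothetical placeholder for the selected fine-tuned model.

```python
# Minimal sketch: leave-one-word-out occlusion as a stand-in for
# perturbation-based explanation. This is NOT LIME or SHAP.
from typing import Callable, List, Tuple


def occlusion_importance(
    text: str, fake_prob: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Rank words by how much removing each one changes the fake probability."""
    tokens = text.split()
    base = fake_prob(text)
    scores = []
    for i, token in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        # Positive score: the word pushed the prediction towards "fake".
        scores.append((token, base - fake_prob(reduced)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


def toy_fake_prob(text: str) -> float:
    """Hypothetical scorer standing in for the selected model's output."""
    cue_weights = {"shocking": 0.3, "forward": 0.2}  # invented weights
    words = text.lower().split()
    score = 0.1 + sum(w for cue, w in cue_weights.items() if cue in words)
    return min(score, 1.0)


ranked = occlusion_importance("shocking news forward now", toy_fake_prob)
for word, delta in ranked:
    print(f"{word:10s} {delta:+.2f}")
```

A ranked list like this is the raw material for the visual explanation in the final output, for example by highlighting the words with the largest positive scores.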
5. Software and Hardware Requirements
Software Requirements:
o Programming Language: Python
o Core Libraries:
Pandas, NumPy: For data manipulation and processing.
Scikit-learn: For implementing the baseline SVM model.
PyTorch or TensorFlow: The deep learning framework for
running transformers.
Hugging Face Transformers: To access pre-trained mBERT and
IndicBERT models.
LIME, SHAP: For implementing the XAI component.
o Development Environment: Jupyter Notebooks or Google
Colaboratory.
Hardware Requirements:
o CPU/RAM: A standard computer with a minimum of 4 GB of RAM for
initial data handling.
o GPU: Fine-tuning transformer models is computationally expensive
and requires a GPU. Google Colab's free tier typically offers NVIDIA
T4 GPUs, subject to availability, which should be sufficient for this
project. Training on a CPU-only machine is impractical.
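As a rough sketch of how these requirements fit together, the snippet below checks for a GPU and loads a pre-trained checkpoint with a two-class classification head through Hugging Face Transformers. The checkpoint identifiers `bert-base-multilingual-cased` and `ai4bharat/indic-bert` are assumptions about which published models would be used; the proposal does not name them. Library imports are placed inside the functions so that the file can be read and imported on a machine where PyTorch or Transformers is not yet installed.

```python
# Minimal sketch: pick a device and load a checkpoint for fine-tuning.
# Checkpoint identifiers are assumptions; the proposal does not specify them.

CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",
    "IndicBERT": "ai4bharat/indic-bert",  # tokenizer needs `sentencepiece`
}


def pick_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_classifier(name: str, num_labels: int = 2):
    """Load tokenizer and model with a fresh classification head.

    Downloads weights on first use, so it needs network access.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model.to(pick_device())


if __name__ == "__main__":
    print("Device:", pick_device())
    # tokenizer, model = load_classifier("mBERT")  # uncomment to download
```

On Colab, `pick_device()` should report `cuda` once a GPU runtime is selected; if it reports `cpu`, fine-tuning will still run but far more slowly.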
6. Conclusion
This project will address a critical gap in the fight against misinformation by
developing a robust and transparent fake news detection system for the Tamil
language. By creating a new dataset, benchmarking state-of-the-art models, and
integrating explainability, this work will provide a significant contribution to NLP
research for low-resource Indian languages and offer a practical tool for
promoting a healthier information ecosystem.
7. References
1. Koru, G. K., & Uluyol, Ç. (2024). Detection of Turkish Fake News from
Tweets with BERT Models. IEEE Access.
2. Sunitha, D., et al. (2022). Fake News Detection. International Research
Journal of Modernization in Engineering Technology and Science.
3. Alnabhan, M. Q., & Branco, P. (2024). Fake News Detection Using Deep
Learning: A Systematic Literature Review. IEEE Access.
4. Bashaddadh, O., et al. (2025). Machine Learning and Deep Learning
Approaches for Fake News Detection: A Systematic Review... IEEE Access.
5. Sharif, O., Hossain, E., & Hoque, M. M. (2021). NLP-
CUET@DravidianLangTech-EACL2021: Offensive Language Detection from
Multilingual Code-Mixed Text using Transformers. arXiv:2103.00455.
6. Kowsher, M., et al. (2022). Bangla-BERT: Transformer-Based Efficient Model
for Transfer Learning and Language Understanding. IEEE Access.
7. Kannan, R. R., Rajalakshmi, R., & Kumar, L. (2021). IndicBERT based
approach for Sentiment Analysis on Code-Mixed Tamil Tweets. CEUR
Workshop Proceedings.
8. Rahman, M. M., et al. (2025). MSM_CUET@DravidianLangTech 2025: XLM-
BERT and MuRIL Based Transformer Models for Detection of Abusive Tamil
and Malayalam Text... Proceedings of the Fifth Workshop on Speech,
Vision, and Language Technologies for Dravidian Languages.