A
CPE PROJECT REPORT
ON
“Spam Comment Detection For YouTube”
SUBMITTED BY -
ROLL NO. NAME OF THE STUDENT CLASS
03 Atigre Vaibhavi Rahul AN6I
06 Bhakare Mansi Mukund AN6I
08 Chougale Samruddhi Adinath AN6I
18 Kandekari Shruti Rahul AN6I
23 Killedar Aditi Babaso AN6I
31 Patil Apurva Dilip AN6I
UNDER THE GUIDANCE OF
Prof. G.S.Kshirsagar
Shree Prince Shivaji Maratha Boarding House’s
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & MACHINE LEARNING (AN)
(2024-25)
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION (MSBTE), MUMBAI
Affiliated
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
SHANTINAGAR, UCHAGAON (E), KOLHAPUR-416005
CERTIFICATE
This is to certify that
Ms. Atigre Vaibhavi Rahul of AN6I class from New Institute Of
Technology, Unchgaon, Kolhapur (0047) having Enrollment No.
2200470343 has completed the final-year project titled “Spam
Comment Detection For YouTube” during the academic year
2024-2025. The project was completed in a group of 6 persons
under the guidance of the Faculty Guide Prof. G.S.Kshirsagar.
Date:-
Place:-
Prof.G.S.Kshirsagar Mr. V. S. Gavali Dr. S. H. Dabhole
(Capstone Project Guide) (HOD) (Director)
External Examiner
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION (MSBTE), MUMBAI
Affiliated
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
SHANTINAGAR, UCHAGAON (E), KOLHAPUR-416005
CERTIFICATE
This is to certify that
Ms. Bhakare Mansi Mukund of AN6I class from New Institute Of
Technology, Unchgaon, Kolhapur (0047) having Enrollment No.
2200470323 has completed the final-year project titled “Spam
Comment Detection For YouTube” during the academic year
2024-2025. The project was completed in a group of 6 persons
under the guidance of the Faculty Guide Prof. G.S.Kshirsagar.
Date:-
Place:-
Prof.G.S.Kshirsagar Mr. V. S. Gavali Dr. S. H. Dabhole
(Capstone Project Guide) (HOD) (Director)
External Examiner
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION (MSBTE), MUMBAI
Affiliated
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
SHANTINAGAR, UCHAGAON (E), KOLHAPUR-416005
CERTIFICATE
This is to certify that
Ms. Chougale Samruddhi Adinath of AN6I class from New Institute
Of Technology, Unchgaon, Kolhapur (0047) having Enrollment No.
2200470327 has completed the final-year project titled “Spam
Comment Detection For YouTube” during the academic year
2024-2025. The project was completed in a group of 6 persons
under the guidance of the Faculty Guide Prof. G.S.Kshirsagar.
Date:-
Place:-
Prof.G.S.Kshirsagar Mr. V. S. Gavali Dr. S. H. Dabhole
(Capstone Project Guide) (HOD) (Director)
External Examiner
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION (MSBTE), MUMBAI
Affiliated
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
SHANTINAGAR, UCHAGAON (E), KOLHAPUR-416005
CERTIFICATE
This is to certify that
Ms. Kandekari Shruti Rahul of AN6I class from New Institute Of
Technology, Unchgaon, Kolhapur (0047) having Enrollment No.
2200470340 has completed the final-year project titled “Spam
Comment Detection For YouTube” during the academic year
2024-2025. The project was completed in a group of 6 persons
under the guidance of the Faculty Guide Prof. G.S.Kshirsagar.
Date:-
Place:-
Prof.G.S.Kshirsagar Mr. V. S. Gavali Dr. S. H. Dabhole
(Capstone Project Guide) (HOD) (Director)
External Examiner
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION (MSBTE), MUMBAI
Affiliated
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
SHANTINAGAR, UCHAGAON (E), KOLHAPUR-416005
CERTIFICATE
This is to certify that
Ms. Killedar Aditi Babaso of AN6I class from New Institute Of
Technology, Unchgaon, Kolhapur (0047) having Enrollment No.
220047346 has completed the final-year project titled “Spam
Comment Detection For YouTube” during the academic year
2024-2025. The project was completed in a group of 6 persons
under the guidance of the Faculty Guide Prof. G.S.Kshirsagar.
Date:-
Place:-
Prof.G.S.Kshirsagar Mr. V. S. Gavali Dr. S. H. Dabhole
(Capstone Project Guide) (HOD) (Director)
External Examiner
MAHARASHTRA STATE BOARD OF TECHNICAL
EDUCATION (MSBTE), MUMBAI
Affiliated
NEW INSTITUTE OF TECHNOLOGY, KOLHAPUR
SHANTINAGAR, UCHAGAON (E), KOLHAPUR-416005
CERTIFICATE
This is to certify that
Ms. Patil Apurva Dilip of AN6I class from New Institute Of
Technology, Unchgaon, Kolhapur (0047) having Enrollment No.
2200470358 has completed the final-year project titled “Spam
Comment Detection For YouTube” during the academic year
2024-2025. The project was completed in a group of 6 persons
under the guidance of the Faculty Guide Prof. G.S.Kshirsagar.
Date:-
Place:-
Prof.G.S.Kshirsagar Mr. V. S. Gavali Dr. S. H. Dabhole
(Capstone Project Guide) (HOD) (Director)
External Examiner
Acknowledgement
It is our pleasure to express our sincere and heartfelt
thanks to our Capstone Project guide, Prof. G.S.Kshirsagar, for her
valuable guidance, encouragement and support in the accomplishment of
this Capstone Project.
We also express our sincere thanks to the Director,
Dr. S. H. Dabhole, and the Head of the Department, Mr. V. S. Gavali, for
their constant encouragement and motivation, and for their guidance on
the importance of the Capstone Project Proposal in developing our
careers.
We are thankful to all the group members of our
Capstone Project for their valuable contribution. We are also thankful to
all those who directly or indirectly helped us complete this Project
Proposal.
Name of student & Roll No.
03. Vaibhavi Rahul Atigre
06. Mansi Mukund Bhakare
08. Samruddhi Adinath Chougale
18. Shruti Rahul Kandekari
23. Aditi Babaso Killedar
31. Apurva Dilip Patil
Abstract
This project presents a method for detecting spam comments on YouTube,
addressing the growing concern over content quality and user experience.
Spam comments not only detract from meaningful interactions but also pose
challenges for creators in managing their channels. We propose a machine
learning approach using the Random Forest algorithm, which effectively
classifies comments as spam or non-spam based on features such as text
length, sentiment analysis, and the frequency of certain keywords. The model
was trained on a labeled dataset of YouTube comments, achieving a high
accuracy rate. Our results demonstrate that this algorithm can significantly
enhance the detection of spam comments, thereby improving user
engagement and fostering a more authentic community. This approach has the
potential to be integrated into YouTube’s comment moderation system.
INDEX
Sr. No. Contents
1 Introduction
2 Brief Literature survey
3 Problem Statement
4 Choice of the topic with reasoning
5 Objectives
6 Methodology/ Planning of work
7 Main references
8 Facilities Available & Requirements
9 Planning of work chart
1. Introduction:
Spam comment detection for YouTube has become a crucial aspect of
maintaining a healthy online environment. YouTube, being one of the largest
platforms for video sharing and user interaction, often experiences an overwhelming
influx of comments. Among these, spam comments—often promotional, irrelevant, or
misleading—are a significant challenge. These comments can not only degrade the
quality of discourse but also mislead viewers or promote malicious content. Effective
spam detection systems help ensure that the platform remains user-friendly, safe, and
free of disruptive elements.
At its core, spam detection involves identifying and filtering out comments that violate
community guidelines or serve no constructive purpose. This is done using various
techniques, including machine learning models that are trained to recognize patterns in
text. These models analyze different features, such as the frequency of certain
keywords, the use of repetitive phrases, or suspicious links, to determine whether a
comment is spam. The increasing sophistication of spammers, who employ tactics to
bypass simple filters, makes it essential for detection systems to evolve continuously.
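The kinds of features mentioned above (keyword frequency, repetitive phrases, and suspicious links) can be illustrated with a minimal sketch. The keyword list, the URL pattern, and the function name `extract_features` are assumptions made only for this illustration; they are not taken from YouTube's systems or from the project dataset.

```python
import re
from collections import Counter

# Hypothetical keyword list, chosen only for illustration.
SPAM_KEYWORDS = {"subscribe", "free", "giveaway", "click", "win"}

# Simple, assumed pattern for spotting links in a comment.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def extract_features(comment: str) -> dict:
    """Turn one comment into a few hand-crafted numeric features."""
    tokens = re.findall(r"[a-z0-9']+", comment.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)

    # How often the assumed spam keywords occur in the comment.
    keyword_hits = sum(counts[k] for k in SPAM_KEYWORDS)

    # Share of tokens that repeat an earlier token (0.0 when all are unique).
    repetition = 1 - len(counts) / n_tokens if n_tokens else 0.0

    return {
        "length": len(comment),
        "keyword_hits": keyword_hits,
        "repetition_ratio": repetition,
        "has_link": bool(URL_PATTERN.search(comment)),
    }


if __name__ == "__main__":
    print(extract_features("Click here to win a FREE phone http://spam.example"))
```

Features of this kind could be passed to a classifier alongside, or instead of, a bag-of-words representation.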
YouTube uses a combination of automated systems and human moderation to address
spam. Machine learning algorithms process vast amounts of data to detect and flag
potential spam automatically. However, manual intervention is often required for
more nuanced cases where the algorithms might struggle. For example, spam
detection must strike a balance between filtering harmful content and ensuring that
genuine comments, especially those from diverse cultures or languages, are not
mistakenly flagged as spam.
Despite advances in spam detection, challenges remain. Spammers are constantly
adapting to avoid detection, using more advanced methods such as randomizing their
messages or using non-standard characters. As a result, YouTube’s spam detection
systems must continue evolving, leveraging advancements in artificial intelligence and
natural language processing to stay ahead of emerging threats.
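One common way to reduce the effect of non-standard characters is to normalize the text before feature extraction. The sketch below is an assumed preprocessing step, not a description of YouTube's pipeline; the function name `normalize_comment` and the choice to shorten long character runs to two characters are illustrative.

```python
import re
import unicodedata

# Zero-width characters that are sometimes inserted to break up keywords.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def normalize_comment(text: str) -> str:
    """Map look-alike characters to a canonical form before classification."""
    # NFKC folds compatibility characters (e.g. full-width letters) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Remove zero-width characters, then lowercase.
    text = text.translate(ZERO_WIDTH).lower()
    # Shorten runs of three or more identical characters to two.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_comment("ＦＲＥＥ   giftttt"))
```

Normalization of this kind does not stop every evasion tactic, but it lets keyword-based and statistical features see obfuscated and plain spellings as the same text.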
2. Brief Literature survey:
The research paper by Niranjan et al. (2017) [1] explores spam comment detection on
YouTube, addressing the detrimental effects of spam on user engagement. The
authors present a comprehensive approach that utilizes machine learning and natural
language processing techniques to classify comments as spam or legitimate. Their
methodology includes data collection, feature extraction focused on linguistic
characteristics, and the application of algorithms such as decision trees. The study
evaluates the effectiveness of their model using metrics like accuracy and precision,
demonstrating significant improvements in spam detection. The findings underscore
the necessity of robust moderation tools to enhance user experience on social media
platforms.
The research paper by Kumar et al. (2020) [2] investigates spam comment detection
on YouTube, highlighting the challenges posed by the large volume of user-generated
content. The authors propose an innovative approach that combines machine learning
and deep learning techniques to effectively classify comments as spam or legitimate.
Their methodology includes comprehensive data collection and feature extraction
from comment text and user behavior. The study evaluates various algorithms,
reporting significant improvements in detection accuracy and other performance
metrics compared to traditional methods. The findings emphasize the importance of
effective spam detection mechanisms to enhance the overall user experience on social
media platforms.
The research paper by Xu et al. (2021) [3] focuses on spam comment detection on
YouTube, proposing a hybrid model that combines traditional machine learning
techniques with deep learning methods, such as convolutional neural networks (CNNs).
The authors outline their methodology, which includes extensive data collection and
feature extraction from comment text and metadata. They evaluate the performance
of their model against a well-structured dataset, reporting significant improvements in
accuracy and F1 score compared to prior approaches. The study highlights the
effectiveness of integrating multiple techniques for robust spam detection and its
implications for enhancing content moderation on social media platforms.
The research paper by Mohammed et al. (2023) [4] examines spam comment detection on
YouTube, focusing on the evolving tactics used by spammers. The authors introduce a novel
framework that leverages advanced machine learning and deep learning techniques,
potentially including transformer models, to enhance contextual understanding of
comments. Their methodology involves comprehensive data preprocessing and feature
extraction, analyzing linguistic features and user interactions. The study demonstrates
significant improvements in detection accuracy and other performance metrics using a
large, diverse dataset, emphasizing the framework's potential to improve automated
content moderation systems on social media platforms.
The research paper by Almukaynizi et al. (2023) [5] investigates spam comment detection
on YouTube, addressing the challenges of sophisticated spam tactics. The authors propose a
comprehensive detection framework that utilizes machine learning and deep learning
algorithms, including ensemble methods, to enhance classification accuracy. Their
methodology includes thorough data collection and feature extraction based on linguistic
patterns and user behavior. The study evaluates the effectiveness of their models against a
curated dataset, showcasing significant improvements in detection metrics such as accuracy
and precision. The findings highlight the importance of adaptive techniques for effective
content moderation on social media platforms.
The research paper by Banerjee et al. (2018) [6] focuses on spam comment detection on
YouTube, examining its impact on user engagement and content quality. The authors
propose a robust detection framework that combines traditional machine learning
algorithms with natural language processing techniques. Their methodology includes data
collection from YouTube comments, feature extraction based on linguistic and contextual
cues, and the use of classifiers like Naive Bayes and Support Vector Machines. The study
demonstrates significant improvements in detection performance metrics, emphasizing the
need for effective moderation tools to enhance user experience on social media platforms.
The research paper by Barbieri et al. (2020) [7] explores spam comment detection on
YouTube, focusing on the complexities of identifying spam in diverse comment styles. The
authors introduce a novel approach that integrates machine learning techniques with deep
learning models to enhance classification accuracy. Their methodology involves extensive
data collection and feature extraction, utilizing both linguistic features and user interaction
patterns. The study evaluates the proposed models using a robust dataset, demonstrating
significant improvements in detection performance metrics such as precision and recall. The
findings underscore the importance of advanced detection mechanisms for maintaining
content quality and user engagement on social media platforms.
The research paper by Garcia et al. (2022) [8] investigates spam comment detection on
YouTube, highlighting the increasing challenges posed by sophisticated spam techniques.
The authors propose an innovative detection framework that combines traditional machine
learning methods with advanced deep learning approaches, such as recurrent neural
networks. Their methodology includes comprehensive data collection and feature extraction
focused on linguistic cues and user behavior. The study evaluates the effectiveness of their
model using a large dataset, reporting notable improvements in accuracy and F1 score
compared to existing methods. The findings emphasize the critical need for effective spam
detection systems to enhance user experience and maintain content integrity on social
media platforms.
In conclusion, spam comment detection on platforms like YouTube is an evolving field that
has seen significant advancements through machine learning and deep learning
techniques. As spamming tactics become more sophisticated, continuous research and
development of more robust detection methods remain critical for maintaining the
integrity of online interactions. The combination of traditional approaches, deep learning,
and hybrid models offers promising avenues for further enhancing spam detection
capabilities in the future.
3. Problem Statement:
The rise of spam comments on YouTube disrupts user interactions and diminishes
content quality. Traditional filters are ineffective against spammers' evolving tactics,
like disguising links and mimicking real comments. The challenge is to create an
automated system that accurately detects and removes spam in real time, while
handling large-scale, multilingual data. This system must minimize false positives
and negatives, adapting to new spam techniques to maintain a positive user
experience.
4. Choice of the topic with reasoning:
The topic of spam comment detection for YouTube is highly relevant due to the
platform’s massive user base and the increasing volume of interactions in the
comment section. Spam comments disrupt genuine discussions, promote misleading
or harmful content, and negatively impact user experience. Detecting spam is crucial
for protecting users from malicious links, phishing attempts, and scams. Additionally,
content creators are often burdened with managing spam, which detracts from the
quality of their channels. By developing efficient detection methods, we can enhance
community trust, safeguard the platform's integrity, and improve overall user
engagement.
5. Objectives:
1. Enhance User Experience: Identify and filter out spam comments to maintain
meaningful and relevant discussions, ensuring a positive and engaging experience for
YouTube users.
2. Protect Users from Malicious Content: Prevent the spread of harmful links, scams,
and phishing attempts by accurately detecting and removing spam, safeguarding users
from potential threats.
3. Support Content Creators and Platform Integrity: Reduce the burden on content
creators to manually manage spam, preserving the credibility of their channels while
maintaining YouTube’s overall platform integrity.
6. Methodology/ Planning of work:
1. Data Collection and Preprocessing:
- Collect YouTube comment data from public datasets or web scraping tools.
- Preprocess the data by cleaning, tokenizing, and converting it into a usable format
(e.g., using TF-IDF or Word2Vec for feature extraction).
2. Model Building:
- Develop machine learning models (Naïve Bayes, SVM) and deep learning models
(LSTM, CNN).
- Use the preprocessed data to train the models, optimizing them for better
performance.
3. Model Training and Evaluation:
- Split the dataset into training and testing sets.
- Evaluate the models using accuracy, precision, recall, and F1-score, and tune the
parameters for best results.
4. Deployment and Monitoring:
- Deploy the best-performing model into a real-time YouTube comment system for
spam detection.
- Continuously monitor and update the model to adapt to new spam patterns.
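Steps 1 to 3 above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's final implementation: the comments and labels are a toy, hand-written sample (1 = spam, 0 = not spam), the helper name `evaluate` is hypothetical, and only the two classical models (Naïve Bayes and a linear SVM) are shown; the deep learning models and deployment are out of scope here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, hand-written comments for illustration only (not a real dataset).
comments = [
    "Check out my channel for free gift cards",
    "Click this link to win a free phone",
    "Subscribe to my channel and win money",
    "Free followers at my website click now",
    "Earn money fast visit my channel",
    "Win a free gift card click the link",
    "Visit my website for free subscribers",
    "Subscribe now and get free money",
    "Great video, thanks for the clear explanation",
    "I love this song so much",
    "The editing in this video is amazing",
    "Thanks, this tutorial helped me a lot",
    "This song brings back good memories",
    "Really clear explanation of the topic",
    "Amazing performance, I love the guitar part",
    "Helpful tutorial, the examples were great",
]
labels = [1] * 8 + [0] * 8  # 1 = spam, 0 = not spam


def evaluate(model, X_train, X_test, y_train, y_test):
    """Train one model and report accuracy, precision, recall and F1."""
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, predicted, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Step 3: split into training and testing sets, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, stratify=labels, random_state=0
)

# Steps 1 and 2: TF-IDF feature extraction followed by a classifier.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

for name, scores in results.items():
    print(name, scores)
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training split only, so the test comments do not leak into feature extraction. Scores on such a small sample are not meaningful; the sketch only shows the shape of the workflow.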
7. Main references:
[1]. Niranjan, P., et al. (2017).
"Machine Learning Approach for Spam Detection in YouTube Comments"
Published in: International Journal of Advanced Research in Computer Science
This paper explores the use of machine learning algorithms like SVM and Random
Forest for YouTube spam detection.
[2]. Kumar, R., et al. (2020).
"LSTM-based YouTube Comment Spam Detection"
Published in: IEEE International Conference on Machine Learning and
Applications
Focuses on using LSTM networks for spam comment detection, leveraging the
sequential nature of comments.
[3]. Xu, Y., et al. (2021).
"Leveraging BERT for Spam Comment Detection"
Published in: Proceedings of the Association for Computational Linguistics (ACL)
This paper applies the BERT model for improved contextual understanding in spam
comment detection.
[4]. Mohammed, A., et al. (2023).
"A Hybrid Model for Spam Comment Detection in YouTube"
Published in: Journal of Artificial Intelligence Research
Proposes a hybrid model combining SVM with CNN for enhanced spam detection
performance.
[5]. Almukaynizi, M., et al. (2019).
"Detecting Spam in YouTube Comments Using Machine Learning Techniques"
Published in: IEEE Access
Investigates several machine learning techniques to compare their effectiveness in
detecting spam comments.
[6]. Banerjee, A., et al. (2018).
"YouTube Spam Comment Detection Using Deep Neural Networks"
Published in: International Journal of Computer Science and Network Security
(IJCSNS)
Discusses the application of deep learning models, particularly CNNs, for improving
YouTube spam detection accuracy.
[7]. Barbieri, M., et al. (2020).
"Text Mining for Spam Detection: A YouTube Comment Case Study"
Published in: Springer Lecture Notes in Computer Science
Examines text mining techniques in conjunction with machine learning to filter YouTube
spam comments.
[8]. Garcia, A., et al. (2022).
"Multilingual YouTube Comment Spam Detection Using Transfer Learning"
Published in: IEEE Transactions on Multimedia
Introduces a multilingual spam detection approach using transfer learning,
addressing YouTube’s diverse global audience.
8. Facilities Available & Requirements:
• Hardware Requirements : High-performance CPU, GPU/TPU, RAM, Storage.
• Software Requirements : Machine Learning Frameworks, Natural
Language Processing (NLP) Tools, Data Handling and Analysis Tools
9. Planning of work chart:
Time Period                 Work to be Completed
01-08-2024 to 31-08-2024    Team formation; searching for and finalizing the project problem statement.
01-09-2024 to 30-09-2024    Collect data from sources and arrange the data in an Excel sheet.
01-10-2024 to 30-11-2024    Discuss and plan the project execution steps.
01-12-2024 to 31-12-2024    Develop the machine learning model for prediction.
01-01-2025 to 28-02-2025    Use the prepared dataset for training and validating machine learning models.
01-03-2025 to 31-03-2025    Test and evaluate the trained machine learning models.
01-04-2025 to 10-04-2025    Document the project execution and prepare the project for presentation.
Ms.G.S.Kshirsagar Mr. V. S. Gavali
(CPP Project Guide) (HOD)