A bot comment detection system using KcELECTRA and machine learning (currently supports YouTube only). Imported from https://github.com/MisileLab/h3/commits/main/projects/dsb/vivian
noMoreSpam is a tool designed to identify and filter bot comments on YouTube videos. It uses embeddings and machine learning techniques to classify comments as either bot-generated or human-written.
Features:
- YouTube comment collection and processing
- Bot comment classification using transformer-based models
- Interactive UI for model training and evaluation using Marimo notebooks
- Support for Korean text via KcELECTRA model
- Data visualization tools for model performance analysis
Requirements:
- Python 3.13.4 or higher
- YouTube API key for comment collection
- PyTorch (CPU or ROCm version available)
- OpenAI API key (for LLM-based classification)
Installation:
- Clone the repository:
      git clone https://github.com/misilelab/noMoreSpam
      cd noMoreSpam
- Set up the environment:
      uv sync
- Set up your API keys as environment variables:
      export YOUTUBE_API_KEY=your_api_key_here
      export OPENAI_KEY=your_openai_api_key_here
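Before running the scripts, it can help to confirm that both variables are visible to Python. The following is a minimal sketch, not part of the repository: the helper name `missing_keys` is hypothetical, and it assumes the scripts read `YOUTUBE_API_KEY` and `OPENAI_KEY` from the environment exactly as exported above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_KEYS = ("YOUTUBE_API_KEY", "OPENAI_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; the project may check keys differently.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```

If only the ML-based classifier is used, the OpenAI key may not be needed; the requirements list it for LLM-based classification.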
Usage:
- Collect YouTube videos:
      python data/get_videos.py
- Collect comments from videos:
      python main.py data/get_comments.py
- Run the ML-based classification model:
      python classify.py
- Run the LLM-based classification model:
      python classify_llm.py
- Evaluate comments interactively:
      python evaluate.py
- Train the model with custom data:
      python train.py
- Split data into training and test sets:
      python data/train_test_split.py
- Merge processed data:
      python merge.py
- Clear temporary data:
      python clear.py
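To illustrate the train/test split step listed above, here is a minimal sketch of a stratified split in plain Python. It is an assumption about how such a split could work, not the logic of `data/train_test_split.py`: the function name, the `is_bot` label field, the 20% test ratio, and the seed are all hypothetical. Splitting each class separately keeps the share of bot comments similar in both sets, which matters when bots are a small minority.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="is_bot", test_ratio=0.2, seed=42):
    """Split rows into (train, test), keeping each label's share roughly equal.

    `label_key`, `test_ratio`, and `seed` are illustrative defaults only.
    """
    rng = random.Random(seed)

    # Group rows by label so each class is split on its own.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_ratio)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])

    # Shuffle again so the classes are interleaved in the output.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


if __name__ == "__main__":
    # Synthetic example data: 10 bot comments out of 100.
    comments = [{"text": f"comment {i}", "is_bot": i % 10 == 0} for i in range(100)]
    train_rows, test_rows = stratified_split(comments)
    print(len(train_rows), len(test_rows))
```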
Project structure:
- classify.py: Runs the ML-based bot comment classification model
- classify_llm.py: Runs the LLM-based bot comment classification using OpenAI models
- clear.py: Clears temporary OpenAI files
- data/: Directory containing data processing scripts
  - get_comments.py: Collects comments from YouTube videos
  - get_videos.py: Collects YouTube video information
  - train_test_split.py: Splits data into training and test sets
- evaluate.py: Interactive tool for evaluating comments with the trained model
- main.py: Entry point for running modules
- merge.py: Merges processed embedding data
- train.py: Trains the bot detection model
- utils.py: Utility functions and data models
The bot detection system uses a SpamUserClassifier based on the KcELECTRA model with:
- Frozen initial transformer layers
- Custom classification layers with dropout for regularization
- Focal Loss to handle class imbalance
- Combined CLS token and mean pooling for improved performance
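Two of the components listed above can be sketched in a few lines. This is an illustration in plain Python rather than the project's PyTorch code, and every name and value in it is an assumption: the function names are hypothetical, the loss is written for the binary case, and `gamma=2.0` and `alpha=0.25` are commonly used focal loss defaults, not values taken from this repository.

```python
import math


def focal_loss(prob_bot, is_bot, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one comment: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p_t is the probability the model assigns to the true class, so confident
    correct predictions are down-weighted and hard examples dominate the loss.
    """
    p = min(max(prob_bot, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if is_bot else 1.0 - p
    alpha_t = alpha if is_bot else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cls_plus_mean(token_vectors, attention_mask):
    """Concatenate the first ([CLS]) vector with the mean of unmasked token vectors.

    Assumes the [CLS] token is at position 0 and is itself unmasked.
    """
    cls_vector = token_vectors[0]
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(cls_vector)
    mean_vector = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    return list(cls_vector) + mean_vector


if __name__ == "__main__":
    # An easy example (bot predicted at 0.9) costs far less than a hard one (0.1).
    print(focal_loss(0.9, True), focal_loss(0.1, True))
    # Two real tokens plus one padding token, hidden size 2.
    print(cls_plus_mean([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
```

The pooled vector is twice the hidden size, so a classification head built on it would need an input dimension of `2 * hidden_size`.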
This project is licensed under the MIT License - see the LICENSE file for details.