SMS spam filter 2024-25
Chapter 6
Testing
1. Dataset Collection
- Obtain a dataset: Use an existing SMS dataset such as the "SMS Spam Collection Dataset" from UCI or Kaggle.
- Create a dataset: Collect SMS messages and label them as "spam" or "ham" (not spam).
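As a minimal sketch of the collection step, the snippet below parses labelled messages, assuming one message per line in the form label, tab, text (the layout used by the UCI distribution of the dataset). The function name `parse_sms_lines` and the sample messages are hypothetical and serve only as an illustration.

```python
from collections import Counter

def parse_sms_lines(lines):
    """Parse 'label<TAB>message' lines into (label, text) pairs.

    Blank lines and lines without a valid label are skipped.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        label, sep, text = line.partition("\t")
        if sep and label in ("ham", "spam"):
            rows.append((label, text))
    return rows

# Hypothetical sample lines, standing in for the real dataset file.
sample_lines = [
    "ham\tAre we still meeting for lunch today?\n",
    "spam\tWINNER! Claim your free prize now, call 0800 000 000\n",
    "ham\tI'll call you later tonight.\n",
    "\n",
    "malformed line without a tab\n",
]

rows = parse_sms_lines(sample_lines)
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample_lines`. Checking `label_counts` early is useful because spam datasets are usually imbalanced, with far more ham than spam.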
2. Preprocessing
- Text cleaning: Remove unnecessary characters (punctuation, special symbols, etc.).
- Tokenization: Split messages into words or tokens.
- Lowercasing: Convert all text to lowercase for uniformity.
- Stopword removal: Remove common words that don’t add much meaning (e.g., "the",
"and").
- Stemming/Lemmatization: Reduce words to their root form.
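The preprocessing steps above can be sketched as a single function. This is only an illustration: the stopword list is a tiny hypothetical sample, and `crude_stem` is a simple suffix stripper standing in for a real stemmer or lemmatizer. Lowercasing is applied first so that the cleaning pattern only has to handle lowercase letters.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "a", "an", "to", "is", "of", "for", "you", "your"}

def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(message):
    """Lowercase, clean, tokenize, remove stopwords, and stem one message."""
    text = message.lower()                               # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # text cleaning
    tokens = text.split()                                # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [crude_stem(t) for t in tokens]               # crude stemming

print(preprocess("WINNER!! Claim your FREE prizes now, calling is free."))
# ['winner', 'claim', 'free', 'prize', 'now', 'call', 'free']
```

The cleaning pattern discards every character other than lowercase letters, digits, and whitespace. Messages containing non-English text would need a different pattern.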
3. Feature Engineering
- Convert text to numerical data:
  - Bag of Words (BoW).
  - TF-IDF (Term Frequency-Inverse Document Frequency).
  - Word embeddings: Pre-trained embeddings like Word2Vec or GloVe, or embeddings from transformer models (e.g., BERT).
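To make the first two representations concrete, the sketch below builds Bag of Words and TF-IDF vectors for already tokenized messages. The helper names and the three sample documents are hypothetical, and the smoothed IDF formula used here is one common variant among several. Word embeddings are not covered, since they require a pre-trained model.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of distinct tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc})

def bow_vector(doc, vocab):
    """Bag of Words: raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tfidf_vectors(docs, vocab):
    """TF-IDF with relative term frequency and a smoothed IDF (assumed variant)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocab])
    return vectors

# Hypothetical tokenized messages.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]
vocab = build_vocabulary(docs)
tfidf = tfidf_vectors(docs, vocab)
print(vocab)                       # ['call', 'entry', 'free', 'later', 'me', 'now', 'prize']
print(bow_vector(docs[2], vocab))  # [0, 1, 2, 0, 0, 0, 0]
```

In the second document, "me" receives a higher TF-IDF weight than "call", because "call" also appears in another document and is therefore less distinctive.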
4. Model Selection
- Use machine learning models like:
  - Naive Bayes.
  - Support Vector Machines (SVM).
  - Logistic Regression.
  - Random Forest.
- Or deep learning models:
  - Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks.
  - Transformer-based models (e.g., BERT, DistilBERT).
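Of the models listed, Naive Bayes is the simplest to write out in full. The following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing, written from scratch for illustration. The class name, the choice to ignore tokens unseen in training, and the four training messages are assumptions, not part of the project's implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial Naive Bayes over token lists, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        self.vocab = {tok for doc in docs for tok in doc}
        self.total_tokens = {
            c: sum(self.token_counts[c].values()) for c in self.classes
        }
        return self

    def log_scores(self, doc):
        """Return the unnormalized log posterior of each class for one document."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / n_docs)  # log prior
            denom = self.total_tokens[c] + self.alpha * vocab_size
            for tok in doc:
                if tok not in self.vocab:
                    continue  # assumption: tokens unseen in training are ignored
                score += math.log((self.token_counts[c][tok] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

# Hypothetical, already preprocessed training messages.
train_docs = [
    ["free", "prize", "call", "now"],
    ["win", "free", "cash", "now"],
    ["call", "me", "later"],
    ["see", "you", "at", "lunch"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesClassifier().fit(train_docs, train_labels)
print(model.predict(["free", "cash", "prize"]))  # spam
print(model.predict(["see", "you", "later"]))    # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen word from driving a class probability to zero.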
5. Train/Test Split
- Split the dataset into training and testing subsets (e.g., 80/20 split).
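A minimal sketch of an 80/20 split is shown below. The function name, the fixed seed, and the generated sample rows are assumptions made for illustration; the seed only makes the shuffle reproducible between runs.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled rows: 8 ham and 2 spam messages.
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", f"offer {i}") for i in range(2)]

train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

A plain random split can leave very few spam messages in the test set when the dataset is imbalanced. Splitting each class separately (a stratified split) is a common way to keep the spam-to-ham ratio the same in both subsets.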
Department of CS&BS P a g e | 53
6. Model Training
- Train the model using the training dataset.
7. Evaluation
- Use metrics like:
  - Accuracy: Fraction of all messages (spam and ham) that are classified correctly.
  - Precision: Ratio of correctly predicted spam messages to total predicted spam messages.
  - Recall (Sensitivity): Ratio of correctly predicted spam messages to actual spam messages.
  - F1 Score: Harmonic mean of precision and recall.
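The four metrics can be computed directly from the counts of true and false positives and negatives, with spam treated as the positive class. The sketch below uses a hypothetical function name and made-up labels purely to show the arithmetic.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 3 spam and 5 ham messages, with one miss and one false alarm.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.75, while precision, recall, and F1 are each 2/3. On an imbalanced dataset, accuracy alone can look high even when much of the spam is missed, which is why precision and recall are reported alongside it.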
8. Testing
- Use the test dataset to evaluate the model's performance.
- Input example SMS texts to spot-check the filter's predictions on individual messages.
9. Deployment
- Deploy the model in a real-world application to classify incoming SMS messages.
Chapter 7
Result Analysis