Email Spam Filtering
Computer Security Seminar
N. Muthiyalu Jothir – 271120
Media Informatics
12/18/24 Email Spam Filtering - Muthiyalu Jothir 1
Agenda
What is Spam?
Statistics
Who Benefits from it?
Spam Filtering Techniques
Combining Filters
Conclusion
What is Spam?
Spam: unsolicited email
Bulk email that sends identical
or nearly identical messages to
thousands (or millions) of recipients.
Caution!
“SPAM” (“Spiced Ham”) is also a popular
American canned meat brand.
Problem
With a tiny investment, a spammer can send over
100,000 bulk emails per hour.
Junk mail wastes storage and transmission
bandwidth.
ISPs invest to handle this load, and we absorb that cost as the ISPs’
customers.
Spam is a problem because its cost is forced
onto us, the recipients.
Statistics
Email considered spam                        40% of all email
Daily spam emails sent                       12.4 billion
Daily spam received per person               6
Annual spam received per person              2,200
Spam cost to all non-corp. Internet users    $255 million
Spam cost to all U.S. corporations in 2002   $8.9 billion
Estimated spam increase by 2007              63%
Users who reply to spam email                28%
Users who purchased from spam email          8%
Wasted corporate time per spam email         4-5 seconds
Who benefits from Spam?
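[Diagram: who profits from spam]
• Spammers: send spam to recipients and share the profit with lead generators
• Recipients: reply to the lead generators
• Lead generators: collect the replies and gain 2% of the loan value per customer data
• Financial firms (e.g. mortgage): receive information about interested customers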
Spam Control Techniques
Fight-back techniques
• Reporting spam to ISP
• Fight-back filters
• Slow senders
• Law ???
• etc.
Filtering techniques
• Challenge-response filtering
• Blacklists and white lists
• Content-based filters: rule-based, Bayesian
Reporting Spam To ISPs
Original spam solution
Legitimate ISPs respond to such
complaints
Spammers are kicked off
Disadvantages
Spammers disguise their identity.
Naïve users cannot interpret the
email headers
Filters that Fight Back (FFB)
The majority of spam contains links to web pages.
Spam filters could automatically retrieve the URLs and crawl
those pages, which would increase the load on the spammer's server.
If all spam receivers did this at the same time, the server
might crash, so the cost of spamming would increase.
Caution!
FFB usually works with blacklists (of malicious
servers) in order to avoid attacking innocent
servers.
Filtering Techniques
Spam Vs Ham
Care must be taken in any spam filtering technique:
“All the spam could be allowed to pass through, but
not even a single legitimate mail should be filtered.”
False positive – legitimate mail classified as spam.
The lowest possible false positive rate is desired.
Caution: check your junk folder before deleting.
Don’t blindly trust your spam filter.
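As a rough illustration of the false positive definition above, the rate can be measured on a labeled test set as the share of legitimate mails that the filter marked as spam. The function name and the "spam"/"ham" label strings are assumptions made for this sketch.

```python
def false_positive_rate(true_labels, predicted_labels):
    """Share of legitimate ("ham") mails that were wrongly classified as spam."""
    ham_total = sum(1 for t in true_labels if t == "ham")
    false_positives = sum(
        1 for t, p in zip(true_labels, predicted_labels)
        if t == "ham" and p == "spam"
    )
    # With no legitimate mail in the sample, no false positive is possible.
    return false_positives / ham_total if ham_total else 0.0


true_labels = ["ham", "ham", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham"]
rate = false_positive_rate(true_labels, predicted)  # 1 of 3 ham mails was flagged
```

Note that the spam mail that slipped through in this sample does not count against the false positive rate; only flagged legitimate mail does.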
Challenge-Response Filtering
Emails from unknown senders receive an auto-reply
message asking the senders to verify themselves.
Senders are “challenged” to type in a word that is hidden
within a graphic or a sound file.
Mail is forwarded to the receiver’s inbox only after a successful
“response”.
This technique filters almost all spam. Few spammers would
be willing to take the extra effort to verify themselves.
Commercial product: “Spam Arrest”
Disadvantages
This technique is rude to legitimate senders.
Sometimes senders ignore or forget to reply to the challenge.
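The challenge-response flow described above could be sketched as follows. This is a minimal sketch, not how any commercial product works: the class and method names are hypothetical, and a random token stands in for the word hidden in a graphic or sound file.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, known_senders=()):
        self.known = set(known_senders)
        self.pending = {}  # challenge token -> (sender, held message)

    def receive(self, sender, message):
        """Deliver mail from known senders; challenge everyone else."""
        if sender in self.known:
            return "inbox", None
        token = secrets.token_hex(8)  # stand-in for the hidden word
        self.pending[token] = (sender, message)
        return "challenged", token

    def respond(self, token):
        """Release the held message if the response matches a pending challenge."""
        if token not in self.pending:
            return None
        sender, message = self.pending.pop(token)
        self.known.add(sender)  # later mail from this sender skips the challenge
        return message
```

A message whose challenge is never answered simply stays in `pending`, which mirrors the disadvantage listed above: legitimate senders who forget to reply never reach the inbox.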
Blacklists and White lists
Blacklists contain misbehaving servers or known spammers and
are collected by several sites.
The sender ID in the email is compared with the blacklist.
White lists are complementary to blacklists and contain
addresses of trusted contacts.
Use blacklists and white lists for first-level filtering
(before applying content checks), not as the only
tool for making a decision.
Disadvantage
Prone to misconfiguration: legitimate servers may be unable to
get off a list on which they were incorrectly placed.
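First-level filtering with the two lists might look like the sketch below. The function name, the three return values, and the example addresses are assumptions for illustration; real blacklists are usually queried from external services rather than held in a local set.

```python
def first_level_filter(sender, blacklist, whitelist):
    """Return 'accept', 'reject', or 'check_content' for a sender address."""
    sender = sender.strip().lower()
    domain = sender.rsplit("@", 1)[-1]
    if sender in whitelist:
        return "accept"         # trusted contact: skip content checks
    if sender in blacklist or domain in blacklist:
        return "reject"         # known spammer or misbehaving server
    return "check_content"      # unknown sender: defer to content filters


blacklist = {"spammer@bulk.example", "bad-relay.example"}
whitelist = {"alice@example.org"}
```

Unknown senders fall through to `check_content`, so the lists never act as the only tool for making the decision.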
Content based filters
It is not a good idea to filter mail based only
on blacklists.
Wiser decision: consider the actual
content of the email.
Almost all successful spam filters use
this technique.
Major types: rule-based and Bayesian
Rule Based Filters
Rule-based filters use
static rules to decide whether a mail is
spam or not.
Rules could check for
• words and phrases
• lots of uppercase characters
• exclamation points
• special characters
• Web links
• HTML messages
• background colors
• unusual Subject lines, etc.
Rule based filters
Rules are given scores based on importance.
Incoming mails are parsed and checked for
known malicious patterns.
The total score is calculated over the triggered rules.
If final score > threshold, classify as spam.
Otherwise, classify as legitimate mail.
The threshold is decided by the user.
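The scoring scheme above can be sketched in a few lines. The rule names, patterns, scores, and the default threshold of 3.0 are all made up for this example and do not come from any real product.

```python
import re

# Hypothetical rules: (name, compiled pattern, score).
RULES = [
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("SHOUTING", re.compile(r"\b[A-Z]{5,}\b"), 1.0),
    ("HAS_WEB_LINK", re.compile(r"https?://", re.IGNORECASE), 0.5),
    ("SPAM_PHRASE", re.compile(r"free money", re.IGNORECASE), 2.5),
]


def score_message(text):
    """Return the total score and the names of the triggered rules."""
    triggered = [(name, score) for name, pattern, score in RULES
                 if pattern.search(text)]
    return sum(score for _, score in triggered), [name for name, _ in triggered]


def classify(text, threshold=3.0):
    """Classify as spam only if the total score exceeds the user's threshold."""
    total, _ = score_message(text)
    return "spam" if total > threshold else "ham"
```

Raising the threshold trades missed spam for fewer false positives, which is why the slide leaves that choice to the user.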
Rule Based Filters
“SpamAssassin”, a popular spam filtering product,
uses rule-based filtering.
Perl regular expressions (regex) are used for pattern
checking.
Example rules
• header __LOCAL_FROM_NEWS From =~ /news@example\.com/i
• body __LOCAL_SALES_FIGURES /\bMonthly Sales Figures\b/
• meta LOCAL_NEWS_SALES_FIGURES (__LOCAL_FROM_NEWS && __LOCAL_SALES_FIGURES)
• score LOCAL_NEWS_SALES_FIGURES 0.8
Rule Based Filters
Advantages
Easy to implement
No training required
Disadvantages
Static rules are too general
Spammers find new ways to evade the
rules
Bayesian Filters
Bayesian filters are among the most recent and
most successful spam filtering
techniques.
Bayes classifiers have been used extensively in
the field of pattern recognition.
Given an unlabeled example, the classifier
calculates the most likely classification
together with its probability.
Bayesian Filters
Steps in Bayes Filtering
• Training
• Validation
• Implementation
Training starts with two collections of mail: one of spam and
one of legitimate mail.
For every word in these emails, the filter calculates a spam probability
based on the proportion of spam occurrences.
Bayesian filters are quite accurate and adapt automatically as
spam evolves.
Bayesian filtering minimizes false positives because
it considers evidence of innocence as well as evidence of
spam.
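The training step described above could be sketched as below. This is one simple way to turn "proportion of spam occurrences" into a per-word probability, assuming whitespace tokenization and equal weight for both collections; real filters also smooth or clamp the extreme values 0 and 1.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def train(spam_mails, ham_mails):
    """Return {word: estimated Pr(spam | word)}; both collections must be non-empty."""
    # Count each word at most once per mail.
    spam_counts = Counter(w for m in spam_mails for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_mails for w in set(tokenize(m)))
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        # Fraction of mails in each collection that contain the word.
        spam_rate = spam_counts[word] / len(spam_mails)
        ham_rate = ham_counts[word] / len(ham_mails)
        probs[word] = spam_rate / (spam_rate + ham_rate)
    return probs


probs = train(["cheap loans now", "cheap pills"],
              ["meeting now", "project report"])
```

Words seen only in legitimate mail get a probability of 0, which is the "evidence of innocence" that later pulls a message's overall score down.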
Bayesian Filtering
Bayes probability:
$$\Pr(\text{spam} \mid \text{words}) = \frac{\Pr(\text{spam}) \, \Pr(\text{words} \mid \text{spam})}{\Pr(\text{words})}$$
Probability closer to 1 would be classified as spam and
closer to 0 is classified as ham.
0.5 is set as the threshold.
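To make the 0.5 threshold concrete, here is a minimal sketch that combines per-word spam probabilities into one score for a message. It assumes the words are independent and that spam and ham are equally likely a priori (the usual "naive" simplifications, which the slide does not state); the word probabilities in `word_probs` are invented.

```python
import math


def spam_probability(words, word_probs, unknown=0.5):
    """Combine per-word spam probabilities into Pr(spam | words)."""
    log_spam = 0.0
    log_ham = 0.0
    for w in words:
        p = word_probs.get(w, unknown)   # unseen words are treated as neutral
        p = min(max(p, 0.01), 0.99)      # clamp so log(0) never occurs
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # S / (S + H) rewritten as 1 / (1 + H / S); the cap avoids overflow in exp.
    diff = min(log_ham - log_spam, 700.0)
    return 1.0 / (1.0 + math.exp(diff))


def classify(words, word_probs, threshold=0.5):
    return "spam" if spam_probability(words, word_probs) > threshold else "ham"


word_probs = {"cheap": 0.9, "loans": 0.8, "meeting": 0.1, "report": 0.2}
```

For `["cheap", "loans"]` the score is 0.72 / (0.72 + 0.02), roughly 0.97, so the message is classified as spam; a message with no known words scores exactly 0.5 and is kept as ham.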
Neural Network for Training
[Figure: neural network structure]
Neural Networks for Training
A neural network is used to train the
spam filter (rule-based or Bayesian); it
is not itself a filter.
Inputs: words, rules, etc.
It is trained over multiple samples of the
user’s mails (both spam and ham).
The weights of the links are altered until the
desired output is obtained.
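The idea of altering weights until the desired output is obtained can be illustrated with a single artificial neuron (a perceptron), which is far simpler than the multi-layer network the slides refer to. The two input features, the learning rate, and the training samples are assumptions for this sketch.

```python
def predict(x, weights, bias):
    """Output 1 (spam) if the weighted sum of the inputs is positive, else 0 (ham)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Adjust the weights until every training sample gets its desired output."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            delta = target - predict(x, weights, bias)
            if delta != 0:
                errors += 1
                # Move each weight toward the desired ("teacher") output.
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # weights no longer change: training has converged
            break
    return weights, bias


# Hypothetical features: [mail contains "free", mail contains "meeting"].
samples = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 0, 0, 0]  # 1 = spam, 0 = ham
weights, bias = train_perceptron(samples, labels)
```

The learned weights play the role of rule scores: a positive weight marks a feature as evidence of spam, a negative one as evidence of ham.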
Supervised Learning
Supervised learning: training with a
“teacher” signal
Train the system until the weights of the edges
converge and no longer change.
Caution!
Take care not to overtrain the network.
Combining Spam Filters
Goal: the combined filter aims to improve on the
individual filters’ performance.
Combined Filter = Original Filter (OF) + Received Filter (RF)
Maximum gain: when the received filter contains feature
sets not found in the original filter.
E.g.
Original Filter = {“Share Market”, “Higher Studies”}
Received Filter = {“Share Market”, “Job Alerts”}
Challenges
Decisions (spam / ham) are made by both
filters individually.
Decisions agree: no problem.
Decisions disagree: due to the difference in
feature sets.
Challenges
• “How do we select the correct decision or filter?”
• “Who selects it?”
Filter Selector (FS)
Training phase: FS predicts the unique
features (e.g. words) of RF.
Parse the emails of the training set and
extract the features.
Build a ‘bag’ of (predicted) features for RF.
Compare text similarity between the
current e-mail's features and the feature
sets of the filters.
Algorithm Flowchart
[Flowchart: 1. Training Phase, 2. Final Verdict]
TF – IDF Similarity Measure
Commonly used in information retrieval
applications.
More frequent words are key to
accurate classification of emails.
The feature set predicted by FS is unique.
Treated as a “query – document” retrieval procedure:
• 2 documents – the feature sets
• Query – the current email
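The query-document view above can be sketched as follows: the two feature sets are the documents, the current email is the query, and the filter whose feature set is most similar to the email is selected. The function names, the smoothed IDF formula, and the use of cosine similarity are assumptions for this sketch; the slides do not specify these details.

```python
import math
from collections import Counter


def tfidf_vectors(documents, query):
    """Weight terms by TF-IDF, with IDF computed over the documents only."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF (an assumption): terms shared by all documents keep weight 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}

    def weigh(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    return [weigh(d) for d in documents], weigh(query)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_filter(email_tokens, feature_sets):
    """Return the name of the most similar feature set and all similarity scores."""
    names = list(feature_sets)
    doc_vecs, query_vec = tfidf_vectors([feature_sets[n] for n in names],
                                        email_tokens)
    scores = {n: cosine(query_vec, v) for n, v in zip(names, doc_vecs)}
    return max(scores, key=scores.get), scores


feature_sets = {
    "original": ["share", "market", "higher", "studies"],
    "received": ["share", "market", "job", "alerts"],
}
best, scores = select_filter("new job alerts for you".split(), feature_sets)
```

With these example feature sets, an email about job alerts matches only the received filter's unique features, so that filter's decision would be preferred.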
Experiments & Results
Conclusion
We discussed techniques to “kill” spam.
We compared the various techniques.
So far, Bayesian filtering seems to be reliable.
We discussed a new approach to combining filters.
Future work:
• Learning techniques for the Filter Selector
• Better similarity measures
Thank You