
Data Mining

Classification: Alternative Techniques


Lecture Notes for Chapter 5

Introduction to Data Mining


by
Tan, Steinbach, Kumar
Rule-Based Classifier

Classify records by using a collection of “if…then…” rules

Rule: (Condition) → y
– where
◆ Condition is a conjunction of attributes
◆ y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
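As a concrete illustration, here is a minimal Python sketch of the coverage check; the dictionary representation of rules and records is an assumption for illustration, not part of the original notes:

# A rule is a (condition, class) pair; the condition maps attribute -> required value.
# A rule covers a record if every attribute test in the condition is satisfied.
R1 = ({"Give Birth": "no", "Can Fly": "yes"}, "Birds")
R3 = ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals")

def covers(rule, record):
    condition, _ = rule
    return all(record.get(attr) == value for attr, value in condition.items())

hawk = {"Blood Type": "warm", "Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
grizzly = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"}

print(covers(R1, hawk))      # True  -> hawk is classified as Birds
print(covers(R3, grizzly))   # True  -> grizzly bear is classified as Mammals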

Rule Coverage and Accuracy

Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
– Fraction of records that satisfy both the antecedent and consequent of a rule

Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%
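To make the two definitions concrete, a short Python sketch computing coverage and accuracy of (Status=Single) → No on the ten records above (the tuple encoding of the table is assumed for illustration):

# Records from the table: (Refund, Marital Status, Taxable Income in K, Class)
records = [
    ("Yes", "Single",   125, "No"), ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"), ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"), ("No", "Single",    90, "Yes"),
]

# Rule: (Status=Single) -> No
covered = [r for r in records if r[1] == "Single"]
correct = [r for r in covered if r[3] == "No"]

coverage = len(covered) / len(records)   # 4/10 = 40%
accuracy = len(correct) / len(covered)   # 2/4  = 50%
print(coverage, accuracy)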
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
This example illustrates two IMPORTANT PROPERTIES of rule-based classifiers: what happens when a record triggers more than one rule, and what happens when it triggers no rule

Characteristics of Rule-Based Classifier

Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule

Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule

From Decision Trees To Rules

[Decision tree: Refund? (Yes → NO; No → Marital Status); Marital Status ({Single, Divorced} → Taxable Income; {Married} → NO); Taxable Income (< 80K → NO; > 80K → YES)]

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree

Rules Can Be Simplified


[Decision tree: Refund? (Yes → NO; No → Marital Status); Marital Status ({Single, Divorced} → Taxable Income; {Married} → NO); Taxable Income (< 80K → NO; > 80K → YES)]

Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification

Rules are no longer mutually exclusive
– A record may trigger more than one rule
– Solution?
◆ Ordered rule set (order by accuracy, coverage, etc.)
◆ Unordered rule set – use voting schemes

Rules are no longer exhaustive
– A record may not trigger any rule
– Solution?
◆ Use a default class
Ordered Rule Set

Rules are rank ordered according to their priority
– An ordered rule set is known as a decision list

When a test record is presented to the classifier
– It is assigned to the class label of the highest-ranked rule it has triggered
– If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
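A small Python sketch of how an ordered rule set (decision list) might be applied, assuming the ranking R1 > R2 > … > R5 and a default class; the data structures are illustrative assumptions:

# Rules listed in rank order; the first rule whose condition matches fires.
decision_list = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),
    ({"Live in Water": "sometimes"},                "Amphibians"),
]

def classify(record, rules, default="Unknown"):
    for condition, label in rules:
        if all(record.get(a) == v for a, v in condition.items()):
            return label            # highest-ranked triggered rule wins
    return default                  # no rule fired -> default class

turtle = {"Blood Type": "cold", "Give Birth": "no",
          "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(turtle, decision_list))   # R4 fires before R5 -> "Reptiles"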



Rule Ordering Schemes

Rule-based ordering
– Individual rules are ranked based on their quality

Class-based ordering
– Rules that belong to the same class appear together



Rule-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

Building Classification Rules

Direct Method:
◆ Extract rules directly from data
◆ e.g.: RIPPER, CN2, Holte’s 1R



Indirect Method:
◆ Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
◆ e.g: C4.5rules
Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
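A compact Python sketch of this loop; learn_one_rule, covers, and good_enough are hypothetical placeholders for the rule-growing, coverage, and evaluation steps:

def sequential_covering(records, learn_one_rule, covers, good_enough):
    """Sketch of sequential covering: grow rules one at a time and remove
    the training records covered by each accepted rule."""
    rules = []
    remaining = list(records)
    while remaining:                                  # stop when no records remain
        rule = learn_one_rule(remaining)              # step 2: grow one rule
        if not good_enough(rule, remaining):          # step 4: stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]   # step 3
    return rules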
Example of Sequential Covering

[Figure: sequential covering – (i) Original Data; (ii) Step 1; (iii) Step 2, rule R1 found; (iv) Step 3, rules R1 and R2]


Aspects of Sequential Covering

Rule Growing

Rule Evaluation

Instance Elimination

Stopping Criterion



Rule Pruning



Rule Growing

Two common strategies

[Figure: rule growing from the empty rule {} (Yes: 3, No: 4); candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, …, Income>80K are evaluated]
Rule Evaluation

RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add conjuncts that maximize FOIL’s information gain measure:
◆ R0: {} => class (initial rule)
◆ R1: {A} => class (rule after adding conjunct)
◆ Gain(R0, R1) = t [ log2(p1/(p1+n1)) – log2(p0/(p0+n0)) ]
◆ where
t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1


Rule Evaluation (Through Foil’s Information Gain)

Suppose a training set contains 60 positive examples and 100 negative examples, and we are given two candidate rules:
– Rule r1: covers 50 positive examples and 5 negative examples
– Rule r2: covers 2 positive examples and no negative examples
◆ Gain(R0, r1) = 50 × (log2(50/55) – log2(60/160)) = 50 × (–0.137 + 1.415) ≈ 63.9
◆ Gain(R0, r2) = 2 × (log2(2/2) – log2(60/160)) ≈ 2.8
– Choose rule r1 as it has the higher gain
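A quick Python check of these numbers, assuming FOIL's gain with log base 2:

from math import log2

def foil_gain(p0, n0, p1, n1):
    # t = number of positives covered by both R0 and R1; here the grown rule's
    # positives are assumed to be a subset of R0's, so t = p1.
    t = p1
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# R0 = empty rule covering the whole training set: 60 positives, 100 negatives
print(foil_gain(60, 100, 50, 5))   # ~63.9 for rule r1
print(foil_gain(60, 100, 2, 0))    # ~2.8  for rule r2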
Rule Evaluation by Other Metrics

Metrics:
– Accuracy = nc / n
◆ Rule r1 has accuracy = 50/55; r2 has accuracy = 2/2
– Laplace = (nc + 1) / (n + k)
◆ r1: 51/57 = 89.47%; r2: 3/4 = 75%
– M-estimate = (nc + k·p) / (n + k)
◆ when p = 1/k, the M-estimate becomes the same as Laplace

where
n: number of instances covered by rule
nc: number of positive instances covered by rule
k: number of classes
p: prior probability for the positive class
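A minimal Python sketch of the three metrics for the same two rules (k = 2 classes, and p is assumed to be 1/k):

def accuracy(nc, n):            return nc / n
def laplace(nc, n, k):          return (nc + 1) / (n + k)
def m_estimate(nc, n, k, p):    return (nc + k * p) / (n + k)

k = 2
for name, nc, n in [("r1", 50, 55), ("r2", 2, 2)]:
    print(name, accuracy(nc, n), laplace(nc, n, k), m_estimate(nc, n, k, 1 / k))
# r1: accuracy 0.909, Laplace 51/57 ~= 0.895, m-estimate equals Laplace when p = 1/k
# r2: accuracy 1.0,   Laplace 3/4  = 0.75,   m-estimate 0.75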

Rule Evaluation Metric (Likelihood ratio)

R = 2 Σ(i=1..k) fi log2( fi / ei )

where
k is the number of classes
fi is the observed frequency of class i examples covered by the rule
ei is the expected frequency of class i

(For example, for Rule r1: f1 = 50, f2 = 5, e1 = 55×60/160 = 20.625, e2 = 55×100/160 = 34.375)
R statistic for Rule r1: R(r1) = 99.9
R statistic for Rule r2: R(r2) = 5.66
So Rule r1 is a better rule than r2
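A short Python check of the likelihood-ratio numbers (log base 2 assumed, matching the worked values):

from math import log2

def likelihood_ratio(observed, total_pos, total_neg):
    """observed = (f_pos, f_neg) covered by the rule; expected counts are
    proportional to the class distribution of the whole training set."""
    f1, f2 = observed
    n = f1 + f2
    e1 = n * total_pos / (total_pos + total_neg)
    e2 = n * total_neg / (total_pos + total_neg)
    r = 2 * f1 * log2(f1 / e1)
    if f2 > 0:                      # a zero count contributes nothing
        r += 2 * f2 * log2(f2 / e2)
    return r

print(likelihood_ratio((50, 5), 60, 100))   # ~99.9 for rule r1
print(likelihood_ratio((2, 0), 60, 100))    # ~5.66 for rule r2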
Instance Elimination
Why do we need to eliminate instances?
– Otherwise, the next rule is identical to the previous rule

Why do we remove positive instances?
– Ensure that the next rule is different

Why do we remove negative instances?
– Prevent underestimating the accuracy of a rule
– Compare rules R2 and R3 in the diagram

[Figure: rules R1, R2, R3 covering overlapping regions of positive (class = +) and negative (class = –) instances]

Before starting: accu(R1) = 12/15 (80%), accu(R2) = 7/10 (70%), accu(R3) = 8/12 (66.7%)
First choose R1, as it has the highest accuracy; remove both the positive and negative instances covered by it, then recalculate the accuracy of R2 and R3 in the diagram.
Now accu(R2) = 70% but accu(R3) = 6/8 = 75%, so the better choice is R1 followed by R3.

Stopping Criterion and Rule Pruning

Stopping criterion
– Compute the gain
– If gain is not significant, discard the new rule

Rule Pruning
– Similar to post-pruning of decision trees
– Reduced Error Pruning:
◆ Remove one of the conjuncts in the rule
◆ Compare error rate on validation set before and after pruning
◆ If error improves, prune the conjunct
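A rough Python sketch of reduced error pruning for one rule, assuming a rule is represented as a list of conjuncts and an error_rate(rule, validation_set) helper is available (both are hypothetical):

def reduced_error_prune(conjuncts, validation_set, error_rate):
    """Repeatedly drop a conjunct whenever its removal lowers validation error."""
    best = list(conjuncts)
    improved = True
    while improved and len(best) > 1:
        improved = False
        base_error = error_rate(best, validation_set)
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]            # remove one conjunct
            if error_rate(candidate, validation_set) < base_error:
                best, improved = candidate, True           # keep the pruned rule
                break
    return best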


Summary of Direct Method

Grow a single rule

Remove Instances from rule

Prune the rule (if necessary)

Add rule to Current Rule Set



Repeat
Direct Method: RIPPER

For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
– Learn rules for the positive class
– Negative class will be the default class

For a multi-class problem
– Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
– Learn the rule set for the smallest class first, treat the rest as the negative class
– Repeat with the next smallest class as the positive class
Direct Method: RIPPER
Growing a rule:
– Start from empty rule
– Add conjuncts as long as they improve FOIL’s information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced error pruning
– Measure for pruning: v = (p − n) / (p + n)
◆ p: number of positive examples covered by the rule in the validation set
◆ n: number of negative examples covered by the rule in the validation set
◆ If the metric v improves after pruning, then the conjunct is removed
– Pruning method: delete any final sequence of conditions that maximizes v
– For example, for rule ABCD → y, RIPPER checks whether D should be removed first, followed by CD, then BCD, etc.
– The original rule covers only positive examples, whereas the pruned rule may cover some negative examples from the training set
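A small Python sketch of this pruning step, assuming a counts(prefix) helper (hypothetical) that returns (p, n) on the validation set for a rule using only the given prefix of conjuncts:

def ripper_prune(conjuncts, counts):
    """Delete the final sequence of conditions that maximizes v = (p - n) / (p + n)."""
    def v(prefix):
        p, n = counts(prefix)
        return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

    best_len, best_v = len(conjuncts), v(conjuncts)
    # For A B C D: first consider dropping D, then CD, then BCD.
    for keep in range(len(conjuncts) - 1, 0, -1):
        candidate_v = v(conjuncts[:keep])
        if candidate_v > best_v:
            best_len, best_v = keep, candidate_v
    return conjuncts[:best_len]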

Indirect Methods

[Decision tree: P? (No → Q; Yes → R); left Q (No → -, Yes → +); R (No → +, Yes → Q); right Q (No → -, Yes → +)]

Rule Set:
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +

Indirect Method: C4.5rules

Extract rules from an unpruned decision tree


For each rule, r: A → y,
– consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A
– Compare the pessimistic error rate for r against all r's
– Prune if one of the r's has a lower pessimistic error rate
– Repeat until we can no longer improve the generalization error


Indirect Method: C4.5rules

Instead of ordering the rules, order subsets of rules (class ordering)
– Each subset is a collection of rules with the same rule consequent (class)
– Compute the description length of each subset
◆ Description length = L(error) + g L(model)
◆ g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)
◆ the g value is small if the rule set contains many redundant attributes
– The class that has the lowest description length is given the highest priority, as it contains the best set of rules


Example
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds

C4.5 versus C4.5rules versus RIPPER

[Decision tree (C4.5): Give Birth? (Yes → Mammals; No → Live In Water?); Live In Water? (Yes → Fishes; Sometimes → Amphibians; No → Can Fly?); Can Fly? (Yes → Birds; No → Reptiles)]

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals

C4.5 versus C4.5rules versus RIPPER
C4.5 and C4.5rules:
PREDICTED CLASS
ACTUAL CLASS   Amphibians Fishes Reptiles Birds Mammals
Amphibians     2          0      0        0     0
Fishes         0          2      0        0     1
Reptiles       1          0      3        0     0
Birds          1          0      0        3     0
Mammals        0          0      1        0     6

RIPPER:
PREDICTED CLASS
ACTUAL CLASS   Amphibians Fishes Reptiles Birds Mammals
Amphibians     0          0      0        0     2
Fishes         0          3      0        0     0
Reptiles       0          0      3        0     1
Birds          0          0      1        2     1
Mammals        0          2      1        0     4

Advantages of Rule-Based Classifiers

As highly expressive as decision trees


Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees



Instance-Based Classifiers

[Figure: Set of Stored Cases]
Instance Based Classifiers

Examples:
– Rote-learner
◆ Memorizes entire training data and performs classification only if the attributes of a record match one of the training examples exactly

– Nearest neighbor
◆ Uses k “closest” points (nearest neighbors) for performing classification



Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck



Nearest-Neighbor Classifiers
[Figure: an unknown record plotted among labeled training records]

Requires three things
– The set of stored records
– Distance metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of the unknown record (e.g., by taking majority vote)


Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification

Compute distance between two points:
– Euclidean distance
d(p, q) = sqrt( Σi (pi − qi)² )

Determine the class from the nearest-neighbor list
– take the majority vote of class labels among the k nearest neighbors
– Weigh the vote according to distance
◆ weight factor, w = 1/d²
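A minimal Python sketch of k-NN classification with Euclidean distance and (optionally distance-weighted) majority voting; the toy training points are made up for illustration:

from collections import defaultdict
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x, k=3, weighted=False):
    """train is a list of (point, label) pairs; x is the query point."""
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, x)
        votes[label] += 1.0 / (d * d) if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> "A"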
Nearest Neighbor Classification…

Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, the neighborhood may include points from other classes


Nearest Neighbor Classification…

Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
◆ height of a person may vary from 1.5m to 1.8m
◆ weight of a person may vary from 90lb to 300lb
◆ income of a person may vary from $10K to $1M
Nearest neighbor Classification…

k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree induction and rule-based systems
– Classifying unknown records is relatively expensive
Bayes Classifier

A probabilistic framework for solving classification problems

Conditional Probability:
P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)

Bayes theorem:
P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem

Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20

If a patient has a stiff neck, what is the probability that he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Bayesian Classifiers

Consider each attribute and class label as random variables

Given a record with attributes (A1, A2, …, An)
– Goal is to predict class C
– Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)

Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers
Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem

P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)

– Choose the value of C that maximizes P(C | A1, A2, …, An)
– Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)

How to estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier

Assume independence among attributes Ai when the class is given:
– P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

– Can estimate P(Ai | Cj) for all Ai and Cj.

– A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal.


How to Estimate Probabilities from Data?

Class: P(C) = Nc / N
– e.g., P(No) = 7/10, P(Yes) = 3/10

Tid Refund Marital Status Taxable Income Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
How to Estimate Probabilities from Data?

For discrete attributes:
– P(Ai | Ck) = |Aik| / Nc
◆ where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
– Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0

For continuous attributes:
– Discretize the range into bins
◆ one ordinal attribute per bin
◆ violates independence assumption
– Two-way split: (A < v) or (A > v)

How to Estimate Probabilities from Data?
◆ choose only one of the two splits as the new attribute
– Probability density estimation:
◆ Assume the attribute follows a normal distribution
◆ Use data to estimate the parameters of the distribution (e.g., mean and standard deviation)
◆ Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?
Normal distribution:
P(Ai | cj) = ( 1 / sqrt(2π σij²) ) exp( −(Ai − μij)² / (2 σij²) )
– One for each (Ai, cj) pair

Tid Refund Marital Status Taxable Income Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

For (Income, Class=No):
– If Class=No

◆ sample mean = 110
◆ sample variance = 2975

P(Income=120 | No) = ( 1 / (sqrt(2π) × 54.54) ) × exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
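A quick Python check of this density value:

from math import exp, pi, sqrt

def normal_pdf(x, mean, variance):
    # Gaussian density used as the estimate of P(Ai | c) for a continuous attribute.
    return (1.0 / sqrt(2 * pi * variance)) * exp(-((x - mean) ** 2) / (2 * variance))

print(normal_pdf(120, 110, 2975))   # ~0.0072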


Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund=No, Married, Income=120K)

Naïve Bayes Classifier:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No) = 1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes) = 1/7
P(Marital Status=Married|Yes) = 0

For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25

P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
= 4/7 × 4/7 × 0.0072 = 0.0024
P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
= 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes)
Therefore P(No|X) > P(Yes|X)
=> Class = No
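A Python sketch that reproduces this decision, using the conditional probabilities and income parameters listed above (the dictionary layout is an illustrative assumption):

from math import exp, pi, sqrt

def normal_pdf(x, mean, variance):
    return (1.0 / sqrt(2 * pi * variance)) * exp(-((x - mean) ** 2) / (2 * variance))

priors      = {"No": 7 / 10, "Yes": 3 / 10}
p_refund_no = {"No": 4 / 7,  "Yes": 1.0}
p_married   = {"No": 4 / 7,  "Yes": 0.0}
income_params = {"No": (110, 2975), "Yes": (90, 25)}   # (sample mean, sample variance)

scores = {}
for c in ("No", "Yes"):
    mean, var = income_params[c]
    likelihood = p_refund_no[c] * p_married[c] * normal_pdf(120, mean, var)
    scores[c] = likelihood * priors[c]        # P(X | C) * P(C)

print(scores)                         # "No" has the larger score
print(max(scores, key=scores.get))    # => Class = No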
Naïve Bayes Classifier

If one of the conditional probabilities is zero, then the entire expression becomes zero

Probability estimation:
– Original: P(Ai | C) = Nic / Nc
– Laplace: P(Ai | C) = (Nic + 1) / (Nc + c)
– m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)

where
Nic: number of examples that have attribute value Ai and class C
Nc: number of examples that are from class C
c: number of classes
p: prior probability
m: parameter
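A brief Python sketch of the three estimators as written above, applied to P(Refund=Yes | Yes), which has Nic = 0 and Nc = 3 in the earlier table; the values of m and p are chosen only for illustration:

def original(nic, nc):          return nic / nc
def laplace(nic, nc, c):        return (nic + 1) / (nc + c)
def m_estimate(nic, nc, m, p):  return (nic + m * p) / (nc + m)

c, m, p = 2, 2, 1 / 2            # c classes; m and p are illustrative choices
print(original(0, 3))            # 0.0 -> would make the whole product zero
print(laplace(0, 3, c))          # (0 + 1) / (3 + 2) = 0.2
print(m_estimate(0, 3, m, p))    # (0 + 2*0.5) / (3 + 2) = 0.2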


Example of Naïve Bayes Classifier

Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals

A: attributes, M: mammals, N: non-mammals

Test record:
Give Birth Can Fly Live in Water Have Legs Class
yes no yes no ?

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N) => Mammals
Naïve Bayes (Summary)

Robust to isolated noise points

Handles missing values by ignoring the instance during probability estimate calculations

Robust to irrelevant attributes

Independence assumption may not hold for some attributes
– Use other techniques such as Bayesian Belief Networks (BBN)
