Bayesian Classification
Bayes' Theorem - Thomas Bayes
Bayes' Theorem - Basics
Bayesian Classification
A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, the naïve Bayesian
classifier, has performance comparable to decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
Bayes’ Theorem: Basics
Total probability theorem:
P(B) = \sum_{i=1}^{M} P(B | A_i)\, P(A_i)
Bayes' theorem:
P(H | X) = \frac{P(X | H)\, P(H)}{P(X)} = P(X | H)\, P(H) / P(X)
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed
data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X,
given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is
31..40 with medium income
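Below is a minimal Python sketch of Bayes' theorem; the prior, likelihood, and evidence values are illustrative assumptions, not figures from the slides.

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    # P(H|X) = P(X|H) * P(H) / P(X)
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: H = "customer buys a computer", X = the observed attributes.
p_h = 0.6          # P(H): prior probability of buying, regardless of attributes
p_x_given_h = 0.3  # P(X|H): probability of observing X among buyers
p_x = 0.25         # P(X): probability of observing X overall

print(posterior(p_h, p_x_given_h, p_x))  # 0.3 * 0.6 / 0.25 = 0.72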
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes’ theorem
P(H | X) = \frac{P(X | H)\, P(H)}{P(X)} = P(X | H)\, P(H) / P(X)
Informally, this can be viewed as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
Practical difficulty: It requires initial knowledge of
many probabilities, involving significant computational
cost
Naïve Bayes Classification
Let D be a training set of tuples and their associated
class labels, and each tuple is represented by an n-D
attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Given a tuple X, the classifier will predict that X
belongs to the class having the highest posterior
probability, conditioned on X. That is, the naïve
Bayesian classifier predicts that tuple X belongs to the
class Ci if and only if
P(C_i | X) > P(C_j | X) \quad \text{for } 1 \le j \le m,\ j \ne i
Classification Is to Derive the Maximum Posteriori
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X). This can be derived from Bayes' theorem:
P(C_i | X) = \frac{P(X | C_i)\, P(C_i)}{P(X)}
Since P(X) is constant for all classes, only P(X | C_i)\, P(C_i)
needs to be maximized:
P(C_i | X) \propto P(X | C_i)\, P(C_i)
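Below is a minimal Python sketch of this maximum-posteriori rule; the per-class priors and likelihoods are illustrative placeholders. Because P(X) is the same for every class, it can be dropped without changing which class wins.

priors = {"C1": 0.643, "C2": 0.357}       # P(Ci), illustrative values
likelihoods = {"C1": 0.044, "C2": 0.019}  # P(X|Ci) for one fixed sample X

# Maximizing P(X|Ci)P(Ci)/P(X) picks the same class as maximizing P(X|Ci)P(Ci),
# because P(X) is a constant shared by all classes.
scores = {c: likelihoods[c] * priors[c] for c in priors}
predicted = max(scores, key=scores.get)

print(scores)     # {'C1': 0.028..., 'C2': 0.0067...}
print(predicted)  # 'C1'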
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) \times P(x_2 | C_i) \times \cdots \times P(x_n | C_i)
This greatly reduces the computation cost: only the class
distribution needs to be counted
If Ak is categorical, P(xk|Ci) is the number of tuples in
Ci having value xk for Ak divided by |Ci, D| (number of
tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed
based on a Gaussian distribution with a mean μ and
standard deviation σ:
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
and P(xk|Ci) is
P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
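Below is a minimal Python sketch of both estimates for a single class Ci, assuming (hypothetically) that the class's training tuples are held in a list of dicts with one categorical and one continuous attribute.

import math

# Hypothetical training tuples belonging to one class Ci.
class_tuples = [
    {"income": "medium", "age_years": 34.0},
    {"income": "low",    "age_years": 41.0},
    {"income": "medium", "age_years": 29.0},
]

def categorical_likelihood(attr, value):
    # P(xk|Ci) for a categorical Ak: tuples of Ci with value xk, divided by |Ci,D|.
    count = sum(1 for t in class_tuples if t[attr] == value)
    return count / len(class_tuples)

def gaussian_likelihood(attr, value):
    # P(xk|Ci) for a continuous Ak: g(xk, mu_Ci, sigma_Ci), with mu and sigma
    # estimated from the class's own tuples.
    xs = [t[attr] for t in class_tuples]
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(categorical_likelihood("income", "medium"))  # 2/3
print(gaussian_likelihood("age_years", 30.0))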
How to Predict a Class Label Using Naïve Bayesian Classification?
Given class-labeled training tuples from the
AllElectronics customer database.
The data tuples are described by the
attributes age, income, student, and
credit_rating.
Naïve Bayes Classifier: Training Dataset
Class: C1: buys_computer = 'yes' & C2: buys_computer = 'no'

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no

P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
       P(buys_computer = “no”) = 5/14 = 0.357
Naïve Bayes Classifier: An Example
The tuple we wish to classify is
X = (age = youth (i.e., <=30), income = medium, student = yes, credit_rating = fair)
Compute P(X|Ci) for each class:
P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “youth” | buys_computer = “no”) = 3/5 = 0.600
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.400
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.200
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.400
X = (age = “youth”, income = “medium”, student = “yes”, credit_rating = “fair”)
P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | buys_computer = “yes”) P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
P(X | buys_computer = “no”) P(buys_computer = “no”) = 0.019 × 0.357 = 0.007
The condition P(X | C1)P(C1) > P(X | C2)P(C2) is satisfied, therefore the
naïve Bayesian classifier predicts buys_computer = “yes” for tuple X.
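Below is a minimal Python sketch (not part of the original slides) that re-derives these numbers directly from the AllElectronics training table using the counting formulas above.

from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]

def classify(x):
    class_counts = Counter(row[-1] for row in data)   # 9 'yes', 5 'no'
    scores = {}
    for c, n_c in class_counts.items():
        rows_c = [row for row in data if row[-1] == c]
        score = n_c / len(data)                        # prior P(Ci)
        for k, value in enumerate(x):                  # product of P(xk|Ci)
            count = sum(1 for row in rows_c if row[k] == value)
            score *= count / n_c
        scores[c] = score
    return scores, max(scores, key=scores.get)

# X = (age = youth/<=30, income = medium, student = yes, credit_rating = fair)
scores, label = classify(("<=30", "medium", "yes", "fair"))
print(scores)  # approximately {'no': 0.007, 'yes': 0.028}
print(label)   # 'yes'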
Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional
probability to be non-zero; otherwise, the predicted
probability will be zero:
P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)
Ex. Suppose a dataset with 1000 tuples: income = low
(0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their
“uncorrected” counterparts
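Below is a minimal Python sketch of the Laplacian correction applied to the income example above.

counts = {"low": 0, "medium": 990, "high": 10}  # income counts in the 1000-tuple dataset
total = sum(counts.values())                    # 1000

uncorrected = {v: c / total for v, c in counts.items()}
# Laplacian correction: add 1 to each of the 3 value counts,
# so the denominator grows by 3 (1000 + 3 = 1003).
corrected = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}

print(uncorrected["low"])   # 0.0 -> would zero out the whole product P(X|Ci)
print(corrected["low"])     # 1/1003
print(corrected["medium"])  # 991/1003
print(corrected["high"])    # 11/1003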
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patient data: Profile (age, family history, etc.),
Symptoms (fever, cough, etc.), Disease (lung cancer,
diabetes, etc.)
Dependencies among these cannot be modeled by the naïve
Bayes classifier
Thank you….