Bayesian Theory & Naïve Bayes Classifiers
Course 4232: Machine Learning
Dept. of Computer Science
Faculty of Science and Technology
Lecturer No: Week No: 03 Semester: Summer 21-22
Instructor: Dr. M M Manjurul Islam
Bayesian Classifier
A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A basic Bayesian classifier, the naïve Bayesian classifier, has
performance comparable to decision tree and selected neural network
classifiers
Standard: Even when Bayesian methods are computationally intractable,
they can provide a standard of optimal decision making against which
other methods can be measured
Bayes’ Theorem: Basics
Total Probability Theorem: P(B) = Σ_{i=1}^{M} P(B|Ai) P(Ai)
Bayes’ Theorem: P(H|X) = P(X|H) P(H) / P(X)
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., the posterior probability): the probability that
the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a hypothesis H, P(H|X),
follows the Bayes’ theorem
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be viewed as
posterior = likelihood × prior / evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among
all the P(Ck|X) for all the k classes
Practical difficulty: It requires initial knowledge of many probabilities,
involving significant computational cost
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class labels, and each
tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the maximal
P(Ci|X)
This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only the numerator P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e.,
no dependence relation between attributes):
P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
This greatly reduces the computation cost: Only counts the class
distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for
Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
and P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are estimated from the tuples of class Ci
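As an illustration of how the continuous case could be evaluated, here is a minimal Python sketch; the attribute value, mean, and standard deviation below are assumed purely for illustration and are not taken from the lecture:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for a continuous attribute."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Assumed per-class mean and standard deviation for one continuous
# attribute (e.g., age), estimated from the class-Ci tuples in D.
mu_ci, sigma_ci = 38.0, 7.5
p_xk_given_ci = gaussian(35.0, mu_ci, sigma_ci)   # P(xk|Ci) for xk = 35

# Under the naive assumption, P(X|Ci) is the product of such per-attribute terms.
print(round(p_xk_given_ci, 4))   # ~0.049 with the assumed parameters
```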
Bayesian Methods
Learning and classification methods based on probability
theory.
Bayes theorem plays a critical role in probabilistic learning
and classification.
Uses prior probability of each category given no information
about an item.
Categorization produces a posterior probability distribution
over the possible categories given a description of an item.
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. It is known
that the test returns a correct positive result in only 99% of the cases and a
correct negative result in only 95% of the cases. Furthermore, only 0.03 of
the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?
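One way to set up the calculation these questions ask for is sketched below, reading the stated figures as P(+ | cancer) = 0.99, P(− | no cancer) = 0.95 (hence P(+ | no cancer) = 0.05), and a prior P(cancer) = 0.03:

```python
# Bayes' theorem applied to the lab-test example.
p_cancer = 0.03               # prior: fraction of the population with the disease
p_pos_given_cancer = 0.99     # correct positive rate
p_pos_given_healthy = 0.05    # false positive rate = 1 - correct negative rate

# Evidence P(+) via the total probability theorem.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))       # ~0.38: probability of cancer given +
print(round(1 - p_cancer_given_pos, 3))   # ~0.62: no cancer is the more probable diagnosis
```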
Maximum A Posteriori
Based on Bayes’ theorem, we can compute the Maximum A
Posteriori (MAP) hypothesis for the data
We are interested in the best hypothesis for some space H given
observed training data D.
hMAP = argmax_{h∈H} P(h|D)
     = argmax_{h∈H} P(D|h) P(h) / P(D)
     = argmax_{h∈H} P(D|h) P(h)
H: the set of all hypotheses.
Note that we can drop P(D) as the probability of the data is constant
(and independent of the hypothesis).
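A minimal sketch of MAP selection over a small hypothesis space; the three hypotheses, their priors, and their likelihoods are assumed purely for illustration:

```python
# Hypothetical hypothesis space H with assumed priors P(h) and likelihoods P(D|h).
hypotheses = {
    "h1": {"prior": 0.6, "likelihood": 0.10},
    "h2": {"prior": 0.3, "likelihood": 0.40},
    "h3": {"prior": 0.1, "likelihood": 0.90},
}

# h_MAP = argmax_h P(D|h) P(h); P(D) is dropped since it is constant over H.
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
print(h_map)   # "h2": 0.3 * 0.4 = 0.12 beats 0.06 (h1) and 0.09 (h3)
```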
Maximum Likelihood
Now assume that all hypotheses are equally probable a priori,
i.e., P(hi) = P(hj) for all hi, hj in H.
This is called assuming a uniform prior. It simplifies computing
the posterior:
hML = argmax_{h∈H} P(D|h)
This hypothesis is called the maximum likelihood hypothesis.
Desirable Properties of Bayes Classifier
Incrementality: with each training example, the prior and the
likelihood can be updated dynamically: flexible and robust to
errors.
Combines prior knowledge and observed data: the prior probability of
a hypothesis is multiplied by the probability of the observed data
given that hypothesis (the likelihood)
Probabilistic hypothesis: outputs not only a classification, but a
probability distribution over all classes
Bayes Classifiers
Assumption: the training set consists of instances of the different classes cj,
each described as a conjunction of attribute values
Task: classify a new instance d, described by a tuple of attribute values,
into one of the classes cj ∈ C
Key idea: assign the most probable class cMAP using Bayes
Theorem.
cMAP = argmax_{cj∈C} P(cj | x1, x2, …, xn)
     = argmax_{cj∈C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
     = argmax_{cj∈C} P(x1, x2, …, xn | cj) P(cj)
Parameters estimation
P(cj)
Can be estimated from the frequency of classes in the training
examples.
P(x1,x2,…,xn|cj)
O(|X|^n · |C|) parameters
Could only be estimated if a very, very large number of training
examples was available.
Independence Assumption: attribute values are conditionally
independent given the target value: naïve Bayes.
P(x1, x2, …, xn | cj) = ∏_i P(xi | cj)
cNB = argmax_{cj∈C} P(cj) ∏_i P(xi | cj)
Properties
Estimating P(xi|cj) instead of P(x1, x2, …, xn|cj) greatly
reduces the number of parameters (and the data sparseness).
The learning step in Naïve Bayes consists of estimating P(cj)
and P(xi|cj) based on the frequencies in the training data.
An unseen instance is classified by computing the class that
maximizes the posterior.
When conditional independence is satisfied, Naïve Bayes
corresponds to MAP classification.
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Naïve Bayes Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
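The figures above can be reproduced with a short Python sketch that counts the frequencies directly from the 14 training tuples (a sketch for checking the arithmetic, not code from the lecture):

```python
from collections import Counter, defaultdict

# The 14 training tuples: (age, income, student, credit_rating, buys_computer).
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

class_counts = Counter(row[-1] for row in data)      # numerators for P(Ci)
cond_counts = defaultdict(Counter)                   # numerators for P(xk|Ci)
for row in data:
    c = row[-1]
    for a, v in zip(attrs, row[:-1]):
        cond_counts[(a, c)][v] += 1

X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

for c in class_counts:                               # P(X|Ci) * P(Ci) for each class
    score = class_counts[c] / len(data)
    for a in attrs:
        score *= cond_counts[(a, c)][X[a]] / class_counts[c]
    print(c, round(score, 3))   # yes: ~0.028, no: ~0.007 -> predict buys_computer = "yes"
```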
What Is a Discriminant Function?
For a classification problem, we define for each class a function
gi(x), i = 1, …, K
such that we choose Ci if
gi(x) = max_k gk(x)
Discriminant Functions
Choose Ci if gi(x) = max_k gk(x), with discriminants gi(x), i = 1, …, K
The discriminant can be taken as
gi(x) = P(Ci | x), or
gi(x) = p(x | Ci) P(Ci)
This defines K decision regions R1, …, RK, where
Ri = { x | gi(x) = max_k gk(x) }
K=2 Classes
Dichotomizer (K=2) vs Polychotomizer (K>2)
g(x) = g1(x) – g2(x)
Choose C1 if g(x) > 0, otherwise choose C2
Log odds: log [ P(C1 | x) / P(C2 | x) ]
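A small sketch of the two-class decision; the posterior values are assumed only to illustrate the sign test on g(x) and the equivalent log-odds test:

```python
import math

# Assumed posteriors for a single instance x (illustrative values only).
p_c1_given_x = 0.7
p_c2_given_x = 0.3

# Dichotomizer: g(x) = g1(x) - g2(x) with gi(x) = P(Ci|x).
g = p_c1_given_x - p_c2_given_x
choice = "C1" if g > 0 else "C2"

# Equivalent test on the log odds: choose C1 if log P(C1|x)/P(C2|x) > 0.
log_odds = math.log(p_c1_given_x / p_c2_given_x)
print(choice, round(log_odds, 3))   # C1, ~0.847
```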
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., in hospital data, a patient’s profile (age, family history, etc.),
symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
are not independent; such dependencies cannot be modeled by a Naïve Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks (Chapter
9)
Textbook/ Reference Materials
Introduction to Machine Learning by Ethem Alpaydin
Machine Learning: An Algorithmic Perspective by
Stephen Marsland
Pattern Recognition and Machine Learning by
Christopher M. Bishop