BAYESIAN LEARNING

MODULE: 4

Introduction
• Bayesian reasoning provides a probabilistic approach to learning and inference.

• It is based on the assumption that the quantities of interest are governed by probability distributions, and that optimal decisions can be made by reasoning about these probabilities together with observed data.

Usefulness of Bayesian Learning

Bayesian learning methods are relevant to our study of machine learning for two different reasons:

• First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.

• They provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.

Features of Bayesian learning
1. Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
   – This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.

2. Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting
   (1) a prior probability for each candidate hypothesis, and
   (2) a probability distribution over the observed data for each possible hypothesis.

Features of Bayesian learning
3. Bayesian methods can accommodate hypotheses that make probabilistic predictions (e.g., hypotheses such as "this pneumonia patient has a 93% chance of complete recovery").

4. New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.

5. Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.

Bayes Theorem in Machine Learning
• In machine learning we are often interested in determining the best
hypothesis from some space H, given the observed training data D.

• One way to specify what we mean by the best hypothesis is to say that we
demand the most probable hypothesis, given the data D plus any initial
knowledge about the prior probabilities of the various hypotheses in H.

• Bayes theorem provides a direct method for calculating such probabilities.

Prior probability
• We shall write P(h) to denote the initial probability that hypothesis h holds,
before we have observed the training data.

• P(h) is often called the prior probability of h and may reflect any
background knowledge we have about the chance that h is a correct
hypothesis.

• If we have no such prior knowledge, then we might simply assign the same prior probability to each candidate hypothesis.

• Similarly, we will write P(D) to denote the prior probability that training data
D will be observed (i.e., the probability of D given no knowledge about which
hypothesis holds).

Probability
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.

• In machine learning problems, we are interested in the probability P(h|D) that h holds given the observed training data D.

• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.

• Notice the posterior probability P(h|D) reflects the influence of the training data D, in contrast to the prior probability P(h), which is independent of D.
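
These quantities are tied together by Bayes theorem, which gives the posterior in terms of the likelihood and the prior:

$$ P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} $$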

Example: let h be the hypothesis that a customer will buy a computer, and let D describe the customer (35 years old, fair credit rating, income of $40,000).

▪ P(h|D): Probability that the customer will buy a computer given that we know his age, credit rating and income. (Posterior probability of h)

▪ P(h): Probability that the customer will buy a computer regardless of age, credit rating and income. (Prior probability of h)

▪ P(D|h): Probability that the customer is 35 years old, has a fair credit rating and earns $40,000, given that he has bought our computer. (Likelihood of D given h)

▪ P(D): Probability that a person from our set of customers is 35 years old, has a fair credit rating and earns $40,000. (Prior probability of D)

Maximum a posteriori (MAP) hypothesis
• In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D (or at least one of the maximally probable hypotheses, if there are several).

• Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.

• We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

• P(D) is a constant independent of h, so it can be dropped when maximizing the posterior over hypotheses.

• P(D|h) is called the likelihood of D given h, and any hypothesis that maximizes P(D|h) is a maximum likelihood (ML) hypothesis.
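
In symbols, the MAP and maximum likelihood hypotheses are given by the standard definitions:

$$ h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h) $$

$$ h_{ML} \equiv \arg\max_{h \in H} P(D \mid h) $$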
Example of the Bayes rule
• Consider a medical diagnosis problem in which there are two alternative hypotheses:
  (1) that the patient has a particular form of cancer, and
  (2) that the patient does not.

• The available data come from a particular laboratory test with two possible outcomes: + (positive) and - (negative).

• We have prior knowledge that over the entire population only .008 of people have this disease.

• Furthermore, the lab test is only an imperfect indicator of the disease: it returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result.
Computing probabilities
• The situation can be summarized by the following probabilities:
– P(cancer) = .008, P(¬cancer) =.992
– P(+|cancer) = .98, P(-|cancer) = .02
– P(+|¬cancer) = .03, P(-|¬cancer) = .97
• Suppose we now observe a new patient for whom the lab test returns a
positive result.
• Should we diagnose the patient as having cancer or not?
• The maximum a posteriori hypothesis can be found by comparing P(D|h)P(h) for each hypothesis:
  – P(+|cancer)P(cancer) = (.98)(.008) = .0078
  – P(+|¬cancer)P(¬cancer) = (.03)(.992) = .0298
• Thus, h_MAP = ¬cancer.
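
A minimal Python sketch of this calculation (the variable names are mine, not from the slides). Normalizing by P(+) also gives the exact posterior P(cancer|+) ≈ 0.21, so even a positive test leaves ¬cancer the more probable hypothesis:

```python
# Priors and test characteristics from the example
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# Unnormalized posteriors P(+|h) * P(h) for a positive test result
score_cancer = p_pos_given_cancer * p_cancer              # 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# MAP hypothesis: the larger unnormalized posterior wins
h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"

# Dividing by P(+) = sum of both scores gives the exact posterior
p_cancer_given_pos = score_cancer / (score_cancer + score_not_cancer)

print(h_map)                          # not cancer
print(round(p_cancer_given_pos, 3))   # 0.209
```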
Naive Bayes Classifier

Naïve Bayes Classifier
Along with decision trees, neural networks, and nearest-neighbor methods, it is one of the most practical learning methods.

When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally
independent given classification

Successful applications:
• Diagnosis
• Classifying text documents
Naïve Bayes Classifier
• What can we do if our data d has several attributes?
• Naïve Bayes assumption: Attributes that describe data instances are
conditionally independent given the classification hypothesis

$$ P(d \mid h) = P(a_1, \ldots, a_T \mid h) = \prod_{t=1}^{T} P(a_t \mid h) $$
– it is a simplifying assumption; obviously it may be violated in reality
– in spite of that, it works well in practice

• The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier.
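
Combining this assumption with the MAP decision rule gives the usual Naïve Bayes prediction:

$$ v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{t=1}^{T} P(a_t \mid v_j) $$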

An Illustrative Example of NBC

Example 2:

Given the data for symptoms and whether the patient has flu or not, classify the following test case:
x = (chills = Y, runny nose = N, headache = Mild, fever = Y)

■ P(Flu = Y) = 5/8
■ P(Flu = N) = 3/8
■ P(chills = Y | Y) = 3/5
■ P(chills = Y | N) = 1/3
■ P(runny nose = N | Y) = 1/5
■ P(runny nose = N | N) = 2/3
■ P(headache = Mild | Y) = 2/5
■ P(headache = Mild | N) = 1/3
■ P(fever = Y | Y) = 4/5
■ P(fever = Y | N) = 1/3

■ P(Yes|x) ∝ [P(chills=Y|Y) P(runny nose=N|Y) P(headache=Mild|Y) P(fever=Y|Y)] P(Flu=Y)
  = 3/5 * 1/5 * 2/5 * 4/5 * 5/8 = 0.024 (maximum value, hence the predicted class)

■ P(No|x) ∝ [P(chills=Y|N) P(runny nose=N|N) P(headache=Mild|N) P(fever=Y|N)] P(Flu=N)
  = 1/3 * 2/3 * 1/3 * 1/3 * 3/8 ≈ 0.0093
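
A small Python sketch of this computation (the dictionary layout and names are my own, not taken from the slides):

```python
# Class priors and per-attribute conditional probabilities from the example
priors = {"Yes": 5/8, "No": 3/8}
cond = {
    "Yes": {"chills=Y": 3/5, "runny nose=N": 1/5, "headache=Mild": 2/5, "fever=Y": 4/5},
    "No":  {"chills=Y": 1/3, "runny nose=N": 2/3, "headache=Mild": 1/3, "fever=Y": 1/3},
}

x = ["chills=Y", "runny nose=N", "headache=Mild", "fever=Y"]

# Unnormalized posterior for each class: prior times product of conditionals
scores = {}
for label in priors:
    score = priors[label]
    for attribute_value in x:
        score *= cond[label][attribute_value]
    scores[label] = score

print(scores)                        # {'Yes': 0.024, 'No': ~0.0093}
print(max(scores, key=scores.get))   # Yes
```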
Zero probability error
❑ There is a chance that the probability of a hypothesis becomes zero because one of the conditional probabilities in the product has a zero count in the numerator.

❑ When this zero is multiplied by the other probabilities, the final probability becomes zero.

❑ This can be avoided by applying a smoothing technique called Laplace correction: if there are zero instances of a particular feature value, just add one to the count, which will not make much of a difference (a standard form of this correction is shown below).

❑ Ex: if a probability would be 0/400, it can be changed to 1/400, which will not make much of a difference.
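
One common way to write this add-one (Laplace) correction for an attribute with k possible values is:

$$ P(a_i \mid v_j) = \frac{n_c + 1}{n + k} $$

where n is the number of training examples belonging to class vj and nc is the number of those examples having attribute value ai. The simpler version above (0/400 → 1/400) adjusts only the numerator.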
Applications of Naïve Bayes Algorithms

■ Naïve Bayes is fast and thus can be used for making real-time predictions
■ It can predict the probabilities of multiple classes of the target variable
■ It can be used for text classification, spam filtering and sentiment analysis
■ Naïve Bayes and collaborative filtering together can help build recommendation systems that filter unseen information and predict whether a user would like a given resource or not
Bayes Optimal Classifier
• So far we have considered the question "what is the most probable hypothesis given the training data?"

• In fact, the question that is often of most significance is the closely related question:
  – what is the most probable classification of the new instance given the training data?

• Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.
Bayes Optimal Classifier
• To develop some intuitions, consider a hypothesis space containing three hypotheses, h1, h2, and h3.

• Suppose that the posterior probabilities of these hypotheses given the training data are 0.4, 0.3, and 0.3 respectively.

• Thus, h1 is the MAP hypothesis.

• Suppose a new instance x is encountered, which is classified positive by h1, but negative by h2 and h3.

• Taking all hypotheses into account, the probability that x is positive is 0.4 (the probability associated with h1), and the probability that it is negative is therefore 0.6.

• The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.
Bayes Optimal Classifier
• In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

• If the possible classification of the new example can take on any value vj from some set V, then the probability P(vj|D) that the correct classification for the new instance is vj is just:

$$ P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) $$

• The optimal classification of the new instance is the value vj for which P(vj|D) is maximum:

$$ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) $$
Bayes Optimal Classifier Example
Bayes optimal classification:

$$ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) $$

Example:
P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(-|h2) = 1,  P(+|h2) = 0
P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0

therefore

$$ \sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = .4 \qquad \sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = .6 $$

and

$$ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = - $$
Bayes Optimal Classifier Example
Bayes optimal classification:

$$ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) $$

Example:
P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(-|h2) = 0,  P(+|h2) = 1
P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0

therefore

$$ \sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = .7 \qquad \sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = .3 $$

and

$$ \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = + $$
Gaussian Naïve Bayes for Continuous Attributes

The normal (Gaussian) distribution formula is used to compute the class-conditional probability of each continuous attribute.
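
For a continuous attribute value x in a class with mean x̄ and standard deviation S, the density used is the familiar normal pdf:

$$ P(x \mid c) = \frac{1}{\sqrt{2\pi}\, S} \exp\!\left( -\frac{(x - \bar{x})^2}{2 S^2} \right) $$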
Gaussian Naïve Bayes for Continuous Attributes
We calculate the mean (x̄) and standard deviation (S) of the attributes PSA and Age for the Cancer class.

Gaussian Naïve Bayes for Continuous Attributes
We calculate the mean (x̄) and standard deviation (S) of the attributes PSA and Age for the Healthy class.
Gaussian Naïve Bayes for Continuous Attributes

Test case: predict the class for a new patient, given that patient's PSA and Age values.

The predicted class is Cancer.
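
A minimal Python sketch of the Gaussian Naïve Bayes computation; the class statistics and test values below are illustrative assumptions for demonstration, not the figures from this example:

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density P(x | class) for one continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Assumed per-class statistics for PSA and Age, plus class priors;
# replace with the means/standard deviations computed from the actual table.
stats = {
    "Cancer":  {"PSA": (6.0, 1.5), "Age": (68.0, 6.0), "prior": 0.5},
    "Healthy": {"PSA": (2.0, 1.0), "Age": (55.0, 8.0), "prior": 0.5},
}

test = {"PSA": 5.5, "Age": 70.0}  # assumed test-case values

scores = {}
for label, s in stats.items():
    score = s["prior"]
    for attr, value in test.items():
        mean, std = s[attr]
        score *= gaussian_pdf(value, mean, std)
    scores[label] = score

print(max(scores, key=scores.get))  # class with the largest score
```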
