
Lecture 7

Bayesian Models
Study area

Area: 144 sq km
Unit Cell Size: 1 sq km

What is the potential of each unit cell to have a gold deposit?

How do we estimate this potential???


Data sets: Input GIS layers
• Distance to fault layer (Px – Proximal; Ds – Distal)
• Arsenic content layer (H – High; L – Low)
• Mineral deposit layer


Training data
• Distance to fault layer (Px – Proximal; Ds – Distal)
• Arsenic content layer (H – High; L – Low)
• Deposit layer (D – Deposit; ND – Non-deposit)

Probabilistic approach: what is the class of pixel (1,1), given high arsenic content and proximity to the fault?
Probability & Statistics

[Diagram: sampling takes us from a Population (described by parameters) to Sample Data (described by statistics); inference runs in the opposite direction, from the sample statistics back to the population parameters.]

Statistical inference = generalizing from a sample to a population

Probability = likelihood of a sample belonging to a specific population
Probability & Statistics

A container holds red balls and blue balls. A probabilist starts by knowing the proportion of each (it is assumed to be given: perfect prior knowledge) and asks the probability of drawing a red ball. A statistician infers the proportion of red balls by sampling from the container.

Statistician’s approach: draw out a few balls randomly from the container and estimate the proportion of red to blue balls.

Probabilist’s approach: if the proportion of red balls to blue balls is 4:1, what is the probability that the ball I draw out randomly is red?

From the observations we compute statistics that we use to estimate population parameters, which index the probability density, from which we can compute the probability of a future observation from that population.
Statistical inference = generalizing from a sample to a population
Probability = Likelihood of a sample belonging to a specific population
Variables
• Quantities measured for a sample. May be
– Quantitative i.e. numerical
• Continuous (e.g. pH of a sample, radiance, magnetic
field, distance from a feature)
• Discrete (e.g. DN value on an image)
– Categorical
• Nominal (e.g. gender, land-use class)
• Ordinal (ranked, e.g. mild, moderate or severe; small or large; cool, warm and hot). Often ordinal variables are re-coded to be quantitative.

Frequency Distributions

• An (Empirical) Frequency Distribution or Histogram for a


continuous variable presents the counts of observations grouped
within pre-specified classes or groups

• A Relative Frequency Distribution presents the corresponding


proportions of observations within the classes

• A Barchart presents the frequencies for a categorical variable

Example – Uranium in groundwater
• Water samples taken from 36 locations in Powai as part of a study to
determine the natural variation of total dissolved solids in the area.

• The Uranium concentrations measured in (PPM/I) are as follows:

U in a study area groundwater samples

121 82 100 151 68 58


95 145 64 201 101 163
84 57 139 60 78 94
119 104 110 113 118 203
62 83 67 93 92 110
25 123 70 48 95 42

Frequency Distribution

[Histogram: counts of the uranium values grouped into classes of width 20 ppm, spanning 20–220 ppm.]

When idealized and fitted to mathematical functions, frequency distributions become probability distributions.
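As a quick added illustration (not part of the original slides), the frequency and relative frequency distribution of the 36 uranium values above can be reproduced in Python; the 20 ppm class width is an assumption read off the histogram axis.

```python
import numpy as np

# The 36 uranium concentrations (ppm) listed above
u = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
     84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
     62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]

# Class boundaries of width 20 ppm (assumed from the axis labels 20-220)
bins = np.arange(20, 240, 20)
counts, edges = np.histogram(u, bins=bins)
rel_freq = counts / len(u)          # relative frequency distribution

for lo, hi, c, rf in zip(edges[:-1], edges[1:], counts, rel_freq):
    print(f"{lo:3.0f}-{hi:3.0f} ppm: count = {c:2d}, rel. freq. = {rf:.3f}")
```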
Probability: the “frequentist” approach
• probability should be assessed in purely
objective terms
• no room for subjectivity on the part of individual
researchers
• knowledge about probabilities comes from the
relative frequency of a large number of trials
– this is a good model for coin tossing
– not so useful for predicting complex problems, where
many of the factors are unknown…e.g., stock market
Frequentist: "The probability of a coin landing heads is 50% because we
observed it in many trials."
Bayesian: "I believe the coin is fair based on prior evidence, and I’ll update
my belief as I see new data."
Probability: the Bayesian approach
• Bayes Theorem
– Thomas Bayes
– 18th century English clergyman

• concerned with integrating “prior knowledge” into


calculations of probability
• problematic for frequentists
– prior knowledge = bias, subjectivity…
Dealing with a ‘random phenomenon’
• a random phenomenon is a situation in which
we know what outcomes could happen, but we
don’t know which particular outcome did or will
happen.
• when dealing with probability, we will be
dealing with many random phenomena.
• examples: coin, cards, survey, experiments
Recall that…….
• probability of event = p
0 <= p <= 1
0 = certain non-occurrence
1 = certain occurrence

• .5 = even odds
• .1 = 1 chance out of 10
Probability

“something-has-to-happen rule”:
– The probability of the set of all possible outcomes of a
trial must be 1.
– P(S) = 1
(S represents set of all possible outcomes.)
CAUTION: are the outcomes equally likely?

Winning Lottery?? 50-50??


Rain Today?? Yes-No 50-50??

Just because there are two outcomes does not mean they are 50-50.
In a desert: P(Rain) nearly 1%

In a rainforest: P(Rain) nearly 90%

Mistake: Ignoring base rates (prior knowledge).


Example: Random Exploration
• The Lottery (also known as a tax on people who are
bad at math…)

• A certain lottery works by picking 6 numbers from 1 to


49. It costs $ 1.00 to play the lottery, and if you win,
you win $ 2 Million after taxes.

• If you play the lottery once, what are your expected


winnings or losses?
Lottery
Calculate the probability of winning in 1 try:

1 / C(49, 6) = 1 / (49! / (43! 6!)) = 1 / 13,983,816 ≈ 7.2 × 10^-8

(“49 choose 6” is the number of distinct combinations of 6 numbers out of 49.)

The probability function (note, it sums to 1.0):

x              p(x)
-$1            0.999999928
+$2 million    7.2 × 10^-8

Expected Value

x              p(x)
-$1            0.999999928
+$2 million    7.2 × 10^-8

E(X) = P(win) × $2,000,000 + P(lose) × (-$1.00)
     = 7.2 × 10^-8 × 2.0 × 10^6 + 0.999999928 × (-1) = 0.144 - 0.999999928 ≈ -$0.86

Negative expected value is never good!


You shouldn’t play if you expect to lose money!
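For reference, a small Python sketch (an added illustration) of the same expected-value calculation:

```python
from math import comb

# Probability of matching all 6 numbers out of 49 with one ticket
p_win = 1 / comb(49, 6)            # 1 / 13,983,816 ≈ 7.2e-8
p_lose = 1 - p_win

# Payoffs: win $2,000,000 after taxes, or lose the $1 ticket price
expected_value = p_win * 2_000_000 + p_lose * (-1)

print(f"P(win) = {p_win:.2e}")
print(f"E(X)   = ${expected_value:.2f} per ticket")   # ≈ -$0.86
```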
Subjective probability
• we use the language of probability in everyday speech to
express a degree of uncertainty without basing it on long-
run relative frequencies.
• such probabilities are called subjective or personal
probabilities.
• personal probabilities don’t display the kind of consistency
that we will need probabilities to have, so we’ll stick with
formally defined probabilities.

I’m 90% sure Tesla will drop tomorrow—Elon’s tweet seemed off
Rules of probability:
addition rule
Definition: events that have no outcomes in common (and,
thus, cannot occur together) are called mutually exclusive.

For two mutually exclusive events A and B, the probability


that one or the other occurs is the sum of the probabilities
of the two events.

P(A or B) = P(A) + P(B), provided that


A and B are mutually exclusive.
Rules of probability:
the general addition rule
For any two events A and B,
P(A or B) = P(A) + P(B) – P(A and B).
Rules of probability:
multiplication rule
– for two independent events A and B, the probability that
both A and B occur is the product of the probabilities of
the two events.
– P(A & B) = P(A) x P(B), provided that A and B are
independent.
Independent events
• one event has no influence on the outcome of
another event
• if P(A&B) = P(A) x P(B)
then events A & B are independent
• coin flipping
if P(H) = P(T) = 0.5 then
P(HTHTH) = P(HHHHH) =
0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.5^5 = 0.03125
independent ≠ mutually exclusive
• mutually exclusive events cannot be independent. Well,
why not?
• since we know that mutually exclusive events have no
outcomes in common, knowing that one occurred
means the other didn’t.
• thus, the probability of the second occurring changed
based on our knowledge that the first occurred.
• it follows, then, that the two events are not independent.
Conditional probability
• the probability of one event occurring, given that another event has occurred

• P(A|B)=Prob of A, given B
Conditional probability (cont.)

• P(B|A) = P(A&B)/P(A)
Independence….???
With notation for conditional probabilities, we can now
formalize the definition of independence
• events A and B are independent whenever
P(B|A) = P(B)

{if A and B are independent, then
P(B|A) = P(A&B)/P(A)
       = P(A)×P(B)/P(A)
       = P(B) }
so, the general multiplication rule
– For any two events A and B,
P(A & B) = P(A) x P(B|A) or
P(A & B) = P(B) x P(A|B)
Bayes’ Rule

P(h | d) = P(d | h) P(h) / [ P(d | h) P(h) + P(d | ~h) P(~h) ]
Does the patient have cancer or not?
• A patient takes a lab test and the result comes back positive. It is
known that the test returns a correct positive result in 98% of the
cases and a correct negative result in 97% of the cases.
Furthermore, only 0.008 of the entire population has this disease.

1. What is the probability that this patient has cancer?


2. What is the probability that he does not have cancer?
3. What is the diagnosis?
P(+ | ¬cancer) = 1 – P(– | ¬cancer) = 1 – 0.97 = 0.03

P(cancer | +) = P(+ | cancer) P(cancer) / [P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)]
             = (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992) = 0.208511

P(¬cancer | +) = (0.03 × 0.992) / (0.98 × 0.008 + 0.03 × 0.992) = 0.791489
Choosing Hypotheses
• Maximum Likelihood hypothesis:
  hML = argmax over h ∈ H of P(d | h)

• Generally we want the most probable hypothesis given the training data. This is the Maximum A Posteriori (MAP) hypothesis:
  hMAP = argmax over h ∈ H of P(h | d) = argmax over h ∈ H of P(d | h) P(h)
  – Useful observation: it does not depend on the denominator P(d)
Now we compute the diagnosis
– To find the Maximum Likelihood hypothesis, we evaluate P(d|h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it:
  P(+ | cancer) = 0.98
  P(+ | ¬cancer) = 0.03
  ⇒ Diagnosis: hML = cancer (since 0.98 > 0.03)
– To find the Maximum A Posteriori hypothesis, we evaluate P(d|h)P(h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis with the higher posterior probability.
  P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078 (posterior ≈ 0.21)
  P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298 (posterior ≈ 0.79)
  ⇒ Diagnosis: hMAP = ¬cancer
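A minimal Python sketch (added here, not from the lecture) of the same Bayes-rule calculation; it reproduces the posterior values 0.2085 and 0.7915 quoted above:

```python
# Given quantities from the lab-test example
p_cancer = 0.008               # prior P(cancer)
p_pos_given_cancer = 0.98      # P(+ | cancer)
p_pos_given_no_cancer = 0.03   # P(+ | ~cancer) = 1 - 0.97

# Unnormalised MAP scores P(d|h) P(h) for the positive test d = +
score_cancer = p_pos_given_cancer * p_cancer               # 0.00784
score_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)   # 0.02976

# Normalising gives the posterior probabilities
evidence = score_cancer + score_no_cancer
print("P(cancer | +)  =", score_cancer / evidence)      # ≈ 0.2085
print("P(~cancer | +) =", score_no_cancer / evidence)   # ≈ 0.7915
print("h_MAP =", "cancer" if score_cancer > score_no_cancer else "no cancer")
```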
Weights of evidence model

[Map: a 10 km × 10 km study area (S) containing the target deposits and two geological features, B1 and B2.]
Objective: To estimate the probability of occurrence of D in each unit cell of the study area

Approach: Use BAYES’ THEOREM for updating the prior probability of the occurrence of
mineral deposit to posterior probability based on the conditional probabilities (or
weights of evidence) of the geological features.
Weights of evidence model
Step 1: Calculation of prior probability
[Map: the 10 km × 10 km study area (S) divided into 1 km × 1 km unit cells, with the target deposits marked.]
• The probability of the occurrence of the targeted mineral deposit type when
no other geological information about the area is available or considered.

Total study area = Area (S) = 10 km x 10 km = 100 sq km = 100 unit cells


Area where deposits are present = Area (D) = 10 unit cells
Prior Probability of occurrence of deposits = P {D} = Area(D)/Area(S)= 10/100 = 0.1
Prior odds of occurrence of deposits = P{D}/(1-P{D}) = 0.1/0.9 = 0.11
Weights of Evidence
Step 3: Calculation of weights of evidence

Bayes’ equation (updating the inference from an observation):

P(D|B) = P(D & B) / P(B) = P(D) × P(B|D) / P(B)

P(D|~B) = P(D & ~B) / P(~B) = P(D) × P(~B|D) / P(~B)

Converting the probabilities into odds and taking logarithms gives the weights of evidence:

W+ = log[ P(B|D) / P(B|~D) ]        W- = log[ P(~B|D) / P(~B|~D) ]
Step 3: Calculation of weights of evidence

W+ = log[ P(B|D) / P(B|~D) ]        W- = log[ P(~B|D) / P(~B|~D) ]

[Venn sketch: the geological feature B1 overlapping the deposits D within the study area S.]

The probabilities are estimated from unit-cell counts:

P(D) = n(D) / n(S)

P(B|D) = n(B ∩ D) / n(D)

P(B|~D) = n(B ∩ ~D) / n(~D)

P(~B|D) = n(~B ∩ D) / n(D) = [n(D) − n(D ∩ B)] / n(D)

P(~B|~D) = n(~B ∩ ~D) / n(~D) = [n(S) − n(B) − n(D) + n(B ∩ D)] / n(~D)
Exercise

[Map: the 10 km × 10 km study area (S) showing the deposits and the geological features B1 and B2. Unit cell size = 1 sq km and each deposit occupies 1 unit cell.]

Calculate the weights of evidence (W+ and W-) and the Contrast values for B1 and B2, using:

P(B|D) = n(B ∩ D) / n(D)

P(B|~D) = n(B ∩ ~D) / n(~D)

P(~B|D) = n(~B ∩ D) / n(D) = [n(D) − n(D ∩ B)] / n(D)

P(~B|~D) = n(~B ∩ ~D) / n(~D) = [n(S) − n(B) − n(D) + n(B ∩ D)] / n(~D)

W+ = log[ P(B|D) / P(B|~D) ]        W- = log[ P(~B|D) / P(~B|~D) ]
Exercise (solution)

[Same map: unit cell size = 1 sq km and each deposit occupies 1 unit cell.]

For B1:

P(B|D) = n(B ∩ D) / n(D) = 4/10

P(B|~D) = n(B ∩ ~D) / n(~D) = 12/90

P(~B|D) = [n(D) − n(D ∩ B)] / n(D) = 6/10

P(~B|~D) = [n(S) − n(B) − n(D) + n(B ∩ D)] / n(~D) = 78/90

W+(B1) = 1.0988    W-(B1) = -0.3678
W+(B2) = 0.2050    W-(B2) = -0.0763
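A small Python sketch (an added illustration, not the lecture's code) that reproduces the B1 weights from the counts above; n_B = 16 follows from the given fractions 4/10 and 12/90, and natural logarithms are assumed, consistent with the Loge odds used in the next step:

```python
import math

# Counts for feature B1 from the exercise; the B2 counts can be read off the
# map in the same way and passed to the same function.
n_S = 100    # unit cells in the study area
n_D = 10     # cells containing a deposit
n_B = 16     # cells covered by B1 (4 with deposits + 12 without)
n_BD = 4     # cells where B1 and a deposit coincide

def weights_of_evidence(n_S, n_D, n_B, n_BD):
    p_B_given_D = n_BD / n_D                                     # P(B | D)
    p_B_given_notD = (n_B - n_BD) / (n_S - n_D)                  # P(B | ~D)
    p_notB_given_D = (n_D - n_BD) / n_D                          # P(~B | D)
    p_notB_given_notD = (n_S - n_B - n_D + n_BD) / (n_S - n_D)   # P(~B | ~D)
    w_plus = math.log(p_B_given_D / p_B_given_notD)
    w_minus = math.log(p_notB_given_D / p_notB_given_notD)
    return w_plus, w_minus

w_plus, w_minus = weights_of_evidence(n_S, n_D, n_B, n_BD)
print(f"W+ = {w_plus:.4f}, W- = {w_minus:.4f}, Contrast = {w_plus - w_minus:.4f}")
# W+ ≈ 1.0986, W- ≈ -0.3677, matching the B1 values above to rounding
```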
Step 3 Calculation of weights of evidence

Contrast (C) measures the net strength of spatial association between the
geological feature and mineral deposits

Contrast = W+ – W-

Positive contrast – net positive spatial association
Negative contrast – net negative spatial association
Zero contrast – no spatial association

Can be used to test spatial associations


Step 4: Combining weights of evidence
Assuming conditional independence of the evidence layers – naïve!

Loge[O(D|B1, B2, B3, …)] = Loge O(D) + W+/-(B1) + W+/-(B2) + W+/-(B3) + …

O(D|B1, B2, B3, …) = Exp( Loge[O(D|B1, B2, B3, …)] )

P(D|B1, B2, B3, …) = O(D|B1, B2, B3, …) / [1 + O(D|B1, B2, B3, …)]
Combining Weights of Evidence: Posterior Probability

Loge(O{D|B1, B2}) = Loge(O{D}) + W+/-(B1) + W+/-(B2)

Loge(O{D}) = Loge(0.11) = -2.2073

[Map: the study area with features B1 and B2; prior probability = 0.10, prior odds = 0.11.]

Calculate the posterior probability given:
1. Presence of B1 and B2;
2. Presence of B1 and absence of B2;
3. Absence of B1 and presence of B2;
4. Absence of both B1 and B2
Loge(O{D|B1, B2}) = Loge(O{D}) + W+/-(B1) + W+/-(B2)

Loge(O{D}) = Loge(0.11) = -2.2073

For the areas where both B1 and B2 are present:
Loge(O{D|B1, B2}) = -2.2073 + 1.0988 + 0.2050 = -0.9035
O{D|B1, B2} = Antiloge(-0.9035) = 0.4052
P = O/(1+O) = 0.4052/1.4052 = 0.2883

For the areas where B1 is present but B2 is absent:
Loge(O{D|B1, B2}) = -2.2073 + 1.0988 - 0.0763 = -1.1848
O{D|B1, B2} = Antiloge(-1.1848) = 0.3058
P = O/(1+O) = 0.3058/1.3058 = 0.2342

For the areas where B1 is absent but B2 is present:
Loge(O{D|B1, B2}) = -2.2073 - 0.3678 + 0.2050 = -2.3701
O{D|B1, B2} = Antiloge(-2.3701) = 0.0934
P = O/(1+O) = 0.0934/1.0934 = 0.0854

For the areas where both B1 and B2 are absent:
Loge(O{D|B1, B2}) = -2.2073 - 0.3678 - 0.0763 = -2.6514
O{D|B1, B2} = Antiloge(-2.6514) = 0.0705
P = O/(1+O) = 0.0705/1.0705 = 0.0658

[Prospectivity map: posterior probability by zone]
Both B1 and B2 present: 0.2883      B2 only: 0.0854
B1 only: 0.2342                     Neither present: 0.0658
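A short Python sketch (an added illustration, not from the slides) that combines the prior log-odds with the weights quoted above for the four presence/absence combinations and converts the result back to a posterior probability:

```python
import math

log_prior_odds = math.log(0.11)                        # Loge(O{D}) = -2.2073
W_B1 = {"present": 1.0988, "absent": -0.3678}
W_B2 = {"present": 0.2050, "absent": -0.0763}

for b1 in ("present", "absent"):
    for b2 in ("present", "absent"):
        log_odds = log_prior_odds + W_B1[b1] + W_B2[b2]
        odds = math.exp(log_odds)                      # Antiloge of the log-odds
        posterior = odds / (1 + odds)                  # P = O / (1 + O)
        print(f"B1 {b1:7}, B2 {b2:7}: posterior P(D) = {posterior:.4f}")
```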
Bayesian Networks
& Classifiers

Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: model P(C|X) directly, where C = c1, …, cL and X = (X1, …, Xn)

[Diagram: a discriminative probabilistic classifier takes an input x = (x1, x2, …, xn) and outputs P(c1|x), P(c2|x), …, P(cL|x).]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
  – Generative model: model P(X|C) for each class, where C = c1, …, cL and X = (X1, …, Xn)

[Diagram: one generative probabilistic model per class; the model for class i takes x = (x1, x2, …, xn) and outputs P(x|ci).]
The Joint Probability Distribution
• Joint probabilities can be defined over any number of variables, e.g. P(A = true, B = true, C = true)
• For each combination of variable values, we need to say how probable that combination is
• The probabilities of all these combinations must sum to 1

A      B      C      P(A,B,C)
false  false  false  0.1
false  false  true   0.2
false  true   false  0.05
false  true   true   0.05
true   false  false  0.3
true   false  true   0.1
true   true   false  0.05
true   true   true   0.15

• Once you have the joint probability distribution, you can calculate any probability involving A, B, and C. Examples:
  • P(A=true) = sum of P(A,B,C) in rows with A=true
  • P(A=true, B=true | C=true) = P(A=true, B=true, C=true) / P(C=true)
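As an added illustration, such queries can be computed directly from the joint table with a few lines of Python:

```python
# Joint distribution P(A, B, C) from the table above, keyed by (A, B, C)
joint = {
    (False, False, False): 0.10, (False, False, True): 0.20,
    (False, True,  False): 0.05, (False, True,  True): 0.05,
    (True,  False, False): 0.30, (True,  False, True): 0.10,
    (True,  True,  False): 0.05, (True,  True,  True): 0.15,
}

# Marginal: P(A=true) = sum of the rows with A=true
p_a_true = sum(p for (a, b, c), p in joint.items() if a)
print("P(A=true) =", p_a_true)                                    # 0.6

# Conditional: P(A=true, B=true | C=true) = P(A=true, B=true, C=true) / P(C=true)
p_c_true = sum(p for (a, b, c), p in joint.items() if c)
print("P(A=true, B=true | C=true) =", joint[(True, True, True)] / p_c_true)   # 0.3
```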
The Problem with the Joint Distribution
• Lots of entries in the table to fill up! (See the joint distribution table above.)
• For k Boolean random variables, you need a table of size 2^k
• How do we use fewer numbers? We need the concept of independence
Independence
Variables A and B are independent if any of the following hold:
• P(A,B) = P(A) P(B)
• P(A | B) = P(A)
• P(B | A) = P(B)

This says that knowing the outcome of A does not tell me


anything new about the outcome of B.

• E.g., P(Au, As) = P(Au) × P(As)
  P(Au | As) = P(Au)
  P(As | Au) = P(As)
Independence
How is independence useful?
• Suppose you have n coin flips and you want to calculate the joint distribution P(C1, …, Cn)
• If the coin flips are not independent, you need 2^n values in the table
• If the coin flips are independent, then
  P(C1, …, Cn) = ∏(i = 1 to n) P(Ci)
  Each P(Ci) table has 2 entries and there are n of them, for a total of 2n values
Conditional Independence
Variables A and B are conditionally independent given C if any of the
following hold:
• P(A, B | C) = P(A | C) P(B | C)
• P(A | B, C) = P(A | C)
• P(B | A, C) = P(B | C)

Knowing C tells me everything about B.


I don’t gain anything by knowing A
(either because A doesn’t influence B
or because knowing C provides all the
information knowing A would give)

A Bayesian Network
Suppose there are four binary variables: A, B, C, D such that
• A is independent => P(A|B,C,D) = P(A) => A does not have parents

• B is dependent on A; independent of C and D => P(B|A,C,D) = P(B|A)


=> A is a parent of B or B is a child of A
• C and D are dependent on B => P(C|B,D) = P(C|B); P(D|B,C) = P(D|B)
=> B is the parent of C and D (or C and D are the children of B)

Objective: Estimate joint probability distribution of A,B,C,D


That is, P(A,B,C,D)
A Directed Acyclic Graph (DAG)

Each node in the graph is a random variable. An arrow indicates the direction of dependence, i.e. the parent-child relationship; e.g., the arrow from A to B indicates that A is a parent of B. Informally, an arrow from node X to node Y means X has a direct influence on Y.

[Diagram: DAG with edges A → B, B → C and B → D.]
A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph (DAG), here A → B, B → C, B → D
2. Parameters: a conditional probability distribution table for each node

A      P(A)
false  0.6
true   0.4

A      B      P(B|A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

B      D      P(D|B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95
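As an added illustration, the joint probability of any complete assignment can be computed from these tables by multiplying along the factorisation P(A,B,C,D) = P(A) P(B|A) P(C|B) P(D|B); a minimal Python sketch:

```python
# CPTs from the network above (probability of the variable being true)
P_A = 0.4
P_B_given_A = {True: 0.3, False: 0.99}    # P(B=true | A)
P_C_given_B = {True: 0.1, False: 0.6}     # P(C=true | B)
P_D_given_B = {True: 0.95, False: 0.98}   # P(D=true | B)

def bern(value, p_true):
    return p_true if value else 1 - p_true

def joint(a, b, c, d):
    # P(A,B,C,D) = P(A) P(B|A) P(C|B) P(D|B)
    return (bern(a, P_A) * bern(b, P_B_given_A[a])
            * bern(c, P_C_given_B[b]) * bern(d, P_D_given_B[b]))

# e.g. P(A=true, B=false, C=true, D=true) = 0.4 * 0.7 * 0.6 * 0.98
print(joint(True, False, True, True))   # ≈ 0.1646
```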
Node parameters
Conditional probability distribution for C given B:

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

For a given combination of values of the parents (B in this example), the entries for the child variable must add up to 1, e.g. P(C=true | B=false) + P(C=false | B=false) = 1
Bayesian Networks
Two important properties:
1. Encodes the conditional dependence relationships
between the variables in the graph structure
2. Is a compact representation of the joint probability
distribution over all variables

Bayesian Classifier
• One binary variable at the core of the network, called the class variable
• The class variable can have any number of child variables (called attribute variables)
• The class variable has no parent
• All attribute variables have the class variable as a parent
• The attribute variables can have more than one parent variable
Bayesian Classifier
D is the class variable; A, B and C are attribute variables.

[Diagrams with conditional probability tables for three variants:
– Naïve classifier: D is the parent of A, B and C, and the attributes have no other parents (tables P(D), P(A|D), P(B|D), P(C|D)).
– Augmented Naïve classifier: D is the parent of A, B and C, and some attributes also depend on other attributes, e.g. B has both D and A as parents (table P(B|D,A)).
– Selective Naïve classifier: only a subset of the attribute variables is attached to D.]
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
He will play tennis or not?
PlayTennis Problem
You have data on weather conditions on the days Jones played
tennis in past
• Outlook – [Sunny, Overcast, Rain]
• Temperature – [Hot, Cool, Mild]
• Humidity – [High, Normal]
• Wind – [Weak, Strong]

PlayTennis: Bayesian Network

[Network diagram: PlayTennis is the class node, with Outlook, Temperature, Humidity and Wind as attribute nodes.]
Naïve Bayesian Classifier
• Bayes classification
  P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn | C) P(C)
  Difficulty: learning the joint probability P(X1, …, Xn | C)
• Naïve Bayes classification
  – Assumption: all input features are conditionally independent given the class!
    P(X1, X2, …, Xn | C) = P(X1|C) P(X2|C) ⋯ P(Xn|C)
  – MAP classification rule: for x = (x1, x2, …, xn), assign class c* if
    [P(x1|c*) ⋯ P(xn|c*)] P(c*) > [P(x1|c) ⋯ P(xn|c)] P(c)   for all c ≠ c*, c = c1, …, cL
Naïve Bayesian Classifier

Learning Phase: for each target value ci (ci = c1, c0)
  P̂(C = ci) ← estimate P(C = ci) with the examples in S
  For every feature value xjk of each feature Xj (j = 1, …, F; k = 1, …, Nj)
    P̂(Xj = xjk | C = ci) ← estimate P(Xj = xjk | C = ci) with the examples in S

Test Phase: given an unknown instance X′ = (a1, …, an), assign the label c1 if
  [P̂(a1|c1) ⋯ P̂(an|c1)] P̂(c1) > [P̂(a1|c0) ⋯ P̂(an|c0)] P̂(c0), and c0 otherwise
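A minimal Python sketch of the learning and test phases described above (an added illustration, not the lecture's code); the function and variable names are made up, the probabilities are plain relative frequencies, and no smoothing is applied:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: a list of (feature_dict, class_label) pairs.
    Returns class priors and a conditional-probability lookup function,
    both estimated by relative frequency as in the algorithm above."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)        # (feature, class) -> Counter of values
    for features, label in examples:
        for feature, value in features.items():
            value_counts[(feature, label)][value] += 1

    priors = {c: n / len(examples) for c, n in class_counts.items()}

    def cond_prob(feature, value, c):          # estimate of P(Xj = xjk | C = c)
        return value_counts[(feature, c)][value] / class_counts[c]

    return priors, cond_prob

def classify(x, priors, cond_prob):
    """MAP rule: pick the class maximising P(c) * prod_j P(x_j | c)."""
    def score(c):
        s = priors[c]
        for feature, value in x.items():
            s *= cond_prob(feature, value, c)
        return s
    return max(priors, key=score)
```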
Example: PlayTennis

[Network diagram: PlayTennis (Yes/No) at the centre, with attribute nodes Outlook (Sunny, Overcast, Rain), Temperature (Hot, Mild, Cool), Humidity (High, Normal) and Wind (Strong, Weak).]
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)


He will play tennis or not?
Example

Learning Phase: generate lookup tables

Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny      2/9       3/5            Hot          2/9       2/5
Overcast   4/9       0/5            Mild         4/9       2/5
Rain       3/9       2/5            Cool         3/9       1/5

Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
High       3/9       4/5            Strong       3/9       3/5
Normal     6/9       1/5            Weak         6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14

x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)


He will play tennis or not?
Example
• Test Phase
  – Given a new instance, predict its label:
    x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the tables obtained in the learning phase:
    P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
    P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
    P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
    P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
    P(Play=Yes) = 9/14                       P(Play=No) = 5/14

  – Decision making with the MAP rule:
    P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
    P(No | x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

    Since P(Yes | x') < P(No | x'), we label x' as "No".
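The same MAP computation as a short Python sketch (an added illustration):

```python
# Probabilities read from the lookup tables above
p_yes, p_no = 9/14, 5/14
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9)   # Sunny, Cool, High, Strong | Yes
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5)   # Sunny, Cool, High, Strong | No

score_yes = likelihood_yes * p_yes
score_no = likelihood_no * p_no
print(f"P(Yes | x') is proportional to {score_yes:.4f}")   # ≈ 0.0053
print(f"P(No  | x') is proportional to {score_no:.4f}")    # ≈ 0.0206
print("Prediction:", "Yes" if score_yes > score_no else "No")   # No
```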
PlayTennis Problem
With numeric data
Temperatures on previous 14 days and Jones’s playing history
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1

Temperature is 23 C. Play tennis or not?

Naïve Bayesian Classifier
Numeric (floating point) data
• Algorithm: continuous-valued features
  – A feature can take numberless (continuous) values
  – The conditional probability is often modelled with the normal distribution:

    P̂(Xj | C = ci) = 1 / (√(2π) σji) × exp( -(Xj - μji)² / (2 σji²) )

    μji : mean (average) of the values of feature Xj over the examples for which C = ci
    σji : standard deviation of the values of feature Xj over the examples for which C = ci

  – Learning Phase: for X = (X1, …, Xn) and C = c1, …, cL, output n × L normal distributions and P(C = ci), i = 1, …, L
  – Test Phase: given an unknown instance X′ = (a1, …, an),
    • instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase
    • apply the MAP rule to make a decision
PlayTennis Problem
With numeric data
• Example: continuous-valued features
  – Temperature is naturally a continuous value.
    Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
    No: 27.3, 30.1, 17.4, 29.5, 15.1
  – Estimate the mean and standard deviation for each class:
    μ = (1/N) Σ xn,   σ² = (1/N) Σ (xn - μ)²
    μYes = 21.64, σYes = 2.35
    μNo = 23.88, σNo = 7.09
  – Learning Phase: output two Gaussian models for P(temp | C):
    P̂(x | Yes) = 1 / (2.35 √(2π)) × exp( -(x - 21.64)² / (2 × 2.35²) )
    P̂(x | No)  = 1 / (7.09 √(2π)) × exp( -(x - 23.88)² / (2 × 7.09²) )
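A short Python sketch (an added illustration) that evaluates the two Gaussian models at 23 °C and applies the MAP rule; the class priors 9/14 and 5/14 are taken from the 14-day playing history above:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# (mean, standard deviation, prior) per class, as estimated above
params = {"Yes": (21.64, 2.35, 9 / 14), "No": (23.88, 7.09, 5 / 14)}

temp = 23.0
scores = {c: gaussian_pdf(temp, mu, sigma) * prior
          for c, (mu, sigma, prior) in params.items()}

print(scores)                                        # unnormalised MAP scores
print("Prediction:", max(scores, key=scores.get))    # "Yes" for 23 °C
```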
