Lecture 7 Bayesian Models - Merged
Bayesian Models
Study area
Area: 144 sq km
Unit Cell Size: 1 sq km
[Evidence and deposit layers (legend): H – High / L – Low; Px – Proximal / Ds – Distal (distance to fault layer); D – Deposit / ND – Non deposit (deposit layer, used as training data)]
Probabilistic approach:
Class of pixel (1,1) given high arsenic and proximal to fault
Probability & Statistics
[Diagram: Population → Sampling → Sample / Data; sample Statistics → Inferencing → population Parameters]
Frequency Distributions
Example – Uranium in groundwater
• Water samples were taken from 36 locations in Powai as part of a study to determine the natural variation of total dissolved solids in the area.
[Map: U concentrations in groundwater samples across the study area]
Frequency Distribution
[Histogram: frequency of measured concentration values]
Frequency distributions = probability distributions (when idealized and fitted to mathematical functions)
Probability: the “frequentist” approach
• probability should be assessed in purely objective terms
• no room for subjectivity on the part of individual researchers
• knowledge about probabilities comes from the relative frequency of a large number of trials
– this is a good model for coin tossing
– not so useful for predicting complex problems, where many of the factors are unknown…e.g., the stock market
Frequentist: "The probability of a coin landing heads is 50% because we
observed it in many trials."
Bayesian: "I believe the coin is fair based on prior evidence, and I’ll update
my belief as I see new data."
Probability: the Bayesian approach
• Bayes Theorem
– Thomas Bayes
– 18th century English clergyman
• Probability as a degree of belief: .5 = even odds; .1 = 1 chance out of 10
Probability
“something-has-to-happen rule”:
– The probability of the set of all possible outcomes of a
trial must be 1.
– P(S) = 1
(S represents set of all possible outcomes.)
CAUTION: are the outcomes equally likely?
Just because there are two outcomes does not mean they are 50-50.
In a desert: P(Rain) is nearly 1%.
$$\frac{1}{\binom{49}{6}} = \frac{1}{\;49!/(43!\,6!)\;} = \frac{1}{13{,}983{,}816} \approx 7.2 \times 10^{-8}$$
("49 choose 6": out of 49 numbers, this is the number of distinct combinations of 6.)
The probability function (note, sums to 1.0):
x                    p(x)
+$2,000,000 (win)    7.2 × 10⁻⁸
−$1 (lose)           0.999999928
Expected Value
E(X) = P(win) × $2,000,000 + P(lose) × (−$1.00)
     = 7.2 × 10⁻⁸ × 2.0 × 10⁶ + 0.999999928 × (−1) = 0.144 − 0.999999928 ≈ −$0.86
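As a quick check of this arithmetic, a minimal Python sketch using the same figures (the $2,000,000 prize and $1 stake are the values quoted above):

```python
from math import comb

# Chance of matching all 6 numbers drawn from 49: 1 / C(49, 6)
p_win = 1 / comb(49, 6)        # 1 / 13,983,816 ~= 7.2e-8
p_lose = 1 - p_win             # ~= 0.999999928

# Expected value of a $1 ticket with a $2,000,000 prize
expected_value = p_win * 2_000_000 + p_lose * (-1)
print(p_win, expected_value)   # ~7.2e-08, ~ -0.86
```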
I’m 90% sure Tesla will drop tomorrow—Elon’s tweet seemed off
Rules of probability:
addition rule
Definition: events that have no outcomes in common (and thus cannot occur together) are called mutually exclusive.
For mutually exclusive events A and B: P(A or B) = P(A) + P(B).
• P(A|B)=Prob of A, given B
Conditional probability (cont.)
• P(B|A) = P(A&B)/P(A)
Independence
With notation for conditional probabilities, we can now
formalize the definition of independence
• events A and B are independent whenever
P(B|A) = P(B)
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d \mid h)\,P(h) + P(d \mid \lnot h)\,P(\lnot h)}$$
Does the patient have cancer or not?
• A patient takes a lab test and the result comes back positive. It is
known that the test returns a correct positive result in 98% of the
cases and a correct negative result in 97% of the cases.
Furthermore, only 0.008 of the entire population has this disease.
P(cancer | +) = P(+ | cancer) P(cancer) / [P(+ | cancer) P(cancer) + P(+ | no cancer) P(no cancer)]
= (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992) = 0.208511
Probability of the test being positive when there was no cancer
= 1 − probability of the test being negative when there was no cancer = 1 − 0.97 = 0.03
P(no cancer | +) = 1 − 0.208511 = 0.791489
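A minimal Python sketch of the same Bayes' theorem calculation (all numbers come from the example above):

```python
# Values given in the example
p_cancer = 0.008                  # prior: P(cancer)
p_pos_given_cancer = 0.98         # P(+ | cancer)
p_neg_given_no_cancer = 0.97      # P(- | no cancer)
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer   # 0.03

# Bayes' theorem: P(cancer | positive test)
evidence = (p_pos_given_cancer * p_cancer
            + p_pos_given_no_cancer * (1 - p_cancer))
p_cancer_given_pos = p_pos_given_cancer * p_cancer / evidence

print(round(p_cancer_given_pos, 6))       # 0.208511
print(round(1 - p_cancer_given_pos, 6))   # 0.791489
```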
Choosing Hypotheses
• Maximum Likelihood hypothesis:
$$h_{ML} = \arg\max_{h \in H} P(d \mid h)$$
[Map: 10 km × 10 km study area showing target deposits and geological feature B1]
Objective: To estimate the probability of occurrence of D in each unit cell of the study area
Approach: Use BAYES’ THEOREM for updating the prior probability of the occurrence of
mineral deposit to posterior probability based on the conditional probabilities (or
weights of evidence) of the geological features.
Weights of evidence model
Step 1: Calculation of prior probability
[Map: 10 km × 10 km study area (S) divided into 1 km × 1 km unit cells, with target deposits marked]
• The probability of the occurrence of the targeted mineral deposit type when no other geological information about the area is available or considered:
$$P(D) = \frac{n(D)}{n(S)}$$
where n(D) is the number of unit cells containing a deposit and n(S) is the total number of unit cells in the study area.
$$P(D \mid B) = \frac{P(D \cap B)}{P(B)} = P(D)\,\frac{P(B \mid D)}{P(B)}$$
$$P(D \mid \bar{B}) = \frac{P(D \cap \bar{B})}{P(\bar{B})} = P(D)\,\frac{P(\bar{B} \mid D)}{P(\bar{B})}$$
$$W^{+} = \log_e \frac{P(B \mid D)}{P(B \mid \bar{D})}\,;\qquad W^{-} = \log_e \frac{P(\bar{B} \mid D)}{P(\bar{B} \mid \bar{D})}$$
Step 3: Calculation of weights of evidence
$$W^{+} = \log_e \frac{P(B \mid D)}{P(B \mid \bar{D})}\,;\qquad W^{-} = \log_e \frac{P(\bar{B} \mid D)}{P(\bar{B} \mid \bar{D})}$$
[Venn diagram: evidence layer B1 and deposits D within the study area S]
$$P(D) = \frac{n(D)}{n(S)}$$
$$P(B \mid D) = \frac{n(B \cap D)}{n(D)}$$
$$P(B \mid \bar{D}) = \frac{n(B \cap \bar{D})}{n(\bar{D})}$$
$$P(\bar{B} \mid D) = \frac{n(\bar{B} \cap D)}{n(D)} = \frac{n(D) - n(D \cap B)}{n(D)}$$
$$P(\bar{B} \mid \bar{D}) = \frac{n(\bar{B} \cap \bar{D})}{n(\bar{D})} = \frac{n(S) - n(B) - n(D) + n(B \cap D)}{n(\bar{D})}$$
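The counting formulas above translate directly into code. A minimal sketch (the function name is illustrative, not from the lecture):

```python
import math

def weights_of_evidence(n_S, n_D, n_B, n_B_and_D):
    """W+, W- and Contrast for a binary evidence layer B and deposits D.

    n_S       : total number of unit cells in the study area
    n_D       : number of unit cells containing a deposit
    n_B       : number of unit cells where feature B is present
    n_B_and_D : number of deposit cells where B is present
    """
    p_B_given_D       = n_B_and_D / n_D
    p_B_given_notD    = (n_B - n_B_and_D) / (n_S - n_D)
    p_notB_given_D    = (n_D - n_B_and_D) / n_D
    p_notB_given_notD = (n_S - n_B - n_D + n_B_and_D) / (n_S - n_D)

    w_plus  = math.log(p_B_given_D / p_B_given_notD)
    w_minus = math.log(p_notB_given_D / p_notB_given_notD)
    return w_plus, w_minus, w_plus - w_minus   # contrast C = W+ - W-
```

For instance, with n_S = 100, n_D = 10, n_B_and_D = 4 and a hypothetical n_B = 16 (the actual extent of B1 is only shown on the exercise map), this returns W⁺ ≈ 1.10 and W⁻ ≈ −0.37, close to the B1 values quoted in the exercise below.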
Exercise
Study area S: 10 km × 10 km; unit cell size = 1 sq km; each deposit occupies 1 unit cell.
[Map: study area S showing geological features B1 and B2 and the deposit locations]
Calculate the weights of evidence (W⁺ and W⁻) and Contrast values for B1 and B2, using the formulas from Step 3.
For B1: P(B1|D) = n(B1 ∩ D) / n(D) = 4/10; W⁺(B1) = 1.09888, W⁻(B1) = −0.3678.
Contrast (C) measures the net strength of spatial association between the
geological feature and mineral deposits
Contrast = W+ – W-
Combining the evidence layers:
Loge(O{D|B1, B2}) = Loge(O{D}) + W±(B1) + W±(B2)
Converting the posterior odds back to a probability: P = O/(1+O) = 0.0705/1.0705 = 0.0658
[Posterior probabilities for the four B1/B2 combinations: 0.2968, 0.0854, 0.2342, 0.0658]
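A minimal sketch of this combination step, assuming the evidence layers are conditionally independent given the deposits (the standard weights-of-evidence assumption); the function name is illustrative:

```python
import math

def posterior_probability(prior_prob, weights):
    """Combine the prior P(D) with the W+/W- weights of the evidence layers."""
    log_odds = math.log(prior_prob / (1 - prior_prob)) + sum(weights)
    odds = math.exp(log_odds)
    return odds / (1 + odds)   # P = O / (1 + O), e.g. 0.0705 / 1.0705 = 0.0658
```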
Bayesian Networks
& Classifiers
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model
P(C | X), where C = c_1, …, c_L and X = (X_1, …, X_n)
[Diagram: discriminative probabilistic classifier mapping an input x = (x_1, x_2, …, x_n) to P(C | x)]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model
P(X | C), where C = c_1, …, c_L and X = (X_1, …, X_n)
[Diagram: generative probabilistic model of x = (x_1, x_2, …, x_n) for each class C]
The Joint Probability Distribution
• Joint probabilities can be between any number of variables, e.g. P(A = true, B = true, C = true)
• For each combination of variables, we need to say how probable that combination is
• The probabilities of these combinations need to sum to 1
• Once you have the joint probability distribution, you can calculate any probability involving A, B, and C

A      B      C      P(A,B,C)
false  false  false  0.1
false  false  true   0.2
false  true   false  0.05
false  true   true   0.05
true   false  false  0.3
true   false  true   0.1
true   true   false  0.05
true   true   true   0.15
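A sketch of how any probability involving A, B and C can be read off the joint table above (the dictionary keys encode the (A, B, C) truth values):

```python
# Joint distribution P(A, B, C) from the table above, keyed by (A, B, C)
joint = {
    (False, False, False): 0.10, (False, False, True): 0.20,
    (False, True,  False): 0.05, (False, True,  True): 0.05,
    (True,  False, False): 0.30, (True,  False, True): 0.10,
    (True,  True,  False): 0.05, (True,  True,  True): 0.15,
}

# Marginal P(A=true): sum over all rows with A=true
p_A = sum(p for (a, b, c), p in joint.items() if a)            # 0.6

# Conditional P(B=true | A=true) = P(A=true, B=true) / P(A=true)
p_AB = sum(p for (a, b, c), p in joint.items() if a and b)     # 0.2
print(p_A, p_AB / p_A)                                         # 0.6, 0.333...
```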
Independence
Variables A and B are independent if any of the following hold:
• P(A,B) = P(A) P(B)
• P(A | B) = P(A)
• P(B | A) = P(B)
Conditional Independence
Variables A and B are conditionally independent given C if any of the
following hold:
• P(A, B | C) = P(A | C) P(B | C)
• P(A | B, C) = P(A | C)
• P(B | A, C) = P(B | C)
A Bayesian Network
Suppose there are four binary variables: A, B, C, D such that
• A is independent => P(A|B,C,D) = P(A) => A does not have parents
A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph (DAG)
2. Parameters: a conditional probability distribution table for each node

[DAG: A → B; B → C; B → D]

A      P(A)
false  0.6
true   0.4

A      B      P(B|A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

B      D      P(D|B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95
Node parameters
Conditional probability distribution for C given B

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

For a given combination of values of the parents (B in this example), the entries for the child variable must add up to 1,
e.g. P(C=true | B=false) + P(C=false | B=false) = 1
Bayesian Networks
Two important properties:
1. Encodes the conditional dependence relationships
between the variables in the graph structure
2. Is a compact representation of the joint probability
distribution over all variables
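To make property 2 concrete, a sketch that recovers a joint probability from the CPTs of the example network above via the chain rule P(A,B,C,D) = P(A)·P(B|A)·P(C|B)·P(D|B) (7 independent parameters instead of the 15 a full joint table over four binary variables would need):

```python
# CPTs from the example network (A -> B, B -> C, B -> D)
P_A = {True: 0.4, False: 0.6}
P_B_given_A = {(True, True): 0.3,   (True, False): 0.7,
               (False, True): 0.99, (False, False): 0.01}   # key: (A, B)
P_C_given_B = {(True, True): 0.1,   (True, False): 0.9,
               (False, True): 0.6,  (False, False): 0.4}    # key: (B, C)
P_D_given_B = {(True, True): 0.95,  (True, False): 0.05,
               (False, True): 0.98, (False, False): 0.02}   # key: (B, D)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) via the chain rule for this DAG."""
    return P_A[a] * P_B_given_A[(a, b)] * P_C_given_B[(b, c)] * P_D_given_B[(b, d)]

print(joint(True, False, True, False))   # 0.4 * 0.7 * 0.6 * 0.02 = 0.00336
```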
Bayesian Classifier
• One binary variable at the core of the network, called the class variable
• The class variable can have as many child variables (called attribute variables) as needed
• The class variable has no parent
• All attribute variables have the class variable as their parent
• The attribute variables can have more than one parent variable.
Bayesian Classifier
D is the class variable; A, B and C are attribute variables
[Network diagram: class variable D with attribute variables A, B and C as its children, and a conditional probability table for each node, e.g. P(D) and P(A|D)]
PlayTennis: Bayesian Network
[Network diagram: class variable PlayTennis with attribute nodes Outlook, Temperature, Humidity and Wind]
Naïve Bayesian Classifier
• Bayes classification
P(C|X) ∝ P(X|C)P(C) = P(X_1, …, X_n | C) P(C)
Difficulty: learning the joint probability P(X_1, …, X_n | C)
• Naïve Bayes classification
– Assumption: all input features are conditionally independent given the class!
P(X_1, X_2, …, X_n | C) = P(X_1 | C) P(X_2 | C) ⋯ P(X_n | C)
Naïve Bayesian Classifier
Learning phase: given a training set S,
For each target value c_i (c_i = c_1, c_0):
  P̂(C = c_i) ← estimate P(C = c_i) with examples in S;
  For every feature value x_jk of each feature X_j (j = 1, …, F; k = 1, …, N_j):
    P̂(X_j = x_jk | C = c_i) ← estimate P(X_j = x_jk | C = c_i) with examples in S;
Test phase: given an unknown instance X' = (a_1, …, a_n), assign the label c_1 if
[P̂(a_1 | c_1) ⋯ P̂(a_n | c_1)] P̂(c_1) > [P̂(a_1 | c_0) ⋯ P̂(a_n | c_0)] P̂(c_0), otherwise assign c_0.
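A minimal Python sketch of these two phases for discrete features (plain frequency counts, no smoothing; function and variable names are illustrative, not from the lecture):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Learning phase: estimate P(C) and P(X_j = x | C) by relative frequency."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}

    # cond[(j, value, c)] approximates P(X_j = value | C = c)
    cond = defaultdict(float)
    pair_counts = Counter((j, v, c) for x, c in zip(examples, labels)
                          for j, v in enumerate(x))
    for (j, v, c), cnt in pair_counts.items():
        cond[(j, v, c)] = cnt / class_counts[c]
    return priors, cond

def classify(x, priors, cond):
    """Test phase: pick the class maximizing P(c) * prod_j P(x_j | c)."""
    best, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for j, v in enumerate(x):
            score *= cond[(j, v, c)]
        if score > best_score:
            best, best_score = c, score
    return best
```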
Example: PlayTennis
[Network diagram: class variable PlayTennis (yes/No) with attribute nodes Outlook (Sunny, Overcast), Temperature (Hot, Mild, Cool), Humidity (High, Normal, Low) and Wind (Strong, Weak)]
Example
• Test Phase
– Given a new instance, predict its label:
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables estimated in the learning phase:
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14
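Completing the test phase, a quick check that multiplies out the looked-up values (the numbers are exactly those listed above):

```python
# Naive Bayes scores for x' = (Sunny, Cool, High, Strong)
score_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9    # P(Yes) * prod P(a_i | Yes) ~= 0.0053
score_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5    # P(No)  * prod P(a_i | No)  ~= 0.0206

print("Yes" if score_yes > score_no else "No")   # -> "No": predicts PlayTennis = No
```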
PlayTennis Problem
With numeric data
Temperatures on previous 14 days and Jones’s playing history
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.
No: 27.3, 30.1, 17.4, 29.5, 15.1
Naïve Bayesian Classifier
Numeric (floating point) data
• Algorithm: Continuous-valued Features
– Infinitely many possible values for a feature
– Conditional probability often modeled with the normal distribution:
$$\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$$
μ_ji : mean (average) of the feature values X_j of examples for which C = c_i
σ_ji : standard deviation of the feature values X_j of examples for which C = c_i
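A small sketch of this estimate in Python, fitted to the 'No' class temperatures listed above (the query temperature 22.0 is an arbitrary illustration, and the sample standard deviation with n−1 is assumed):

```python
import math

def gaussian_likelihood(x, values):
    """P_hat(X_j = x | C = c_i) under a normal model fitted to `values`
    (sample mean and sample standard deviation)."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# 'No' class temperatures from the slide above; 22.0 is an arbitrary query value
temps_no = [27.3, 30.1, 17.4, 29.5, 15.1]
print(gaussian_likelihood(22.0, temps_no))
```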