Today
Conditional distributions (forgot to review)
Bayesian Decision Theory
Two category classification
Multiple category classification
Discriminant Functions
Course Road Map
(Topics below are ordered along an axis from "a lot is known" / "easier" to "little is known" / "harder".)
Bayesian Decision theory
Know probability distribution of the categories
Never happens in the real world
Do not even need training data
Can design optimal classifier
Example
respected fish expert says that salmon’s length
has distribution N(5,1) and sea bass’s length
has distribution N(10,4)
(Figure: the two length densities, one for salmon and one for sea bass.)
ML and Bayesian parameter estimation
Shape of probability distribution is known
Happens sometimes
Labeled training data
Need to estimate parameters of probability
distribution from the training data
Example
respected fish expert says salmon's length has distribution $N(\mu_1, \sigma_1^2)$ and sea bass's length has distribution $N(\mu_2, \sigma_2^2)$
Need to estimate parameters $\mu_1, \sigma_1^2, \mu_2, \sigma_2^2$
Then can use the methods from Bayesian decision theory
Linear discriminant functions and Neural Nets
No probability distribution (neither shape nor parameters are known)
Labeled data
The shape of discriminant functions is
known
(Figure: salmon and bass samples in the lightness and length feature space, separated by a linear discriminant function.)
Need to estimate parameters of the discriminant function (parameters of the line in the case of a linear discriminant)
Non-Parametric Methods
Neither the probability distribution nor the discriminant function is known
Happens quite often
All we have is labeled data
Estimate the probability distribution
from the labeled data
Unsupervised Learning and Clustering
Data is not labeled
Happens quite often
1. Estimate the probability distribution
from the unlabeled data
2. Cluster the data
Course Road Map
1. Bayesian Decision theory (rare case)
Know probability distribution of the categories
Do not even need training data
Can design optimal classifier
2. ML and Bayesian parameter estimation
Need to estimate Parameters of probability dist.
Need training data
3. Non-Parametric Methods
No probability distribution, labeled data
4. Linear discriminant functions and Neural Nets
The shape of discriminant functions is known
Need to estimate parameters of discriminant functions
5. Unsupervised Learning and Clustering
No probability distribution and unlabeled data
Notation Change (for consistency with textbook)
Pr[A]: probability of event A
P(x): probability mass function of discrete r.v. x
p(x): probability density function of continuous r.v. x
p(x,y): joint probability density of r.v. x and y
p(x|y): conditional density of x given y
P(x|y): conditional mass of x given y
More on Probability
For events A and B, we have defined conditional probability, the law of total probability, and Bayes' rule:

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$$

$$\Pr(A) = \sum_{k=1}^{n} \Pr(A \mid B_k)\Pr(B_k)$$

$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\Pr(B_i)}{\sum_{k=1}^{n} \Pr(A \mid B_k)\Pr(B_k)}$$
We usually model with random variables, not events. We need equivalents of these laws for mass and density functions (we could go from random variables back to events, but that is time consuming).
Conditional Mass Function: Discrete RV
For a discrete r.v. there is nothing new, because a mass function is really a probability law
Define the conditional mass function of X given Y = y (y fixed) by
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$
This is a probability mass function because:
$$\sum_{\forall x} P(x \mid y) = \sum_{\forall x} \frac{P(x, y)}{P(y)} = \frac{\sum_{\forall x} P(x, y)}{P(y)} = \frac{P(y)}{P(y)} = 1$$
This is really nothing new because:
$$P(x \mid y) = \frac{P(x, y)}{P(y)} = \frac{\Pr[X = x \cap Y = y]}{\Pr[Y = y]} = \Pr[X = x \mid Y = y]$$
Conditional Mass Function: Bayes Rule
The law of total probability:
$$P(x) = \sum_{\forall y} P(x, y) = \sum_{\forall y} P(x \mid y)P(y)$$
Bayes' rule:
$$P(y \mid x) = \frac{P(y, x)}{P(x)} = \frac{P(x \mid y)P(y)}{\sum_{\forall y} P(x \mid y)P(y)}$$
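Not part of the original slides: a minimal Python sketch of the discrete Bayes rule and the law of total probability above, using the ear-size numbers from the cats-and-dogs example later in these notes; the uniform prior is an assumption added only for illustration.

```python
# Minimal sketch (illustrative, not from the original slides): discrete Bayes rule.
def posterior(likelihood, prior):
    """Return P(y | x) for every class y, given P(x | y) for the observed x and prior P(y)."""
    # law of total probability: P(x) = sum_y P(x | y) P(y)
    p_x = sum(likelihood[y] * prior[y] for y in prior)
    # Bayes rule: P(y | x) = P(x | y) P(y) / P(x)
    return {y: likelihood[y] * prior[y] / p_x for y in prior}

# P(large ears | class) from the cats-and-dogs slide; the uniform prior is an assumption
likelihood = {"dog": 0.9, "cat": 0.2}
prior = {"dog": 0.5, "cat": 0.5}
print(posterior(likelihood, prior))   # posteriors sum to 1
```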
Conditional Density Function: Continuous RV
Does it make sense to talk about conditional density
p(x|y) if Y is a continuous random variable? After
all, Pr[Y=y]=0, so we will never see Y=y in practice
Measurements have limited accuracy. We can interpret an observation y as an observation in the interval [y-ε, y+ε], and an observation x as an observation in the interval [x-ε, x+ε].
Conditional Density Function: Continuous RV
Let B(x) denote the interval [x-ε, x+ε]. Then
$$\Pr[X \in B(x)] = \int_{x-\varepsilon}^{x+\varepsilon} p(x)\,dx \approx 2\varepsilon\, p(x)$$
Similarly, $\Pr[Y \in B(y)] \approx 2\varepsilon\, p(y)$ and $\Pr[X \in B(x) \cap Y \in B(y)] \approx 4\varepsilon^2 p(x, y)$.
Thus we should have
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \mid Y \in B(y)]}{2\varepsilon}$$
which can be simplified to
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \cap Y \in B(y)]}{2\varepsilon\,\Pr[Y \in B(y)]} \approx \frac{p(x, y)}{p(y)}$$
Conditional Density Function: Continuous RV
Define the conditional density function of X given Y = y (y fixed) by
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
This is a probability density function because:
$$\int_{-\infty}^{\infty} p(x \mid y)\,dx = \int_{-\infty}^{\infty} \frac{p(x, y)}{p(y)}\,dx = \frac{\int_{-\infty}^{\infty} p(x, y)\,dx}{p(y)} = \frac{p(y)}{p(y)} = 1$$
The law of total probability:
$$p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy = \int_{-\infty}^{\infty} p(x \mid y)p(y)\,dy$$
Conditional Density Function: Bayes Rule
Bayes' rule:
$$p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(x \mid y)p(y)}{\int_{-\infty}^{\infty} p(x \mid y)p(y)\,dy}$$
Mixed Discrete and Continuous
X discrete, Y continuous. Bayes' rule:
$$P(x \mid y) = \frac{p(y \mid x)P(x)}{p(y)}$$

X continuous, Y discrete. Bayes' rule:
$$p(x \mid y) = \frac{P(y \mid x)p(x)}{P(y)}$$
Bayesian Decision Theory
Know probability distribution of the
categories
Almost never the case in real life!
Nevertheless useful since other cases can be
reduced to this one after some work
Do not even need training data
Can design optimal classifier
Cats and Dogs
Suppose we have these conditional probability
mass functions for cats and dogs
P(small ears | dog) = 0.1, P(large ears | dog) = 0.9
P(small ears | cat) = 0.8, P(large ears | cat) = 0.2
Observe an animal with large ears
Dog or a cat?
Makes sense to say dog because probability of
observing large ears in a dog is much larger than
probability of observing large ears in a cat
P(large ears | dog) = 0.9 > 0.2 = P(large ears | cat)
We choose the event of larger probability, i.e.
maximum likelihood event
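The same decision as a short sketch; the numbers are exactly the conditional mass functions given on this slide.

```python
# Sketch of the maximum likelihood decision for the cats-and-dogs example above.
likelihood = {
    "dog": {"small ears": 0.1, "large ears": 0.9},
    "cat": {"small ears": 0.8, "large ears": 0.2},
}

def ml_classify(observation):
    # choose the class that maximizes P(observation | class)
    return max(likelihood, key=lambda c: likelihood[c][observation])

print(ml_classify("large ears"))   # -> dog, since 0.9 > 0.2
```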
Example: Fish Sorting
Respected fish expert says that
Salmon's length has distribution N(5,1)
Sea bass’s length has distribution N(10,4)
Recall that if a r.v. is $N(\mu, \sigma^2)$ then its density is
$$p(l) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(l-\mu)^2}{2\sigma^2}}$$
Thus the class conditional densities are
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Likelihood function
The class conditional densities $p(l \mid \text{salmon})$ and $p(l \mid \text{bass})$ above are functions of l with the class fixed. Now fix the length l and let the fish class vary. Then we get the likelihood function (it is not a density and not a probability mass):
$$p(l \mid \text{class}) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} & \text{if class = salmon} \\[1.5ex] \dfrac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}} & \text{if class = bass} \end{cases}$$
Likelihood vs. Class Conditional Density
(Figure: the two class conditional densities p(l | class) plotted against length, with length 7 marked.)
Suppose a fish has length 7. How do we classify it?
ML (maximum likelihood) Classifier
We would like to choose salmon if
Pr[length= 7 | salmon] > Pr[length= 7 | bass]
However, since length is a continuous r.v.,
Pr[length= 7 | salmon] = Pr[length= 7 | bass] = 0
Instead, we choose the class which maximizes the likelihood
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
ML classifier: for an observed l, if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass.
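A minimal sketch of this ML classifier in Python, assuming SciPy is available; note that norm.pdf takes the standard deviation, so the bass density N(10, 4) uses scale = 2.

```python
# ML classifier for the fish example: salmon length ~ N(5, 1), bass length ~ N(10, 4).
from scipy.stats import norm

def ml_classify(l):
    p_salmon = norm.pdf(l, loc=5, scale=1)    # p(l | salmon)
    p_bass = norm.pdf(l, loc=10, scale=2)     # p(l | bass), sigma = sqrt(4) = 2
    return "salmon" if p_salmon > p_bass else "bass"

print(ml_classify(7))   # -> bass (cf. the interval justification on the next slide)
```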
Interval Justification
At l = 7, p(7 | bass) > p(7 | salmon), so we choose the class (bass) which is more likely to have given the observation:
$$\Pr[l \in B(7) \mid \text{bass}] \approx 2\varepsilon\, p(7 \mid \text{bass}) > 2\varepsilon\, p(7 \mid \text{salmon}) \approx \Pr[l \in B(7) \mid \text{salmon}]$$
Decision Boundary
(Figure: decision boundary at length 6.70; classify as salmon for lengths below the boundary, as sea bass above it.)
Priors
Prior comes from prior knowledge, no data
has been seen yet
Suppose a fish expert says: in the fall, there
are twice as many salmon as sea bass
Prior for our fish sorting problem
P(salmon) = 2/3
P(bass) = 1/3
With the addition of prior to our model, how
should we classify a fish of length 7?
How Prior Changes Decision Boundary?
Without priors, the decision boundary is at 6.70 (salmon to the left, sea bass to the right).
How should this change with prior?
P(salmon) = 2/3
P(bass) = 1/3
(Figure: should the boundary at 6.70 shift left or right?)
Bayes Decision Rule
1. Have likelihood functions
p(length | salmon) and p(length | bass)
2. Have priors P(salmon) and P(bass)
Question: having observed a fish of a certain length, do we classify it as salmon or as bass?
Natural Idea:
salmon if P(salmon| length) > P(bass| length)
bass if P(bass| length) > P(salmon| length)
Posterior
P(salmon | length) and P(bass | length)
are called posterior distributions, because
the data (length) was revealed (post data)
How to compute posteriors? Not obvious
From Bayes' rule:
$$P(\text{salmon} \mid \text{length}) = \frac{p(\text{salmon}, \text{length})}{p(\text{length})} = \frac{p(\text{length} \mid \text{salmon})\,P(\text{salmon})}{p(\text{length})}$$
Similarly:
$$P(\text{bass} \mid \text{length}) = \frac{p(\text{length} \mid \text{bass})\,P(\text{bass})}{p(\text{length})}$$
MAP (maximum a posteriori) classifier
Decide salmon if $P(\text{salmon} \mid \text{length}) > P(\text{bass} \mid \text{length})$, else bass. Equivalently,
$$\frac{p(\text{length} \mid \text{salmon})\,P(\text{salmon})}{p(\text{length})} \;\gtrless\; \frac{p(\text{length} \mid \text{bass})\,P(\text{bass})}{p(\text{length})}$$
which, since p(length) cancels, reduces to comparing
$$p(\text{length} \mid \text{salmon})\,P(\text{salmon}) \;\gtrless\; p(\text{length} \mid \text{bass})\,P(\text{bass})$$
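The same comparison sketched as code: a MAP classifier for the fish example that multiplies each likelihood by its prior; p(length) never needs to be computed. Assumes SciPy.

```python
# MAP classifier for the fish example with priors P(salmon) = 2/3, P(bass) = 1/3.
from scipy.stats import norm

priors = {"salmon": 2 / 3, "bass": 1 / 3}
densities = {"salmon": norm(loc=5, scale=1), "bass": norm(loc=10, scale=2)}

def map_classify(l):
    # p(length) cancels, so compare p(l | c) P(c) directly
    return max(priors, key=lambda c: densities[c].pdf(l) * priors[c])

print(map_classify(7))   # -> salmon once the prior favours salmon
```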
Back to Fish Sorting Example
Likelihoods:
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}}$$
Priors: P(salmon) = 2/3, P(bass) = 1/3
Solve the inequality
$$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \cdot \frac{2}{3} \;>\; \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}} \cdot \frac{1}{3}$$
(Figure: the new decision boundary is at 7.18, shifted right from 6.70; salmon to the left, sea bass to the right.)
New decision boundary makes sense since
we expect to see more salmon
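A quick numerical check of this boundary, sketched with SciPy's brentq root finder; the bracket [6, 9] is an assumption that the crossing point lies in that range.

```python
# Find the length where p(l | salmon) P(salmon) = p(l | bass) P(bass).
from scipy.stats import norm
from scipy.optimize import brentq

def diff(l):
    return norm.pdf(l, 5, 1) * (2 / 3) - norm.pdf(l, 10, 2) * (1 / 3)

boundary = brentq(diff, 6, 9)      # the two sides cross once in this bracket
print(round(boundary, 2))          # ~7.18, matching the boundary above
```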
Prior P(salmon) = 2/3 and P(bass) = 1/3 vs. prior P(salmon) = 0.999 and P(bass) = 0.001:
(Figure: with the more extreme prior, the decision boundary moves from about 7.1 to about 8.9.)
Likelihood vs Posteriors
Likelihood p(l | fish class): a density with respect to length, so for each class the area under the curve is 1.
Posterior P(fish class | l): a mass function with respect to the fish class, so for each l, P(salmon | l) + P(bass | l) = 1.
(Figure: the likelihood curves p(l | salmon) and p(l | bass) versus the posterior curves P(salmon | l) and P(bass | l).)
More on Posterior
$$\underbrace{P(c \mid l)}_{\text{posterior (our goal)}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood (given)}}\;\overbrace{P(c)}^{\text{prior (given)}}}{p(l)}$$
The denominator p(l) is a normalizing factor; often we do not even need it for classification, since p(l) does not depend on the class c. If we do need it, from the law of total probability:
$$p(l) = p(l \mid \text{salmon})\,P(\text{salmon}) + p(l \mid \text{bass})\,P(\text{bass})$$
Notice that this formula consists of likelihoods and priors, which are given.
More on Posterior
$$\underbrace{P(c \mid l)}_{\text{posterior}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood}}\;\overbrace{P(c)}^{\text{prior}}}{p(l)}$$
Here c is the cause (class) and l is the effect (length).
If the cause c is present, it is easy to determine the probability of the effect l with the likelihood p(l | c).
Usually we observe the effect l without knowing the cause c. It is hard to determine the cause c because several causes could produce the same effect l.
Bayes' rule makes it easy to determine the posterior P(c | l) if we know the likelihood p(l | c) and the prior P(c).
More on Priors
Prior comes from prior knowledge, no data
has been seen yet
If there is a reliable source of prior knowledge, it should be used
Some problems cannot even be solved
reliably without a good prior
However, the prior alone is not enough; we still need the likelihood
P(salmon)=2/3, P(sea bass)=1/3
If I don’t let you see the data, but ask you to
guess, will you choose salmon or sea bass?
More on MAP Classifier
$$\underbrace{P(c \mid l)}_{\text{posterior}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood}}\;\overbrace{P(c)}^{\text{prior}}}{p(l)}$$
We do not care about p(l) when maximizing P(c | l), so
$$P(c \mid l) \propto p(l \mid c)\,P(c)$$
If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: $P(c \mid l) \propto p(l \mid c)$.
If for some observation l, p(l | salmon) = p(l | bass), then this observation is uninformative and the decision is based solely on the prior: $P(c \mid l) \propto P(c)$.
Justification for MAP Classifier
Let us compute the probability of error for the MAP rule: decide salmon if $P(\text{salmon} \mid l) > P(\text{bass} \mid l)$, else bass.
For any particular l, the probability of error is
$$\Pr[\text{error} \mid l] = \begin{cases} P(\text{bass} \mid l) & \text{if we decide salmon} \\ P(\text{salmon} \mid l) & \text{if we decide bass} \end{cases}$$
Thus the MAP classifier is optimal for each individual l!
Justification for MAP Classifier
We want to minimize the error not just for one l; we really want to minimize the average error over all l:
$$\Pr[\text{error}] = \int_{-\infty}^{\infty} p(\text{error}, l)\,dl = \int_{-\infty}^{\infty} \Pr[\text{error} \mid l]\,p(l)\,dl$$
If Pr[error | l] is as small as possible, the integral is as small as possible.
But the MAP decision rule makes Pr[error | l] as small as possible.
Thus MAP classifier minimizes the probability of error!
Today
Bayesian Decision theory
Multiple Classes
General loss functions
Multivariate Normal Random Variable
Classifiers
Discriminant Functions
More General Case
Let's generalize a little bit:
Have more than one feature: $x = [x_1, x_2, \dots, x_d]$
Have more than 2 classes: $\{c_1, c_2, \dots, c_m\}$
More General Case
As before, for each j we have:
$p(x \mid c_j)$: the likelihood of observation x given that the true class is $c_j$
$P(c_j)$: the prior probability of class $c_j$
$P(c_j \mid x)$: the posterior probability of class $c_j$ given that we observed data x
The evidence, or probability density for the data, is
$$p(x) = \sum_{j=1}^{m} p(x \mid c_j)\,P(c_j)$$
Minimum Error Rate Classification
We want to minimize the average probability of error
$$\Pr[\text{error}] = \int p(\text{error}, x)\,dx = \int \Pr[\text{error} \mid x]\,p(x)\,dx$$
so we need to make $\Pr[\text{error} \mid x]$ as small as possible for every x, where
$$\Pr[\text{error} \mid x] = 1 - P(c_i \mid x) \quad \text{if we decide class } c_i$$
$\Pr[\text{error} \mid x]$ is minimized by the MAP classifier: decide on class $c_i$ if
$$P(c_i \mid x) > P(c_j \mid x) \quad \forall j \neq i$$
Thus the MAP classifier is optimal if we want to minimize the probability of error.
(Figure: the curves $1 - P(c_i \mid x)$ and $P(c_i \mid x)$ for three classes, showing that picking the largest posterior minimizes the error at every x.)
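A small sketch of the m-class minimum error rate (MAP) classifier; the three one-dimensional Gaussian classes and the priors here are hypothetical, chosen only to make the code runnable.

```python
# Minimum error rate classification: pick the class with the largest posterior.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])        # P(c_j), hypothetical
means = np.array([0.0, 3.0, 6.0])         # hypothetical class means
sigmas = np.array([1.0, 1.0, 2.0])        # hypothetical class standard deviations

def map_classify(x):
    likelihoods = norm.pdf(x, means, sigmas)   # p(x | c_j) for each j
    unnormalized = likelihoods * priors        # proportional to P(c_j | x)
    return int(np.argmax(unnormalized))        # index of the MAP class

print(map_classify(2.0))
```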
General Bayesian Decision Theory
In close cases we may want to refuse to make a decision (let a human expert handle the tough case).
Allow actions $\{\alpha_1, \alpha_2, \dots, \alpha_k\}$
Suppose some mistakes are more costly than others (classifying a benign tumor as cancer is not as bad as classifying cancer as a benign tumor).
Allow loss functions $\lambda(\alpha_i \mid c_j)$ describing the loss incurred when taking action $\alpha_i$ when the true class is $c_j$.
Conditional Risk
Suppose we observe x and wish to take action $\alpha_i$.
If the true class is $c_j$, by definition we incur loss $\lambda(\alpha_i \mid c_j)$.
The probability that the true class is $c_j$ after observing x is $P(c_j \mid x)$.
The expected loss associated with taking action $\alpha_i$ is called the conditional risk and it is
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
Conditional Risk
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
The sum is over the disjoint events that the true class is $c_j$ (the different classes); $P(c_j \mid x)$ is the probability of class $c_j$ given observation x; $\lambda(\alpha_i \mid c_j)$ is the penalty for taking action $\alpha_i$ if the true class is $c_j$. Each term is the part of the overall penalty which comes from the event that the true class is $c_j$.
Example: Zero-One loss function
Action $\alpha_i$ is the decision that the true class is $c_i$, with
$$\lambda(\alpha_i \mid c_j) = \begin{cases} 0 & \text{if } i = j \text{ (no mistake)} \\ 1 & \text{otherwise (mistake)} \end{cases}$$
Then
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x) = \sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x) = \Pr[\text{error if we decide } c_i]$$
Thus the MAP classifier, which decides $c_i$ if $P(c_i \mid x) > P(c_j \mid x)\ \forall j \neq i$, minimizes $R(\alpha_i \mid x)$.
The MAP classifier is the Bayes decision rule under the zero-one loss function.
Overall Risk
A decision rule is a function α(x) which for every x specifies an action out of $\{\alpha_1, \alpha_2, \dots, \alpha_k\}$.
The average (overall) risk for α(x) is
$$R(\alpha) = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
and we need to make this as small as possible.
The Bayes decision rule: for every x, α(x) is the action which minimizes the conditional risk
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
The Bayes decision rule α(x) is optimal, i.e. it gives the minimum possible overall risk $R^*$.
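A sketch of the general Bayes decision rule as code: compute the conditional risk of every action from a loss matrix and the posteriors, then take the minimum-risk action. The loss values and the reject cost are hypothetical.

```python
# General Bayes decision rule: minimize conditional risk R(alpha_i | x).
import numpy as np

# rows = actions (decide c1, decide c2, reject); columns = true classes c1, c2
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0],
                 [0.3, 0.3]])    # hypothetical loss matrix with a reject option

def bayes_decide(posteriors):
    """posteriors = [P(c1|x), P(c2|x)]; return the index of the minimum-risk action."""
    risks = loss @ posteriors    # R(alpha_i | x) = sum_j lambda(alpha_i | c_j) P(c_j | x)
    return int(np.argmin(risks))

print(bayes_decide(np.array([0.55, 0.45])))   # a close call -> action 2 (reject)
```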
Bayes Risk: Example
Salmon is more tasty and expensive than sea bass
$\lambda_{sb} = \lambda(\text{salmon} \mid \text{bass}) = 2$: classify bass as salmon
$\lambda_{bs} = \lambda(\text{bass} \mid \text{salmon}) = 1$: classify salmon as bass
$\lambda_{ss} = \lambda_{bb} = 0$: no mistake, no loss
Likelihoods:
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Priors: P(salmon) = P(bass)
Risk:
$$R(\alpha \mid l) = \sum_{j=1}^{m} \lambda(\alpha \mid c_j)\,P(c_j \mid l) = \lambda_{\alpha s}P(s \mid l) + \lambda_{\alpha b}P(b \mid l)$$
$$R(\text{salmon} \mid l) = \lambda_{ss}P(s \mid l) + \lambda_{sb}P(b \mid l) = \lambda_{sb}P(b \mid l)$$
$$R(\text{bass} \mid l) = \lambda_{bs}P(s \mid l) + \lambda_{bb}P(b \mid l) = \lambda_{bs}P(s \mid l)$$
Bayes Risk: Example
$$R(\text{salmon} \mid l) = \lambda_{sb}P(b \mid l) \qquad R(\text{bass} \mid l) = \lambda_{bs}P(s \mid l)$$
The Bayes decision rule (optimal for our loss function) decides salmon when $\lambda_{sb}P(b \mid l) < \lambda_{bs}P(s \mid l)$, i.e. we need to solve
$$\frac{P(b \mid l)}{P(s \mid l)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Or, equivalently, using Bayes' rule and the fact that the priors are equal:
$$\frac{P(b \mid l)}{P(s \mid l)} = \frac{p(l \mid b)\,P(b)/p(l)}{p(l \mid s)\,P(s)/p(l)} = \frac{p(l \mid b)}{p(l \mid s)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Bayes Risk: Example
We need to solve
$$\frac{p(l \mid b)}{p(l \mid s)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Substituting the likelihoods and losses:
$$\frac{2 \cdot \sqrt{2\pi}\,\exp\!\left(-\frac{(l-10)^2}{8}\right)}{1 \cdot 2\sqrt{2\pi}\,\exp\!\left(-\frac{(l-5)^2}{2}\right)} < 1
\;\Leftrightarrow\;
\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < 1
\;\Leftrightarrow\;
\ln\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < \ln(1)$$
$$\Leftrightarrow\; -\frac{(l-10)^2}{8} + \frac{(l-5)^2}{2} < 0 \;\Leftrightarrow\; 3l^2 - 20l < 0 \;\Leftrightarrow\; l < 6.6667$$
(Figure: the new decision boundary is at 6.67, slightly left of the old boundary at 6.70; salmon to the left, sea bass to the right.)
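A numerical sanity check of the worked example, assuming SciPy: decide salmon when λ_sb p(l | bass) < λ_bs p(l | salmon), and find the length at which the two sides cross.

```python
# The crossing point of 2 p(l | bass) and p(l | salmon), with equal priors.
from scipy.stats import norm
from scipy.optimize import brentq

def diff(l):
    return 2 * norm.pdf(l, 10, 2) - 1 * norm.pdf(l, 5, 1)

print(round(brentq(diff, 5, 8), 4))   # ~6.6667, the boundary derived above
```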
Likelihood Ratio Rule
In the 2 category case, use the likelihood ratio rule: decide $c_1$ if
$$\frac{p(x \mid c_1)}{p(x \mid c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}$$
and otherwise decide $c_2$. The left side is the likelihood ratio; the right side is a fixed number, independent of x.
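The rule as a short sketch, reusing the fish densities and the losses λ_12 = 2, λ_21 = 1, λ_11 = λ_22 = 0 from the example above (c1 = salmon, c2 = bass); the equal priors are the default arguments.

```python
# Two-category likelihood ratio rule: the threshold does not depend on x.
from scipy.stats import norm

def decide(l, p1=0.5, p2=0.5, lam12=2.0, lam21=1.0, lam11=0.0, lam22=0.0):
    ratio = norm.pdf(l, 5, 1) / norm.pdf(l, 10, 2)         # p(l | c1) / p(l | c2)
    threshold = (lam12 - lam22) / (lam21 - lam11) * p2 / p1
    return "c1 (salmon)" if ratio > threshold else "c2 (bass)"

print(decide(6.0))   # ratio well above the threshold of 2 -> salmon
```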
Discriminant Functions
All decision rules have the same structure: at observation x, choose the class $c_i$ such that
$$g_i(x) > g_j(x) \quad \forall j \neq i$$
where $g_i$ is called a discriminant function.
ML decision rule: $g_i(x) = p(x \mid c_i)$
MAP decision rule: $g_i(x) = P(c_i \mid x)$
Bayes decision rule: $g_i(x) = -R(\alpha_i \mid x)$
Discriminant Functions
A classifier can be viewed as a network which computes m discriminant functions $g_1(x), g_2(x), \dots, g_m(x)$ from the features $x_1, x_2, \dots, x_d$ and selects the category corresponding to the largest discriminant.
Each $g_i(x)$ can be replaced with $f(g_i(x))$ for any monotonically increasing function f; the results will be unchanged.
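A sketch of this network view with hypothetical two-dimensional Gaussian classes, using the monotone transform $g_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$ of the MAP discriminant.

```python
# Discriminant functions g_1 ... g_m followed by an argmax over classes.
import numpy as np
from scipy.stats import multivariate_normal

priors = [0.4, 0.35, 0.25]                                   # hypothetical P(c_i)
classes = [multivariate_normal(mean=[0, 0], cov=np.eye(2)),  # hypothetical p(x | c_i)
           multivariate_normal(mean=[3, 3], cov=np.eye(2)),
           multivariate_normal(mean=[0, 4], cov=2 * np.eye(2))]

def classify(x):
    g = [c.logpdf(x) + np.log(p) for c, p in zip(classes, priors)]  # g_i(x)
    return int(np.argmax(g))      # class with the largest discriminant

print(classify([2.5, 2.0]))
```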
Decision Regions
The discriminant functions split the feature vector space X into decision regions.
(Figure: the feature space partitioned into regions labeled c1, c2, c3; in the region labeled c2, $g_2(x) = \max_i g_i(x)$. The regions for c1 and c3 each appear in two separate pieces.)
Important Points
If we know probability distributions for the
classes, we can design the optimal
classifier
Definition of “optimal” depends on the
chosen loss function
Under the minimum error rate (zero-one loss function):
No prior: ML classifier is optimal
Have prior: MAP classifier is optimal
More general loss function
General Bayes classifier is optimal
Acknowledgment
Selected slides in this presentation
are taken from the course materials
of Prof. Olga Veksler, University of
Western Ontario.
Source: https://www.csd.uwo.ca/~oveksler/Courses/CS434a_541a/
Used for non-commercial educational purposes
with proper credit to the original author.