Today
Conditional distributions (forgot to review)
Bayesian Decision Theory
Two category classification
Multiple category classification
Discriminant Functions
Course Road Map
(Topics below are ordered along an axis from "a lot is known" / "easier" to "little is known" / "harder".)
Bayesian Decision theory
Know probability distribution of the categories
Never happens in the real world
Do not even need training data
Can design optimal classifier
Example
respected fish expert says that salmon’s length
has distribution N(5,1) and sea bass’s length
has distribution N(10,4)
(Figure: the two length densities, one for salmon and one for sea bass.)
ML and Bayesian parameter estimation
Shape of probability distribution is known
Happens sometimes
Labeled training data
Need to estimate parameters of probability
distribution from the training data
Example
respected fish expert says salmon's length has distribution $N(\mu_1, \sigma_1^2)$ and sea bass's length has distribution $N(\mu_2, \sigma_2^2)$
Need to estimate parameters $\mu_1, \sigma_1^2, \mu_2, \sigma_2^2$
Then can use the methods from Bayesian decision theory
Linear discriminant functions and Neural Nets
No probability distribution (neither shape nor parameters are known)
Labeled data
The shape of discriminant functions is
known
(Figure: salmon and bass samples in the lightness and length feature space, separated by a linear discriminant function.)
Need to estimate parameters of the discriminant function (parameters of the line in the case of a linear discriminant)
Non-Parametric Methods
Neither the probability distribution nor the discriminant function is known
Happens quite often
All we have is labeled data
Estimate the probability distribution
from the labeled data
Unsupervised Learning and Clustering
Data is not labeled
Happens quite often
1. Estimate the probability distribution
from the unlabeled data
2. Cluster the data
Course Road Map
1. Bayesian Decision theory (rare case)
Know probability distribution of the categories
Do not even need training data
Can design optimal classifier
2. ML and Bayesian parameter estimation
Need to estimate Parameters of probability dist.
Need training data
3. Non-Parametric Methods
No probability distribution, labeled data
4. Linear discriminant functions and Neural Nets
The shape of discriminant functions is known
Need to estimate parameters of discriminant functions
5. Unsupervised Learning and Clustering
No probability distribution and unlabeled data
Notation Change (for consistency with textbook)
Pr[A]: probability of event A
P(x): probability mass function of discrete r.v. x
p(x): probability density function of continuous r.v. x
p(x,y): joint probability density of r.v. x and y
p(x|y): conditional density of x given y
P(x|y): conditional mass of x given y
More on Probability
For events A and B, we have defined conditional probability, the law of total probability, and Bayes' rule:

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$$

$$\Pr(A) = \sum_{k=1}^{n} \Pr(A \mid B_k)\Pr(B_k)$$

$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\Pr(B_i)}{\sum_{k=1}^{n} \Pr(A \mid B_k)\Pr(B_k)}$$
We usually model with random variables, not events. We need equivalents of these laws for mass and density functions (we could go from random variables back to events, but that is time consuming).
Conditional Mass Function: Discrete RV
For a discrete r.v. there is nothing new, because a mass function is really a probability law
Define the conditional mass function of X given Y = y (y fixed) by
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$
This is a probability mass function because:
$$\sum_{\forall x} P(x \mid y) = \sum_{\forall x} \frac{P(x, y)}{P(y)} = \frac{\sum_{\forall x} P(x, y)}{P(y)} = \frac{P(y)}{P(y)} = 1$$
This is really nothing new because:
$$P(x \mid y) = \frac{P(x, y)}{P(y)} = \frac{\Pr[X = x \cap Y = y]}{\Pr[Y = y]} = \Pr[X = x \mid Y = y]$$
Conditional Mass Function: Bayes Rule
The law of total probability:
$$P(x) = \sum_{\forall y} P(x, y) = \sum_{\forall y} P(x \mid y)P(y)$$
Bayes' rule:
$$P(y \mid x) = \frac{P(y, x)}{P(x)} = \frac{P(x \mid y)P(y)}{\sum_{\forall y} P(x \mid y)P(y)}$$
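Not part of the original slides: a minimal Python sketch of the discrete Bayes rule and the law of total probability above, using the ear-size numbers from the cats-and-dogs example later in these notes; the uniform prior is an assumption added only for illustration.

```python
# Minimal sketch (illustrative, not from the original slides): discrete Bayes rule.
def posterior(likelihood, prior):
    """Return P(y | x) for every class y, given P(x | y) for the observed x and prior P(y)."""
    # law of total probability: P(x) = sum_y P(x | y) P(y)
    p_x = sum(likelihood[y] * prior[y] for y in prior)
    # Bayes rule: P(y | x) = P(x | y) P(y) / P(x)
    return {y: likelihood[y] * prior[y] / p_x for y in prior}

# P(large ears | class) from the cats-and-dogs slide; the uniform prior is an assumption
likelihood = {"dog": 0.9, "cat": 0.2}
prior = {"dog": 0.5, "cat": 0.5}
print(posterior(likelihood, prior))   # posteriors sum to 1
```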
Conditional Density Function: Continuous RV
Does it make sense to talk about conditional density
p(x|y) if Y is a continuous random variable? After
all, Pr[Y=y]=0, so we will never see Y=y in practice
Measurements have limited accuracy. We can interpret an observation y as an observation in the interval [y-ε, y+ε], and an observation x as an observation in the interval [x-ε, x+ε].
Conditional Density Function: Continuous RV
Let B(x) denote the interval [x-ε, x+ε]. Then
$$\Pr[X \in B(x)] = \int_{x-\varepsilon}^{x+\varepsilon} p(x)\,dx \approx 2\varepsilon\, p(x)$$
Similarly, $\Pr[Y \in B(y)] \approx 2\varepsilon\, p(y)$ and $\Pr[X \in B(x) \cap Y \in B(y)] \approx 4\varepsilon^2 p(x, y)$.
Thus we should have
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \mid Y \in B(y)]}{2\varepsilon}$$
which can be simplified to
$$p(x \mid y) \approx \frac{\Pr[X \in B(x) \cap Y \in B(y)]}{2\varepsilon\,\Pr[Y \in B(y)]} \approx \frac{p(x, y)}{p(y)}$$
Conditional Density Function: Continuous RV
Define the conditional density function of X given Y = y (y fixed) by
$$p(x \mid y) = \frac{p(x, y)}{p(y)}$$
This is a probability density function because:
$$\int_{-\infty}^{\infty} p(x \mid y)\,dx = \int_{-\infty}^{\infty} \frac{p(x, y)}{p(y)}\,dx = \frac{\int_{-\infty}^{\infty} p(x, y)\,dx}{p(y)} = \frac{p(y)}{p(y)} = 1$$
The law of total probability:
$$p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy = \int_{-\infty}^{\infty} p(x \mid y)p(y)\,dy$$
Conditional Density Function: Bayes Rule
Bayes' rule:
$$p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(x \mid y)p(y)}{\int_{-\infty}^{\infty} p(x \mid y)p(y)\,dy}$$
Mixed Discrete and Continuous
X discrete, Y continuous. Bayes' rule:
$$P(x \mid y) = \frac{p(y \mid x)P(x)}{p(y)}$$

X continuous, Y discrete. Bayes' rule:
$$p(x \mid y) = \frac{P(y \mid x)p(x)}{P(y)}$$
Bayesian Decision Theory
Know probability distribution of the
categories
Almost never the case in real life!
Nevertheless useful since other cases can be
reduced to this one after some work
Do not even need training data
Can design optimal classifier
Cats and Dogs
Suppose we have these conditional probability
mass functions for cats and dogs
P(small ears | dog) = 0.1, P(large ears | dog) = 0.9
P(small ears | cat) = 0.8, P(large ears | cat) = 0.2
Observe an animal with large ears
Dog or a cat?
Makes sense to say dog because probability of
observing large ears in a dog is much larger than
probability of observing large ears in a cat
P(large ears | dog) = 0.9 > 0.2 = P(large ears | cat)
We choose the event of larger probability, i.e.
maximum likelihood event
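The same decision as a short sketch; the numbers are exactly the conditional mass functions given on this slide.

```python
# Sketch of the maximum likelihood decision for the cats-and-dogs example above.
likelihood = {
    "dog": {"small ears": 0.1, "large ears": 0.9},
    "cat": {"small ears": 0.8, "large ears": 0.2},
}

def ml_classify(observation):
    # choose the class that maximizes P(observation | class)
    return max(likelihood, key=lambda c: likelihood[c][observation])

print(ml_classify("large ears"))   # -> dog, since 0.9 > 0.2
```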
Example: Fish Sorting
Respected fish expert says that
Salmon's length has distribution N(5,1)
Sea bass’s length has distribution N(10,4)
Recall that if a r.v. is $N(\mu, \sigma^2)$ then its density is
$$p(l) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(l-\mu)^2}{2\sigma^2}}$$
Thus the class conditional densities are
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Likelihood function
The class conditional densities $p(l \mid \text{salmon})$ and $p(l \mid \text{bass})$ above are functions of l with the class fixed. Now fix the length l and let the fish class vary. Then we get the likelihood function (it is not a density and not a probability mass):
$$p(l \mid \text{class}) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} & \text{if class = salmon} \\[1.5ex] \dfrac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}} & \text{if class = bass} \end{cases}$$
Likelihood vs. Class Conditional Density
(Figure: the two class conditional densities p(l | class) plotted against length, with length 7 marked.)
Suppose a fish has length 7. How do we classify it?
ML (maximum likelihood) Classifier
We would like to choose salmon if
Pr[length= 7 | salmon] > Pr[length= 7 | bass]
However, since length is a continuous r.v.,
Pr[length= 7 | salmon] = Pr[length= 7 | bass] = 0
Instead, we choose the class which maximizes the likelihood
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
ML classifier: for an observed l, if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass.
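A minimal sketch of this ML classifier in Python, assuming SciPy is available; note that norm.pdf takes the standard deviation, so the bass density N(10, 4) uses scale = 2.

```python
# ML classifier for the fish example: salmon length ~ N(5, 1), bass length ~ N(10, 4).
from scipy.stats import norm

def ml_classify(l):
    p_salmon = norm.pdf(l, loc=5, scale=1)    # p(l | salmon)
    p_bass = norm.pdf(l, loc=10, scale=2)     # p(l | bass), sigma = sqrt(4) = 2
    return "salmon" if p_salmon > p_bass else "bass"

print(ml_classify(7))   # -> bass (cf. the interval justification on the next slide)
```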
Interval Justification
At l = 7, p(7 | bass) > p(7 | salmon), so we choose the class (bass) which is more likely to have given the observation:
$$\Pr[l \in B(7) \mid \text{bass}] \approx 2\varepsilon\, p(7 \mid \text{bass}) > 2\varepsilon\, p(7 \mid \text{salmon}) \approx \Pr[l \in B(7) \mid \text{salmon}]$$
Decision Boundary
(Figure: decision boundary at length 6.70; classify as salmon for lengths below the boundary, as sea bass above it.)
Priors
Prior comes from prior knowledge, no data
has been seen yet
Suppose a fish expert says: in the fall, there
are twice as many salmon as sea bass
Prior for our fish sorting problem
P(salmon) = 2/3
P(bass) = 1/3
With the addition of prior to our model, how
should we classify a fish of length 7?
How Prior Changes Decision Boundary?
Without priors, the decision boundary is at 6.70 (salmon to the left, sea bass to the right).
How should this change with prior?
P(salmon) = 2/3
P(bass) = 1/3
(Figure: should the boundary at 6.70 shift left or right?)
Bayes Decision Rule
1. Have likelihood functions
p(length | salmon) and p(length | bass)
2. Have priors P(salmon) and P(bass)
Question: having observed a fish of a certain length, do we classify it as salmon or as bass?
Natural Idea:
salmon if P(salmon| length) > P(bass| length)
bass if P(bass| length) > P(salmon| length)
Posterior
P(salmon | length) and P(bass | length)
are called posterior distributions, because
the data (length) was revealed (post data)
How to compute posteriors? Not obvious
From Bayes' rule:
$$P(\text{salmon} \mid \text{length}) = \frac{p(\text{salmon}, \text{length})}{p(\text{length})} = \frac{p(\text{length} \mid \text{salmon})\,P(\text{salmon})}{p(\text{length})}$$
Similarly:
$$P(\text{bass} \mid \text{length}) = \frac{p(\text{length} \mid \text{bass})\,P(\text{bass})}{p(\text{length})}$$
MAP (maximum a posteriori) classifier
Decide salmon if $P(\text{salmon} \mid \text{length}) > P(\text{bass} \mid \text{length})$, else bass. Equivalently,
$$\frac{p(\text{length} \mid \text{salmon})\,P(\text{salmon})}{p(\text{length})} \;\gtrless\; \frac{p(\text{length} \mid \text{bass})\,P(\text{bass})}{p(\text{length})}$$
which, since p(length) cancels, reduces to comparing
$$p(\text{length} \mid \text{salmon})\,P(\text{salmon}) \;\gtrless\; p(\text{length} \mid \text{bass})\,P(\text{bass})$$
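The same comparison sketched as code: a MAP classifier for the fish example that multiplies each likelihood by its prior; p(length) never needs to be computed. Assumes SciPy.

```python
# MAP classifier for the fish example with priors P(salmon) = 2/3, P(bass) = 1/3.
from scipy.stats import norm

priors = {"salmon": 2 / 3, "bass": 1 / 3}
densities = {"salmon": norm(loc=5, scale=1), "bass": norm(loc=10, scale=2)}

def map_classify(l):
    # p(length) cancels, so compare p(l | c) P(c) directly
    return max(priors, key=lambda c: densities[c].pdf(l) * priors[c])

print(map_classify(7))   # -> salmon once the prior favours salmon
```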
Back to Fish Sorting Example
Likelihoods:
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}}$$
Priors: P(salmon) = 2/3, P(bass) = 1/3
Solve the inequality
$$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \cdot \frac{2}{3} \;>\; \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{8}} \cdot \frac{1}{3}$$
(Figure: the new decision boundary is at 7.18, shifted right from 6.70; salmon to the left, sea bass to the right.)
New decision boundary makes sense since
we expect to see more salmon
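A quick numerical check of this boundary, sketched with SciPy's brentq root finder; the bracket [6, 9] is an assumption that the crossing point lies in that range.

```python
# Find the length where p(l | salmon) P(salmon) = p(l | bass) P(bass).
from scipy.stats import norm
from scipy.optimize import brentq

def diff(l):
    return norm.pdf(l, 5, 1) * (2 / 3) - norm.pdf(l, 10, 2) * (1 / 3)

boundary = brentq(diff, 6, 9)      # the two sides cross once in this bracket
print(round(boundary, 2))          # ~7.18, matching the boundary above
```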
Prior P(salmon) = 2/3 and P(bass) = 1/3 vs. prior P(salmon) = 0.999 and P(bass) = 0.001:
(Figure: with the more extreme prior, the decision boundary moves from about 7.1 to about 8.9.)
Likelihood vs Posteriors
Likelihood p(l | fish class): a density with respect to length, so for each class the area under the curve is 1.
Posterior P(fish class | l): a mass function with respect to the fish class, so for each l, P(salmon | l) + P(bass | l) = 1.
(Figure: the likelihood curves p(l | salmon) and p(l | bass) versus the posterior curves P(salmon | l) and P(bass | l).)
More on Posterior
$$\underbrace{P(c \mid l)}_{\text{posterior (our goal)}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood (given)}}\;\overbrace{P(c)}^{\text{prior (given)}}}{p(l)}$$
The denominator p(l) is a normalizing factor; often we do not even need it for classification, since p(l) does not depend on the class c. If we do need it, from the law of total probability:
$$p(l) = p(l \mid \text{salmon})\,P(\text{salmon}) + p(l \mid \text{bass})\,P(\text{bass})$$
Notice that this formula consists of likelihoods and priors, which are given.
More on Posterior
$$\underbrace{P(c \mid l)}_{\text{posterior}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood}}\;\overbrace{P(c)}^{\text{prior}}}{p(l)}$$
Here c is the cause (class) and l is the effect (length).
If the cause c is present, it is easy to determine the probability of the effect l with the likelihood p(l | c).
Usually we observe the effect l without knowing the cause c. It is hard to determine the cause c because several causes could produce the same effect l.
Bayes' rule makes it easy to determine the posterior P(c | l) if we know the likelihood p(l | c) and the prior P(c).
More on Priors
Prior comes from prior knowledge, no data
has been seen yet
If there is a reliable source of prior knowledge, it should be used
Some problems cannot even be solved
reliably without a good prior
However, the prior alone is not enough; we still need the likelihood
P(salmon)=2/3, P(sea bass)=1/3
If I don’t let you see the data, but ask you to
guess, will you choose salmon or sea bass?
More on MAP Classifier
$$\underbrace{P(c \mid l)}_{\text{posterior}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood}}\;\overbrace{P(c)}^{\text{prior}}}{p(l)}$$
We do not care about p(l) when maximizing P(c | l), so
$$P(c \mid l) \propto p(l \mid c)\,P(c)$$
If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: $P(c \mid l) \propto p(l \mid c)$.
If for some observation l, p(l | salmon) = p(l | bass), then this observation is uninformative and the decision is based solely on the prior: $P(c \mid l) \propto P(c)$.
Justification for MAP Classifier
Let us compute the probability of error for the MAP rule: decide salmon if $P(\text{salmon} \mid l) > P(\text{bass} \mid l)$, else bass.
For any particular l, the probability of error is
$$\Pr[\text{error} \mid l] = \begin{cases} P(\text{bass} \mid l) & \text{if we decide salmon} \\ P(\text{salmon} \mid l) & \text{if we decide bass} \end{cases}$$
Thus the MAP classifier is optimal for each individual l!
Justification for MAP Classifier
We want to minimize the error not just for one l; we really want to minimize the average error over all l:
$$\Pr[\text{error}] = \int_{-\infty}^{\infty} p(\text{error}, l)\,dl = \int_{-\infty}^{\infty} \Pr[\text{error} \mid l]\,p(l)\,dl$$
If Pr[error | l] is as small as possible, the integral is as small as possible.
But the MAP decision rule makes Pr[error | l] as small as possible.
Thus MAP classifier minimizes the probability of error!
Today
Bayesian Decision theory
Multiple Classes
General loss functions
Multivariate Normal Random Variable
Classifiers
Discriminant Functions
More General Case
Let's generalize a little bit:
Have more than one feature: $x = [x_1, x_2, \dots, x_d]$
Have more than 2 classes: $\{c_1, c_2, \dots, c_m\}$
More General Case
As before, for each j we have:
$p(x \mid c_j)$: the likelihood of observation x given that the true class is $c_j$
$P(c_j)$: the prior probability of class $c_j$
$P(c_j \mid x)$: the posterior probability of class $c_j$ given that we observed data x
The evidence, or probability density for the data, is
$$p(x) = \sum_{j=1}^{m} p(x \mid c_j)\,P(c_j)$$
Minimum Error Rate Classification
We want to minimize the average probability of error
$$\Pr[\text{error}] = \int p(\text{error}, x)\,dx = \int \Pr[\text{error} \mid x]\,p(x)\,dx$$
so we need to make $\Pr[\text{error} \mid x]$ as small as possible for every x, where
$$\Pr[\text{error} \mid x] = 1 - P(c_i \mid x) \quad \text{if we decide class } c_i$$
$\Pr[\text{error} \mid x]$ is minimized by the MAP classifier: decide on class $c_i$ if
$$P(c_i \mid x) > P(c_j \mid x) \quad \forall j \neq i$$
Thus the MAP classifier is optimal if we want to minimize the probability of error.
(Figure: the curves $1 - P(c_i \mid x)$ and $P(c_i \mid x)$ for three classes, showing that picking the largest posterior minimizes the error at every x.)
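A small sketch of the m-class minimum error rate (MAP) classifier; the three one-dimensional Gaussian classes and the priors here are hypothetical, chosen only to make the code runnable.

```python
# Minimum error rate classification: pick the class with the largest posterior.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])        # P(c_j), hypothetical
means = np.array([0.0, 3.0, 6.0])         # hypothetical class means
sigmas = np.array([1.0, 1.0, 2.0])        # hypothetical class standard deviations

def map_classify(x):
    likelihoods = norm.pdf(x, means, sigmas)   # p(x | c_j) for each j
    unnormalized = likelihoods * priors        # proportional to P(c_j | x)
    return int(np.argmax(unnormalized))        # index of the MAP class

print(map_classify(2.0))
```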
General Bayesian Decision Theory
In close cases we may want to refuse to make a decision (let a human expert handle the tough case).
Allow actions $\{\alpha_1, \alpha_2, \dots, \alpha_k\}$
Suppose some mistakes are more costly than others (classifying a benign tumor as cancer is not as bad as classifying cancer as a benign tumor).
Allow loss functions $\lambda(\alpha_i \mid c_j)$ describing the loss incurred when taking action $\alpha_i$ when the true class is $c_j$.
Conditional Risk
Suppose we observe x and wish to take action $\alpha_i$.
If the true class is $c_j$, by definition we incur loss $\lambda(\alpha_i \mid c_j)$.
The probability that the true class is $c_j$ after observing x is $P(c_j \mid x)$.
The expected loss associated with taking action $\alpha_i$ is called the conditional risk and it is
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
Conditional Risk
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
The sum is over the disjoint events that the true class is $c_j$ (the different classes); $P(c_j \mid x)$ is the probability of class $c_j$ given observation x; $\lambda(\alpha_i \mid c_j)$ is the penalty for taking action $\alpha_i$ if the true class is $c_j$. Each term is the part of the overall penalty which comes from the event that the true class is $c_j$.
Example: Zero-One loss function
Action $\alpha_i$ is the decision that the true class is $c_i$, with
$$\lambda(\alpha_i \mid c_j) = \begin{cases} 0 & \text{if } i = j \text{ (no mistake)} \\ 1 & \text{otherwise (mistake)} \end{cases}$$
Then
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x) = \sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x) = \Pr[\text{error if we decide } c_i]$$
Thus the MAP classifier, which decides $c_i$ if $P(c_i \mid x) > P(c_j \mid x)\ \forall j \neq i$, minimizes $R(\alpha_i \mid x)$.
The MAP classifier is the Bayes decision rule under the zero-one loss function.
Overall Risk
A decision rule is a function α(x) which for every x specifies an action out of $\{\alpha_1, \alpha_2, \dots, \alpha_k\}$.
The average (overall) risk for α(x) is
$$R(\alpha) = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
and we need to make this as small as possible.
The Bayes decision rule: for every x, α(x) is the action which minimizes the conditional risk
$$R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\,P(c_j \mid x)$$
The Bayes decision rule α(x) is optimal, i.e. it gives the minimum possible overall risk $R^*$.
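A sketch of the general Bayes decision rule as code: compute the conditional risk of every action from a loss matrix and the posteriors, then take the minimum-risk action. The loss values and the reject cost are hypothetical.

```python
# General Bayes decision rule: minimize conditional risk R(alpha_i | x).
import numpy as np

# rows = actions (decide c1, decide c2, reject); columns = true classes c1, c2
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0],
                 [0.3, 0.3]])    # hypothetical loss matrix with a reject option

def bayes_decide(posteriors):
    """posteriors = [P(c1|x), P(c2|x)]; return the index of the minimum-risk action."""
    risks = loss @ posteriors    # R(alpha_i | x) = sum_j lambda(alpha_i | c_j) P(c_j | x)
    return int(np.argmin(risks))

print(bayes_decide(np.array([0.55, 0.45])))   # a close call -> action 2 (reject)
```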
Bayes Risk: Example
Salmon is more tasty and expensive than sea bass
$\lambda_{sb} = \lambda(\text{salmon} \mid \text{bass}) = 2$: classify bass as salmon
$\lambda_{bs} = \lambda(\text{bass} \mid \text{salmon}) = 1$: classify salmon as bass
$\lambda_{ss} = \lambda_{bb} = 0$: no mistake, no loss
Likelihoods:
$$p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(l-5)^2}{2}} \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(l-10)^2}{2 \cdot 4}}$$
Priors: P(salmon) = P(bass)
Risk:
$$R(\alpha \mid l) = \sum_{j=1}^{m} \lambda(\alpha \mid c_j)\,P(c_j \mid l) = \lambda_{\alpha s}P(s \mid l) + \lambda_{\alpha b}P(b \mid l)$$
$$R(\text{salmon} \mid l) = \lambda_{ss}P(s \mid l) + \lambda_{sb}P(b \mid l) = \lambda_{sb}P(b \mid l)$$
$$R(\text{bass} \mid l) = \lambda_{bs}P(s \mid l) + \lambda_{bb}P(b \mid l) = \lambda_{bs}P(s \mid l)$$
Bayes Risk: Example
$$R(\text{salmon} \mid l) = \lambda_{sb}P(b \mid l) \qquad R(\text{bass} \mid l) = \lambda_{bs}P(s \mid l)$$
The Bayes decision rule (optimal for our loss function) decides salmon when $\lambda_{sb}P(b \mid l) < \lambda_{bs}P(s \mid l)$, i.e. we need to solve
$$\frac{P(b \mid l)}{P(s \mid l)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Or, equivalently, using Bayes' rule and the fact that the priors are equal:
$$\frac{P(b \mid l)}{P(s \mid l)} = \frac{p(l \mid b)\,P(b)/p(l)}{p(l \mid s)\,P(s)/p(l)} = \frac{p(l \mid b)}{p(l \mid s)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Bayes Risk: Example
We need to solve
$$\frac{p(l \mid b)}{p(l \mid s)} < \frac{\lambda_{bs}}{\lambda_{sb}}$$
Substituting the likelihoods and losses:
$$\frac{2 \cdot \sqrt{2\pi}\,\exp\!\left(-\frac{(l-10)^2}{8}\right)}{1 \cdot 2\sqrt{2\pi}\,\exp\!\left(-\frac{(l-5)^2}{2}\right)} < 1
\;\Leftrightarrow\;
\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < 1
\;\Leftrightarrow\;
\ln\frac{\exp\!\left(-\frac{(l-10)^2}{8}\right)}{\exp\!\left(-\frac{(l-5)^2}{2}\right)} < \ln(1)$$
$$\Leftrightarrow\; -\frac{(l-10)^2}{8} + \frac{(l-5)^2}{2} < 0 \;\Leftrightarrow\; 3l^2 - 20l < 0 \;\Leftrightarrow\; l < 6.6667$$
(Figure: the new decision boundary is at 6.67, slightly left of the old boundary at 6.70; salmon to the left, sea bass to the right.)
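A numerical sanity check of the worked example, assuming SciPy: decide salmon when λ_sb p(l | bass) < λ_bs p(l | salmon), and find the length at which the two sides cross.

```python
# The crossing point of 2 p(l | bass) and p(l | salmon), with equal priors.
from scipy.stats import norm
from scipy.optimize import brentq

def diff(l):
    return 2 * norm.pdf(l, 10, 2) - 1 * norm.pdf(l, 5, 1)

print(round(brentq(diff, 5, 8), 4))   # ~6.6667, the boundary derived above
```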
Likelihood Ratio Rule
In the 2 category case, use the likelihood ratio rule: decide $c_1$ if
$$\frac{p(x \mid c_1)}{p(x \mid c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}$$
and otherwise decide $c_2$. The left side is the likelihood ratio; the right side is a fixed number, independent of x.
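The rule as a short sketch, reusing the fish densities and the losses λ_12 = 2, λ_21 = 1, λ_11 = λ_22 = 0 from the example above (c1 = salmon, c2 = bass); the equal priors are the default arguments.

```python
# Two-category likelihood ratio rule: the threshold does not depend on x.
from scipy.stats import norm

def decide(l, p1=0.5, p2=0.5, lam12=2.0, lam21=1.0, lam11=0.0, lam22=0.0):
    ratio = norm.pdf(l, 5, 1) / norm.pdf(l, 10, 2)         # p(l | c1) / p(l | c2)
    threshold = (lam12 - lam22) / (lam21 - lam11) * p2 / p1
    return "c1 (salmon)" if ratio > threshold else "c2 (bass)"

print(decide(6.0))   # ratio well above the threshold of 2 -> salmon
```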
Discriminant Functions
All decision rules have the same structure: at observation x, choose the class $c_i$ such that
$$g_i(x) > g_j(x) \quad \forall j \neq i$$
where $g_i$ is called a discriminant function.
ML decision rule: $g_i(x) = p(x \mid c_i)$
MAP decision rule: $g_i(x) = P(c_i \mid x)$
Bayes decision rule: $g_i(x) = -R(\alpha_i \mid x)$
Discriminant Functions
A classifier can be viewed as a network which computes m discriminant functions $g_1(x), g_2(x), \dots, g_m(x)$ from the features $x_1, x_2, \dots, x_d$ and selects the category corresponding to the largest discriminant.
Each $g_i(x)$ can be replaced with $f(g_i(x))$ for any monotonically increasing function f; the results will be unchanged.
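A sketch of this network view with hypothetical two-dimensional Gaussian classes, using the monotone transform $g_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$ of the MAP discriminant.

```python
# Discriminant functions g_1 ... g_m followed by an argmax over classes.
import numpy as np
from scipy.stats import multivariate_normal

priors = [0.4, 0.35, 0.25]                                   # hypothetical P(c_i)
classes = [multivariate_normal(mean=[0, 0], cov=np.eye(2)),  # hypothetical p(x | c_i)
           multivariate_normal(mean=[3, 3], cov=np.eye(2)),
           multivariate_normal(mean=[0, 4], cov=2 * np.eye(2))]

def classify(x):
    g = [c.logpdf(x) + np.log(p) for c, p in zip(classes, priors)]  # g_i(x)
    return int(np.argmax(g))      # class with the largest discriminant

print(classify([2.5, 2.0]))
```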
Decision Regions
The discriminant functions split the feature vector space X into decision regions.
(Figure: the feature space partitioned into regions labeled c1, c2, c3; in the region labeled c2, $g_2(x) = \max_i g_i(x)$. The regions for c1 and c3 each appear in two separate pieces.)
Important Points
If we know probability distributions for the
classes, we can design the optimal
classifier
Definition of “optimal” depends on the
chosen loss function
Under the minimum error rate (zero-one loss function):
No prior: ML classifier is optimal
Have prior: MAP classifier is optimal
More general loss function
General Bayes classifier is optimal
Acknowledgment
Selected slides in this presentation
are taken from the course materials
of Prof. Olga Veksler, University of
Western Ontario.
Source: https://www.csd.uwo.ca/~oveksler/Courses/CS434a_541a/
Used for non-commercial educational purposes
with proper credit to the original author.