The
Bayes Optimal Classifier
Machine Learning
1
Most probable classification
• In Bayesian learning, the primary question is: What is
the most probable hypothesis given data?
• We can also ask: For a new test point, what is the
most probable label, given training data?
• Is this the same as the prediction of the maximum a
posteriori hypothesis?
2
Most probable classification
Suppose our hypothesis space H has three functions h1, h2 and h3
• P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
• What is the MAP hypothesis?
• For a new instance x, suppose h1(x) = +1, h2(x) = -1 and h3(x) = -1
• What is the most probable classification of x? -1
P(+1 | x) = 0.4 P(-1| x) = 0.3 + 0.3
The most probable classification is not the same as the prediction of
the MAP hypothesis
3
Most probable classification
Suppose our hypothesis space H has three functions h1, h2 and h3
• P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
• What is the MAP hypothesis? h1
• For a new instance x, suppose h1(x) = +1, h2(x) = -1 and h3(x) = -1
• What is the most probable classification of x? -1
P(+1 | x) = 0.4 P(-1| x) = 0.3 + 0.3
The most probable classification is not the same as the prediction of
the MAP hypothesis
4
Most probable classification
Suppose our hypothesis space H has three functions h1, h2 and h3
• P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
• What is the MAP hypothesis? h1
• For a new instance x, suppose h1(x) = +1, h2(x) = -1 and h3(x) = -1
• What is the most probable classification of x? -1
P(+1 | x) = 0.4 P(-1| x) = 0.3 + 0.3
The most probable classification is not the same as the prediction of
the MAP hypothesis
5
Most probable classification
Suppose our hypothesis space H has three functions h1, h2 and h3
• P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
• What is the MAP hypothesis? h1
• For a new instance x, suppose h1(x) = +1, h2(x) = -1 and h3(x) = -1
• What is the most probable classification of x?
P(+1 | x) = 0.4 P(-1| x) = 0.3 + 0.3
The most probable classification is not the same as the prediction of
the MAP hypothesis
6
Most probable classification
Suppose our hypothesis space H has three functions h1, h2 and h3
• P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
• What is the MAP hypothesis? h1
• For a new instance x, suppose h1(x) = +1, h2(x) = -1 and h3(x) = -1
• What is the most probable classification of x? -1
P(+1 | x) = 0.4 P(-1| x) = 0.3 + 0.3
The most probable classification is not the same as the prediction of
the MAP hypothesis
7
Most probable classification
Suppose our hypothesis space H has three functions h1, h2 and h3
• P(h1 | D) = 0.4, P(h2 | D) = 0.3, P(h3 | D) = 0.3
• What is the MAP hypothesis? h1
• For a new instance x, suppose h1(x) = +1, h2(x) = -1 and h3(x) = -1
• What is the most probable classification of x? -1
P(+1 | x) = 0.4 P(-1| x) = 0.3 + 0.3
The most probable classification is not the same as the prediction of
the MAP hypothesis
8
Bayes Optimal Classifier
• How should we use the general formalism?
– What should H be?
H can be a collection of functions. H can be a collection of possible
predictions
• Given the training data, choose • Given the data, try to directly
an optimal function choose the optimal prediction
• Then, given new data, evaluate
the selected function on it
These two could be different!
Selecting a function vs. entertaining all options until the last minute
9
Bayes Optimal Classifier
• How should we use the general formalism?
– What should H be?
H can be a collection of functions. H can be a collection of possible
predictions
• Given the training data, choose • Given the data, try to directly
an optimal function choose the optimal prediction
• Then, given new data, evaluate
the selected function on it
These two could be different!
Selecting a function vs. entertaining all options until the last minute
10
Bayes Optimal Classifier
• How should we use the general formalism?
– What should H be?
H can be a collection of functions. H can be a collection of possible
predictions
• Given the training data, choose • Given the data, try to directly
an optimal function choose the optimal prediction
• Then, given new data, evaluate
the selected function on it
These two could be different!
Selecting a function vs. entertaining all options until the last minute
11
Bayes Optimal Classification
Defined as the label produced by the most probable classifier
Computing this can be hopelessly inefficient
And yet an interesting theoretical concept because, no other
classification method can outperform this method on average
(using the same hypothesis space and prior knowledge)
12