Machine Learning (CS251/CS340)
Lecture 03 - Probabilistic Inference
Spring 2024
Elen Vardanyan
evardanyan@aua.am
Probabilistic Inference
We flip the same coin 10 times and observe a sequence containing 3 tails (T) and 7 heads (H).
What is the probability that the next (11th) flip is T?
∼ 0    ∼ 0.3    ∼ 0.38    ∼ 0.5    ∼ 0.76    ∼ 1
A natural answer: 30%.
This seems reasonable, but why?
Every flip is random. ⇒ So every sequence of flips is random, i.e., it has some probability of being observed.
For the i-th coin flip we write Fi for its outcome (tails or heads).
To denote that the probability distribution depends on θi, we write p(Fi = fi | θi),
i.e. Fi ∼ Ber(θi), where θi is the probability that flip i lands tails.
Note the i in the index! We are trying to reason about θ11.
Maximum Likelihood Estimation (MLE)
All the randomness of a sequence of flips is governed (modeled) by the
parameters θ1, . . . , θ10:
p(F1 = f1, . . . , F10 = f10 | θ1, . . . , θ10)
What do we know about θ1, . . . , θ10? Can we infer something about θ11?
At first sight, there is no connection.
Find the θi’s such that
p(F1 = f1, . . . , F10 = f10 | θ1, . . . , θ10)
is as high as possible. This is a very important principle:
Maximize the likelihood of our observation.
(Maximum Likelihood)
We need to model this.
First assumption: The coin flips do not affect each other — independence:
p(f1, . . . , f10 | θ1, . . . , θ10) = p1(f1 | θ1) · p2(f2 | θ2) · . . . · p10(f10 | θ10)
Notice the i in pi, θi! This indicates that the coin flip at time 1 is different
from the one at time 2, and so on.
But the coin does not change.
Second assumption: The flips are qualitatively the same — identical distribution,
i.e. pi = p and θi = θ for all i.
In total: The 10 flips are independent and identically distributed (i.i.d.).
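As a small aside (not from the slides), an i.i.d. Bernoulli sequence like this is easy to simulate; the value of theta_true and the encoding tails = 1 below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
theta_true = 0.3   # assumed "true" probability of tails (illustration only)
flips = rng.binomial(n=1, p=theta_true, size=10)   # f1, ..., f10, i.i.d. Ber(theta_true); 1 = tails

n_tails = int(flips.sum())
n_heads = 10 - n_tails
print("sequence:", flips, " |T| =", n_tails, " |H| =", n_heads)
```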
Remember θ11? With the i.i.d. assumption we can link it to θ1, . . . , θ10: there is a
single θ shared by all flips. Now we can write down the probability of our sequence
with respect to θ.
Under our model assumptions (i.i.d.):
p(f1, . . . , f10 | θ) = p(f1 | θ) · p(f2 | θ) · . . . · p(f10 | θ)
This can be interpreted as a function f(θ) := p(f1, . . . , f10 | θ). We want to find the
maxima (maximum likelihood) of this function.
Our goal: find the θ that maximizes the likelihood, i.e. θ* = argmax_θ f(θ).
Very important: the likelihood function is not a probability distribution over θ,
since in general
∫ f(θ) dθ = ∫ p(f1, . . . , f10 | θ) dθ ≠ 1
How do we maximize the likelihood function?
Take the derivative df/dθ, set it to 0, and solve for θ.
Check the resulting critical points using the second derivative.
This is possible, but even for our simple f(θ) the math is rather ugly.
Can we simplify the problem?
Luckily, strictly monotonic functions preserve critical points, so we can maximize
the log-likelihood log f(θ) instead of f(θ) itself.
Maximum Likelihood Estimation (MLE) for any coin sequence?
Let |T|, |H| denote the number of tails and heads, respectively. Maximizing the
(log-)likelihood gives
θ_MLE = |T| / (|T| + |H|)
Remember we wanted to find the probability that the next coin flip is T.
For our sequence this gives θ_MLE = 3/10 = 0.3.
This justifies 30% as a reasonable answer to our initial question. Problem solved?!
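The closed form follows from the log-likelihood trick mentioned above; a short derivation sketch (my notation, using the counts |T| and |H| just defined):

```latex
\begin{align*}
\log f(\theta) &= \log\bigl(\theta^{|T|}\,(1-\theta)^{|H|}\bigr)
                = |T|\log\theta + |H|\log(1-\theta)\\
\frac{\mathrm{d}}{\mathrm{d}\theta}\log f(\theta)
               &= \frac{|T|}{\theta} - \frac{|H|}{1-\theta} \overset{!}{=} 0
  \quad\Longrightarrow\quad
  \theta_{\mathrm{MLE}} = \frac{|T|}{|T|+|H|}
\end{align*}
% The second derivative, -|T|/\theta^2 - |H|/(1-\theta)^2, is negative,
% so this critical point is indeed a maximum.
```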
Just for fun, consider a totally different sequence (same coin!): two flips that both
land heads, i.e. |T| = 0, |H| = 2, so θ_MLE = 0.
But even a fair coin (θ = 0.5) has a 25% chance of showing this result!
The MLE solution seems counter-intuitive. Why?
We have prior beliefs: “Coins usually don’t land heads all the time.”
How can we
1. represent such beliefs mathematically?
2. incorporate them into our model?
Questions
● What distribution did we use to model the coin flips?
● Which assumptions did we make in the coin flip example that helped us find the
parameter?
● What does the Likelihood function represent?
● Why did we use the logarithm of the Likelihood function for the MLE?
Provide an example.
● What was the MLE for the coin flip example?
● What was problematic about the MLE?
Bayesian Inference
How can we represent our beliefs about θ mathematically?
(Subjective) Bayesian interpretation of probability:
Distribution p(θ) reflects our subjective beliefs about θ.
A prior distribution p(θ) represents our beliefs before we observe any data.
How do we choose p(θ)? The only constraints are
1. It must not depend on the data
2. p(θ) ≥ 0 for all θ
3. ∫ p(θ) dθ = 1
Properties 2 and 3 have to hold on the support (i.e., feasible values) of θ.
In our setting, only values θ ∈ [0, 1] make sense.
This leaves room for (possibly subjective) model choices!
Some possible choices for the prior on θ:
The Bayes formula tells us how we update our beliefs about θ after observing
the data:
p(θ | f1, . . . , f10) = p(f1, . . . , f10 | θ) p(θ) / p(f1, . . . , f10)
posterior ∝ likelihood · prior
Here, p(θ | f1, . . . , f10) is the posterior distribution. It encodes our beliefs in the value of θ
after observing the data.
The posterior depends on the following terms:
● p(f1, . . . , f10 | θ) is the likelihood.
● p(θ) is the prior that encodes our beliefs before observing the data.
● p(f1, . . . , f10) is the evidence. It acts as a normalizing constant that ensures that the
posterior distribution integrates to 1.
Usually, we define our model by specifying the likelihood and the prior.
We can obtain the evidence using the sum rule of probability:
p(f1, . . . , f10) = ∫ p(f1, . . . , f10 | θ) p(θ) dθ
The Bayes formula tells us how to update our beliefs given the data.
Observing more data increases our confidence
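Both statements can be checked numerically. Below is a minimal sketch (my own illustration, not from the slides) that discretizes θ on a grid, applies “posterior ∝ likelihood · prior”, renormalizes, and shows the posterior concentrating as the (hypothetical) counts grow.

```python
import numpy as np

# Discretize theta on a grid and apply "posterior ∝ likelihood · prior" numerically.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)          # assumed uniform prior p(theta)
prior /= prior.sum()                 # normalize on the grid

def update(prior, n_tails, n_heads):
    """Multiply the prior by the Bernoulli likelihood and renormalize (divide by the evidence)."""
    likelihood = theta**n_tails * (1 - theta)**n_heads
    unnorm = likelihood * prior
    return unnorm / unnorm.sum()

# More data -> the posterior concentrates, i.e. we become more confident.
for n_tails, n_heads in [(3, 7), (30, 70), (300, 700)]:   # hypothetical counts
    post = update(prior, n_tails, n_heads)
    mean = (theta * post).sum()
    std = np.sqrt(((theta - mean) ** 2 * post).sum())
    print(f"|T|={n_tails:4d} |H|={n_heads:4d}  posterior mean={mean:.3f}  std={std:.3f}")
```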
Question: What happens if p(θ) = 0 for some particular θ?
Recall: p(θ | f1, . . . , f10) ∝ p(f1, . . . , f10 | θ) p(θ).
The posterior will then always be zero for that particular θ, regardless of the
likelihood/data.
Maximum a Posteriori Estimation (MAP)
Back to our coin problem: How do we estimate θ from the data?
In MLE, we were asking the wrong question: we maximized p(f1, . . . , f10 | θ) over θ.
MLE ignores our prior beliefs and performs poorly if little data is available.
Actually, we should care about the posterior distribution p(θ | f1, . . . , f10).
What if we instead maximize the posterior probability,
θ_MAP = argmax_θ p(θ | f1, . . . , f10)?
This approach is called maximum a posteriori (MAP) estimation.
Maximum a posteriori estimation
Using the Bayes formula,
θ_MAP = argmax_θ p(θ | f1, . . . , f10) = argmax_θ p(f1, . . . , f10 | θ) p(θ).
We can ignore p(f1, . . . , f10) since it’s a (positive) constant independent of θ.
We already know the likelihood p(f1, . . . , f10 | θ) from before; how do we choose the
prior p(θ)?
Often, we choose the prior to make subsequent calculations easier.
We choose the Beta distribution, for reasons that will become clear later:
p(θ | a, b) = Beta(θ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] · θ^(a−1) · (1 − θ)^(b−1),
where
● a > 0, b > 0 are the distribution parameters
● Γ is the gamma function, which satisfies Γ(n) = (n − 1)! for n ∈ ℕ
The Beta PDF for different choices of a and b:
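The plot itself is not reproduced here; as a substitute, a small scipy/matplotlib sketch that draws the Beta PDF for a few example (a, b) pairs (the specific values are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 500)

# A few example (a, b) pairs: uniform, symmetric "bell", skewed, and U-shaped.
for a, b in [(1, 1), (5, 5), (2, 8), (0.5, 0.5)]:
    plt.plot(theta, beta.pdf(theta, a, b), label=f"Beta({a}, {b})")

plt.xlabel("theta")
plt.ylabel("p(theta | a, b)")
plt.legend()
plt.show()
```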
Let’s put everything together:
θ_MAP = argmax_θ p(θ | f1, . . . , f10) = argmax_θ p(f1, . . . , f10 | θ) p(θ),
because p(f1, . . . , f10) is constant w.r.t. θ.
We know
p(f1, . . . , f10 | θ) = θ^|T| · (1 − θ)^|H|   and   p(θ) = Beta(θ | a, b) ∝ θ^(a−1) · (1 − θ)^(b−1).
So we get:
p(θ | f1, . . . , f10) ∝ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1)
We are looking for
θ_MAP = argmax_θ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1).
As before, the problem becomes much easier if we consider the logarithm.
With some algebra we obtain
θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
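The omitted algebra mirrors the MLE derivation; a sketch of the log-posterior maximization (assuming |T| + a > 1 and |H| + b > 1, so the maximum lies in the interior of [0, 1]):

```latex
\begin{align*}
\log p(\theta \mid f_1,\ldots,f_{10})
  &= (|T|+a-1)\log\theta + (|H|+b-1)\log(1-\theta) + \text{const.}\\
\frac{\mathrm{d}}{\mathrm{d}\theta}\log p(\theta \mid f_1,\ldots,f_{10})
  &= \frac{|T|+a-1}{\theta} - \frac{|H|+b-1}{1-\theta} \overset{!}{=} 0\\
\Longrightarrow\quad
\theta_{\mathrm{MAP}} &= \frac{|T|+a-1}{|T|+|H|+a+b-2}
\end{align*}
```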
Questions
● What conditions does a prior distribution need to satisfy?
● What distribution did we use to model the prior for the coin flip example?
Why?
● Why do we become more confident in θ as the amount of data increases?
(+ visual interpretation)
● What happens to the posterior if p(θ) = 0 for some particular θ?
● What was the difference between MLE and MAP?
● Why would we want the full posterior?
How did we get it for the coin flip example?
Estimating the Posterior Distribution
What we have so far:
θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2),
the most probable value of θ under the posterior distribution.
Is this the best we can do?
● How certain are we in our estimate?
● What is the probability that θ lies in some interval?
For this, we need to consider the entire posterior distribution p(θ | f1, . . . , f10),
not just its mode θ_MAP.
We know the posterior up to a normalizing constant:
p(θ | f1, . . . , f10) ∝ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1)
Finding the true posterior boils down to finding the normalization constant
such that the distribution integrates to 1.
Option 1: Brute-force calculation
● Compute the integral ∫₀¹ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1) dθ directly.
● This is tedious, difficult and boring. Any alternatives?
Option 2: Pattern matching
● The unnormalized posterior θ^(|T|+a−1) · (1 − θ)^(|H|+b−1) looks similar to the
PDF of the Beta distribution.
● Can we use this fact?
The unnormalized posterior:
θ^(|T|+a−1) · (1 − θ)^(|H|+b−1)
The Beta distribution:
Beta(θ | a′, b′) = [Γ(a′ + b′) / (Γ(a′) Γ(b′))] · θ^(a′−1) · (1 − θ)^(b′−1),   with a′ = a + |T|, b′ = b + |H|.
Thus, we can conclude that the appropriate normalizing constant in front of the
posterior should be
Γ(|T| + a + |H| + b) / (Γ(|T| + a) Γ(|H| + b)),
and the posterior is a Beta distribution:
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|)
Always remember this trick when you try to solve integrals that
involve known pdfs (up to a constant factor)!
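As a quick numerical sanity check of this trick (a sketch with assumed counts and prior parameters, not values from the slides), we can integrate the unnormalized posterior and compare with the Beta normalizing constant:

```python
from scipy.special import gamma
from scipy.integrate import quad

a, b = 2.0, 2.0          # assumed prior parameters (illustration only)
nT, nH = 3, 7            # assumed counts of tails and heads

# Unnormalized posterior: theta^(|T|+a-1) * (1-theta)^(|H|+b-1)
unnorm = lambda t: t**(nT + a - 1) * (1 - t)**(nH + b - 1)

integral, _ = quad(unnorm, 0.0, 1.0)
const = gamma(nT + a + nH + b) / (gamma(nT + a) * gamma(nH + b))

print("1 / integral of unnormalized posterior:", 1.0 / integral)
print("Beta normalizing constant             :", const)   # the two values should match
```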
We started with the following prior distribution:
p(θ) = Beta(θ | a, b)
and obtained the following posterior:
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|)
Was this just a lucky coincidence?
No, this is an instance of a more general principle: the Beta distribution is a
conjugate prior for the Bernoulli likelihood.
If a prior is conjugate for the given likelihood, then the posterior will be of the
same family as the prior.
In our case, we can interpret the parameters a, b of the prior as the number of
tails and heads that we saw in the past.
What are the advantages of the fully Bayesian approach?
We have an entire distribution, not just a point estimate
We can answer questions such as:
● What is the expected value of θ under p(θ | f1, . . . , f10)?
● What is the variance of θ under p(θ | f1, . . . , f10)?
● Find a credible interval [l, u] such that p(l ≤ θ ≤ u | f1, . . . , f10) = 1 − α for a chosen level α
(not to be confused with frequentist confidence intervals).
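With the Beta posterior derived above, these quantities come out of scipy directly; a small sketch (the prior parameters and counts here are assumptions for illustration):

```python
from scipy.stats import beta

a, b = 1, 1              # assumed uniform prior
nT, nH = 3, 7            # assumed observed counts

posterior = beta(a + nT, b + nH)     # p(theta | data) = Beta(a + |T|, b + |H|)

print("posterior mean    :", posterior.mean())        # E[theta | data]
print("posterior variance:", posterior.var())
print("95% equal-tailed credible interval:", posterior.interval(0.95))
```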
We learned about three approaches for parameter estimation:
Maximum likelihood estimation (MLE)
● Goal: Solve the optimization problem θ_MLE = argmax_θ p(f1, . . . , f10 | θ)
● Result: Point estimate
● Coin example: θ_MLE = |T| / (|T| + |H|)
Maximum a posteriori (MAP) estimation
● Goal: Solve the optimization problem θ_MAP = argmax_θ p(θ | f1, . . . , f10)
● Result: Point estimate
● Coin example: θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
Estimating the posterior distribution
● Goal: Find the normalizing constant of p(f1, . . . , f10 | θ) p(θ)
● Result: Full distribution
● Coin example: p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|)
The three approaches are closely connected.
The posterior distribution is
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|).
Recall that the mode of Beta(θ | a, b) is (a − 1) / (a + b − 2), for a, b > 1.
We see that the MAP solution is the mode of the posterior distribution:
θ_MAP = (a + |T| − 1) / (a + |T| + b + |H| − 2).
If we choose a uniform prior (i.e. a = b = 1) we obtain the MLE solution:
θ_MAP = |T| / (|T| + |H|) = θ_MLE.
All these nice formulas are a consequence of choosing a conjugate prior. Had we
chosen a non-conjugate prior, θ_MAP and p(θ | f1, . . . , f10) might not have a closed form.
How many flips?
We had the posterior p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|).
Visualize the posterior (for given prior, e.g. a = b = 1):
With more data the posterior becomes more peaky – we are more certain
about our estimate of θ
Alternative view: a frequentist perspective
For MLE we had θ_MLE = |T| / (|T| + |H|).
Clearly, we get the same result for |T| = 1,|H| = 4 and |T| = 10,|H| = 40.
Which one is better? Why?
How many flips? Hoeffding’s inequality gives a sampling complexity bound:
p(|θ_MLE − θ| ≥ ε) ≤ 2 exp(−2Nε²),
where N = |T| + |H|.
For example, I want to know θ within ε = 0.1 error with probability at least 1 − δ = 0.99.
We have: requiring 2 exp(−2Nε²) ≤ δ gives N ≥ ln(2/δ) / (2ε²) = ln(200) / 0.02 ≈ 265 flips.
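The same computation as a short script (ε and δ as in the example above):

```python
import math

eps, delta = 0.1, 0.01                 # desired accuracy and failure probability
N = math.ceil(math.log(2 / delta) / (2 * eps**2))
print("flips needed:", N)              # prints 265
```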
Predicting the Next Flip
Remember that we want to predict the next coin flip...
For MLE:
1. Estimate θ_MLE = |T| / (|T| + |H|) from the data.
2. The probability that the next flip lands tails is θ_MLE.
For MAP:
1. Estimate θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2) from the data.
2. The probability that the next flip lands tails is θ_MAP.
What if we have the entire posterior?
We have estimated the posterior distribution p(θ | f1, . . . , f10) of the parameter θ.
Now, we want to compute the probability that the next coin flip is T, given the
observations f1, . . . , f10 and the prior belief a, b:
p(f11 = T | f1, . . . , f10, a, b)
This distribution is called the posterior predictive distribution.
It is different from the posterior over the parameters p(θ | f1, . . . , f10)!
So how do we obtain the posterior predictive distribution?
For simplicity, denote the outcome of the next flip as f ∈ {T, H}.
We already know the posterior over the parameters, p(θ | f1, . . . , f10).
Using the sum rule of probability (law of total probability):
p(f | f1, . . . , f10) = ∫ p(f, θ | f1, . . . , f10) dθ    (“reverse” marginalization)
= ∫ p(f | θ, f1, . . . , f10) · p(θ | f1, . . . , f10) dθ    (chain rule of probability)
= ∫ p(f | θ) · p(θ | f1, . . . , f10) dθ    (conditional independence)
The last equality follows from the conditional independence assumption:
“If we know θ, the next flip f is independent of the previous flips f1, . . . , f10.”
Recall that
p(f = T | θ) = θ
and
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|).
Substituting these expressions and doing some algebra we get
p(f = T | f1, . . . , f10, a, b) = (a + |T|) / (a + b + |T| + |H|)
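The omitted algebra is just the mean of a Beta distribution; a short sketch:

```latex
\begin{align*}
p(f = T \mid f_1,\ldots,f_{10}, a, b)
 &= \int_0^1 \theta \,\mathrm{Beta}(\theta \mid a+|T|,\, b+|H|)\,\mathrm{d}\theta
  = \mathbb{E}[\theta \mid f_1,\ldots,f_{10}]\\
 &= \frac{a+|T|}{a+|T|+b+|H|}
\end{align*}
% using the known mean of a Beta(\alpha, \beta) distribution, \alpha / (\alpha + \beta).
```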
Note that the posterior predictive distribution doesn’t contain θ — we have marginalized it out!
We call this approach fully Bayesian analysis.
Predictions using different approaches
● MLE: θ_MLE = |T| / (|T| + |H|)
● MAP: θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
● Fully Bayesian: p(f = T | f1, . . . , f10, a, b) = (|T| + a) / (|T| + |H| + a + b)
Given the prior a = b = 5 and the counts |T| = 4, |H| = 8:
MLE = 4/12 ≈ 0.33,   MAP = 8/20 = 0.40,   fully Bayesian = 9/22 ≈ 0.41.
How about if we have |T| = 304, |H| = 306?
MLE = 304/610 ≈ 0.498,   MAP = 308/618 ≈ 0.498,   fully Bayesian = 309/620 ≈ 0.498.
As we observe lots of data, the differences in predictions become less
noticeable.
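A small script (my own sketch) that reproduces these numbers from the three formulas above:

```python
def predictions(n_tails, n_heads, a, b):
    """MLE, MAP, and fully Bayesian predictions for p(next flip = T)."""
    mle = n_tails / (n_tails + n_heads)
    map_estimate = (n_tails + a - 1) / (n_tails + n_heads + a + b - 2)
    fully_bayesian = (n_tails + a) / (n_tails + n_heads + a + b)
    return mle, map_estimate, fully_bayesian

for n_tails, n_heads in [(4, 8), (304, 306)]:
    mle, map_est, bayes = predictions(n_tails, n_heads, a=5, b=5)
    print(f"|T|={n_tails:3d} |H|={n_heads:3d}  MLE={mle:.3f}  MAP={map_est:.3f}  Bayes={bayes:.3f}")
```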
Questions
● What is the connection between prior, likelihood and the posterior?
What about the graphical interpretation?
● What does a conjugate prior for a specific likelihood mean?
● What is the main difference between MAP and the fully Bayesian approach?
What additional information do we have in the second approach?
● What information does Hoeffding’s inequality provide?
References
● ”Machine Learning: A Probabilistic Perspective” by Murphy
— chapters 3.1 - 3.3, 4.2, 4.5
● “Pattern Recognition and Machine Learning” by Bishop
— chapters 2.1 - 2.4 (more importantly, 2.3-2.4)
● Slides adapted from Stephan Günnemann’s Machine
Learning Course