Machine Learning (CS251/CS340)
Lecture 03 - Probabilistic Inference
Spring 2024
Elen Vardanyan
evardanyan@aua.am
Probabilistic Inference
We flip the same coin 10 times and observe a sequence containing 3 tails (T) and 7 heads (H).
What is the probability that the next (11th) flip is T?
∼ 0    ∼ 0.3    ∼ 0.38    ∼ 0.5    ∼ 0.76    ∼ 1
A natural answer: 30%.
This seems reasonable, but why?
Every flip is random. ⇒ So every sequence of flips is random, i.e., it has some probability of being observed.
For the i-th coin flip we write Fi for its outcome (tails or heads).
To denote that the probability distribution depends on θi, we write p(Fi = fi | θi),
i.e. Fi ∼ Ber(θi), where θi is the probability that flip i lands tails.
Note the i in the index! We are trying to reason about θ11.
Maximum Likelihood Estimation (MLE)
All the randomness of a sequence of flips is governed (modeled) by the
parameters θ1, . . . , θ10:
p(F1 = f1, . . . , F10 = f10 | θ1, . . . , θ10)
What do we know about θ1, . . . , θ10? Can we infer something about θ11?
At first sight, there is no connection.
Find the θi’s such that
p(F1 = f1, . . . , F10 = f10 | θ1, . . . , θ10)
is as high as possible. This is a very important principle:
Maximize the likelihood of our observation.
(Maximum Likelihood)
We need to model this.
First assumption: The coin flips do not affect each other — independence:
p(f1, . . . , f10 | θ1, . . . , θ10) = p1(f1 | θ1) · p2(f2 | θ2) · . . . · p10(f10 | θ10)
Notice the i in pi, θi! This indicates that the coin flip at time 1 is different
from the one at time 2, and so on.
But the coin does not change.
Second assumption: The flips are qualitatively the same — identical distribution,
i.e. pi = p and θi = θ for all i.
In total: The 10 flips are independent and identically distributed (i.i.d.).
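As a small aside (not from the slides), an i.i.d. Bernoulli sequence like this is easy to simulate; the value of theta_true and the encoding tails = 1 below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
theta_true = 0.3   # assumed "true" probability of tails (illustration only)
flips = rng.binomial(n=1, p=theta_true, size=10)   # f1, ..., f10, i.i.d. Ber(theta_true); 1 = tails

n_tails = int(flips.sum())
n_heads = 10 - n_tails
print("sequence:", flips, " |T| =", n_tails, " |H| =", n_heads)
```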
Remember θ11? With the i.i.d. assumption we can link it to θ1, . . . , θ10: there is a
single θ shared by all flips. Now we can write down the probability of our sequence
with respect to θ.
Under our model assumptions (i.i.d.):
p(f1, . . . , f10 | θ) = p(f1 | θ) · p(f2 | θ) · . . . · p(f10 | θ)
This can be interpreted as a function f(θ) := p(f1, . . . , f10 | θ). We want to find the
maxima (maximum likelihood) of this function.
Our goal: find the θ that maximizes the likelihood, i.e. θ* = argmax_θ f(θ).
Very important: the likelihood function is not a probability distribution over θ,
since in general
∫ f(θ) dθ = ∫ p(f1, . . . , f10 | θ) dθ ≠ 1
How do we maximize the likelihood function?
Take the derivative df/dθ, set it to 0, and solve for θ.
Check the resulting critical points using the second derivative.
This is possible, but even for our simple f(θ) the math is rather ugly.
Can we simplify the problem?
Luckily, strictly monotonic functions preserve critical points, so we can maximize
the log-likelihood log f(θ) instead of f(θ) itself.
Maximum Likelihood Estimation (MLE) for any coin sequence?
Let |T|, |H| denote the number of tails and heads, respectively. Maximizing the
(log-)likelihood gives
θ_MLE = |T| / (|T| + |H|)
Remember we wanted to find the probability that the next coin flip is T.
For our sequence this gives θ_MLE = 3/10 = 0.3.
This justifies 30% as a reasonable answer to our initial question. Problem solved?!
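The closed form follows from the log-likelihood trick mentioned above; a short derivation sketch (my notation, using the counts |T| and |H| just defined):

```latex
\begin{align*}
\log f(\theta) &= \log\bigl(\theta^{|T|}\,(1-\theta)^{|H|}\bigr)
                = |T|\log\theta + |H|\log(1-\theta)\\
\frac{\mathrm{d}}{\mathrm{d}\theta}\log f(\theta)
               &= \frac{|T|}{\theta} - \frac{|H|}{1-\theta} \overset{!}{=} 0
  \quad\Longrightarrow\quad
  \theta_{\mathrm{MLE}} = \frac{|T|}{|T|+|H|}
\end{align*}
% The second derivative, -|T|/\theta^2 - |H|/(1-\theta)^2, is negative,
% so this critical point is indeed a maximum.
```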
Just for fun, consider a totally different sequence (same coin!): two flips that both
land heads, i.e. |T| = 0, |H| = 2, so θ_MLE = 0.
But even a fair coin (θ = 0.5) has a 25% chance of showing this result!
The MLE solution seems counter-intuitive. Why?
We have prior beliefs: “Coins usually don’t land heads all the time.”
How can we
1. represent such beliefs mathematically?
2. incorporate them into our model?
Questions
● What distribution did we use to model the coin flips?
● Which assumptions did we make in the coin flip example that helped us find the
parameter?
● What does the Likelihood function represent?
● Why did we use the logarithm of the Likelihood function for the MLE?
Provide an example.
● What was the MLE for the coin flip example?
● What was problematic about the MLE?
Bayesian Inference
How can we represent our beliefs about θ mathematically?
(Subjective) Bayesian interpretation of probability:
Distribution p(θ) reflects our subjective beliefs about θ.
A prior distribution p(θ) represents our beliefs before we observe any data.
How do we choose p(θ)? The only constraints are
1. It must not depend on the data
2. p(θ) ≥ 0 for all θ
3. ∫ p(θ) dθ = 1
Properties 2 and 3 have to hold on the support (i.e., feasible values) of θ.
In our setting, only values θ ∈ [0, 1] make sense.
This leaves room for (possibly subjective) model choices!
Some possible choices for the prior on θ:
The Bayes formula tells us how we update our beliefs about θ after observing
the data:
p(θ | f1, . . . , f10) = p(f1, . . . , f10 | θ) p(θ) / p(f1, . . . , f10)
posterior ∝ likelihood · prior
Here, p(θ | f1, . . . , f10) is the posterior distribution. It encodes our beliefs in the value of θ
after observing the data.
The posterior depends on the following terms:
● p(f1, . . . , f10 | θ) is the likelihood.
● p(θ) is the prior that encodes our beliefs before observing the data.
● p(f1, . . . , f10) is the evidence. It acts as a normalizing constant that ensures that the
posterior distribution integrates to 1.
Usually, we define our model by specifying the likelihood and the prior.
We can obtain the evidence using the sum rule of probability:
p(f1, . . . , f10) = ∫ p(f1, . . . , f10 | θ) p(θ) dθ
The Bayes formula tells us how to update our beliefs given the data.
Observing more data increases our confidence
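Both statements can be checked numerically. Below is a minimal sketch (my own illustration, not from the slides) that discretizes θ on a grid, applies “posterior ∝ likelihood · prior”, renormalizes, and shows the posterior concentrating as the (hypothetical) counts grow.

```python
import numpy as np

# Discretize theta on a grid and apply "posterior ∝ likelihood · prior" numerically.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)          # assumed uniform prior p(theta)
prior /= prior.sum()                 # normalize on the grid

def update(prior, n_tails, n_heads):
    """Multiply the prior by the Bernoulli likelihood and renormalize (divide by the evidence)."""
    likelihood = theta**n_tails * (1 - theta)**n_heads
    unnorm = likelihood * prior
    return unnorm / unnorm.sum()

# More data -> the posterior concentrates, i.e. we become more confident.
for n_tails, n_heads in [(3, 7), (30, 70), (300, 700)]:   # hypothetical counts
    post = update(prior, n_tails, n_heads)
    mean = (theta * post).sum()
    std = np.sqrt(((theta - mean) ** 2 * post).sum())
    print(f"|T|={n_tails:4d} |H|={n_heads:4d}  posterior mean={mean:.3f}  std={std:.3f}")
```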
Question: What happens if p(θ) = 0 for some particular θ?
Recall: p(θ | f1, . . . , f10) ∝ p(f1, . . . , f10 | θ) p(θ).
The posterior will then always be zero for that particular θ, regardless of the
likelihood/data.
Maximum a Posteriori Estimation (MAP)
Back to our coin problem: How do we estimate θ from the data?
In MLE, we were asking the wrong question: we maximized p(f1, . . . , f10 | θ) over θ.
MLE ignores our prior beliefs and performs poorly if little data is available.
Actually, we should care about the posterior distribution p(θ | f1, . . . , f10).
What if we instead maximize the posterior probability,
θ_MAP = argmax_θ p(θ | f1, . . . , f10)?
This approach is called maximum a posteriori (MAP) estimation.
Maximum a posteriori estimation
Using the Bayes formula,
θ_MAP = argmax_θ p(θ | f1, . . . , f10) = argmax_θ p(f1, . . . , f10 | θ) p(θ).
We can ignore p(f1, . . . , f10) since it’s a (positive) constant independent of θ.
We already know the likelihood p(f1, . . . , f10 | θ) from before; how do we choose the
prior p(θ)?
Often, we choose the prior to make subsequent calculations easier.
We choose the Beta distribution, for reasons that will become clear later:
p(θ | a, b) = Beta(θ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] · θ^(a−1) · (1 − θ)^(b−1),
where
● a > 0, b > 0 are the distribution parameters
● Γ is the gamma function, which satisfies Γ(n) = (n − 1)! for n ∈ ℕ
The Beta PDF for different choices of a and b:
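The plot itself is not reproduced here; as a substitute, a small scipy/matplotlib sketch that draws the Beta PDF for a few example (a, b) pairs (the specific values are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 500)

# A few example (a, b) pairs: uniform, symmetric "bell", skewed, and U-shaped.
for a, b in [(1, 1), (5, 5), (2, 8), (0.5, 0.5)]:
    plt.plot(theta, beta.pdf(theta, a, b), label=f"Beta({a}, {b})")

plt.xlabel("theta")
plt.ylabel("p(theta | a, b)")
plt.legend()
plt.show()
```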
Let’s put everything together:
θ_MAP = argmax_θ p(θ | f1, . . . , f10) = argmax_θ p(f1, . . . , f10 | θ) p(θ),
because p(f1, . . . , f10) is constant w.r.t. θ.
We know
p(f1, . . . , f10 | θ) = θ^|T| · (1 − θ)^|H|   and   p(θ) = Beta(θ | a, b) ∝ θ^(a−1) · (1 − θ)^(b−1).
So we get:
p(θ | f1, . . . , f10) ∝ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1)
We are looking for
θ_MAP = argmax_θ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1).
As before, the problem becomes much easier if we consider the logarithm.
With some algebra we obtain
θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
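The omitted algebra mirrors the MLE derivation; a sketch of the log-posterior maximization (assuming |T| + a > 1 and |H| + b > 1, so the maximum lies in the interior of [0, 1]):

```latex
\begin{align*}
\log p(\theta \mid f_1,\ldots,f_{10})
  &= (|T|+a-1)\log\theta + (|H|+b-1)\log(1-\theta) + \text{const.}\\
\frac{\mathrm{d}}{\mathrm{d}\theta}\log p(\theta \mid f_1,\ldots,f_{10})
  &= \frac{|T|+a-1}{\theta} - \frac{|H|+b-1}{1-\theta} \overset{!}{=} 0\\
\Longrightarrow\quad
\theta_{\mathrm{MAP}} &= \frac{|T|+a-1}{|T|+|H|+a+b-2}
\end{align*}
```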
Questions
● What conditions does a prior distribution need to satisfy?
● What distribution did we use to model the prior for the coin flip example?
Why?
● Why do we become more confident in θ as the amount of data increases?
(+ visual interpretation)
● What happens to the posterior if p(θ) = 0 for some particular θ?
● What was the difference between MLE and MAP?
● Why would we want the full posterior?
How did we get it for the coin flip example?
Estimating the Posterior Distribution
What we have so far:
θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2),
the most probable value of θ under the posterior distribution.
Is this the best we can do?
● How certain are we in our estimate?
● What is the probability that θ lies in some interval?
For this, we need to consider the entire posterior distribution p(θ | f1, . . . , f10),
not just its mode θ_MAP.
We know the posterior up to a normalizing constant:
p(θ | f1, . . . , f10) ∝ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1)
Finding the true posterior boils down to finding the normalization constant
such that the distribution integrates to 1.
Option 1: Brute-force calculation
● Compute the integral ∫₀¹ θ^(|T|+a−1) · (1 − θ)^(|H|+b−1) dθ directly.
● This is tedious, difficult and boring. Any alternatives?
Option 2: Pattern matching
● The unnormalized posterior θ^(|T|+a−1) · (1 − θ)^(|H|+b−1) looks similar to the
PDF of the Beta distribution.
● Can we use this fact?
The unnormalized posterior:
θ^(|T|+a−1) · (1 − θ)^(|H|+b−1)
The Beta distribution:
Beta(θ | a′, b′) = [Γ(a′ + b′) / (Γ(a′) Γ(b′))] · θ^(a′−1) · (1 − θ)^(b′−1),   with a′ = a + |T|, b′ = b + |H|.
Thus, we can conclude that the appropriate normalizing constant in front of the
posterior should be
Γ(|T| + a + |H| + b) / (Γ(|T| + a) Γ(|H| + b)),
and the posterior is a Beta distribution:
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|)
Always remember this trick when you try to solve integrals that
involve known pdfs (up to a constant factor)!
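As a quick numerical sanity check of this trick (a sketch with assumed counts and prior parameters, not values from the slides), we can integrate the unnormalized posterior and compare with the Beta normalizing constant:

```python
from scipy.special import gamma
from scipy.integrate import quad

a, b = 2.0, 2.0          # assumed prior parameters (illustration only)
nT, nH = 3, 7            # assumed counts of tails and heads

# Unnormalized posterior: theta^(|T|+a-1) * (1-theta)^(|H|+b-1)
unnorm = lambda t: t**(nT + a - 1) * (1 - t)**(nH + b - 1)

integral, _ = quad(unnorm, 0.0, 1.0)
const = gamma(nT + a + nH + b) / (gamma(nT + a) * gamma(nH + b))

print("1 / integral of unnormalized posterior:", 1.0 / integral)
print("Beta normalizing constant             :", const)   # the two values should match
```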
We started with the following prior distribution:
p(θ) = Beta(θ | a, b)
and obtained the following posterior:
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|)
Was this just a lucky coincidence?
No, this is an instance of a more general principle: the Beta distribution is a
conjugate prior for the Bernoulli likelihood.
If a prior is conjugate for the given likelihood, then the posterior will be of the
same family as the prior.
In our case, we can interpret the parameters a, b of the prior as the number of
tails and heads that we saw in the past.
What are the advantages of the fully Bayesian approach?
We have an entire distribution, not just a point estimate
We can answer questions such as:
● What is the expected value of θ under p(θ | f1, . . . , f10)?
● What is the variance of θ under p(θ | f1, . . . , f10)?
● Find a credible interval [l, u] such that p(l ≤ θ ≤ u | f1, . . . , f10) = 1 − α for a chosen level α
(not to be confused with frequentist confidence intervals).
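With the Beta posterior derived above, these quantities come out of scipy directly; a small sketch (the prior parameters and counts here are assumptions for illustration):

```python
from scipy.stats import beta

a, b = 1, 1              # assumed uniform prior
nT, nH = 3, 7            # assumed observed counts

posterior = beta(a + nT, b + nH)     # p(theta | data) = Beta(a + |T|, b + |H|)

print("posterior mean    :", posterior.mean())        # E[theta | data]
print("posterior variance:", posterior.var())
print("95% equal-tailed credible interval:", posterior.interval(0.95))
```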
We learned about three approaches for parameter estimation:
Maximum likelihood estimation (MLE)
● Goal: Solve the optimization problem θ_MLE = argmax_θ p(f1, . . . , f10 | θ)
● Result: Point estimate
● Coin example: θ_MLE = |T| / (|T| + |H|)
Maximum a posteriori (MAP) estimation
● Goal: Solve the optimization problem θ_MAP = argmax_θ p(θ | f1, . . . , f10)
● Result: Point estimate
● Coin example: θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
Estimating the posterior distribution
● Goal: Find the normalizing constant of p(f1, . . . , f10 | θ) p(θ)
● Result: Full distribution
● Coin example: p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|)
The three approaches are closely connected.
The posterior distribution is
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|).
Recall that the mode of Beta(θ | a, b) is (a − 1) / (a + b − 2), for a, b > 1.
We see that the MAP solution is the mode of the posterior distribution:
θ_MAP = (a + |T| − 1) / (a + |T| + b + |H| − 2).
If we choose a uniform prior (i.e. a = b = 1) we obtain the MLE solution:
θ_MAP = |T| / (|T| + |H|) = θ_MLE.
All these nice formulas are a consequence of choosing a conjugate prior. Had we
chosen a non-conjugate prior, θ_MAP and p(θ | f1, . . . , f10) might not have a closed form.
How many flips?
We had the posterior p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|).
Visualize the posterior (for given prior, e.g. a = b = 1):
With more data the posterior becomes more peaky – we are more certain
about our estimate of θ
Alternative view: a frequentist perspective
For MLE we had θ_MLE = |T| / (|T| + |H|).
Clearly, we get the same result for |T| = 1,|H| = 4 and |T| = 10,|H| = 40.
Which one is better? Why?
How many flips? Hoeffding’s inequality gives a sampling complexity bound:
p(|θ_MLE − θ| ≥ ε) ≤ 2 exp(−2Nε²),
where N = |T| + |H|.
For example, I want to know θ within ε = 0.1 error with probability at least 1 − δ = 0.99.
We have: requiring 2 exp(−2Nε²) ≤ δ gives N ≥ ln(2/δ) / (2ε²) = ln(200) / 0.02 ≈ 265 flips.
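The same computation as a short script (ε and δ as in the example above):

```python
import math

eps, delta = 0.1, 0.01                 # desired accuracy and failure probability
N = math.ceil(math.log(2 / delta) / (2 * eps**2))
print("flips needed:", N)              # prints 265
```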
Predicting the Next Flip
Remember that we want to predict the next coin flip...
For MLE:
1. Estimate θ_MLE = |T| / (|T| + |H|) from the data.
2. The probability that the next flip lands tails is θ_MLE.
For MAP:
1. Estimate θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2) from the data.
2. The probability that the next flip lands tails is θ_MAP.
What if we have the entire posterior?
We have estimated the posterior distribution p(θ | f1, . . . , f10) of the parameter θ.
Now, we want to compute the probability that the next coin flip is T, given the
observations f1, . . . , f10 and the prior belief a, b:
p(f11 = T | f1, . . . , f10, a, b)
This distribution is called the posterior predictive distribution.
It is different from the posterior over the parameters p(θ | f1, . . . , f10)!
So how do we obtain the posterior predictive distribution?
For simplicity, denote the outcome of the next flip as f ∈ {T, H}.
We already know the posterior over the parameters, p(θ | f1, . . . , f10).
Using the sum rule of probability (law of total probability):
p(f | f1, . . . , f10) = ∫ p(f, θ | f1, . . . , f10) dθ    (“reverse” marginalization)
= ∫ p(f | θ, f1, . . . , f10) · p(θ | f1, . . . , f10) dθ    (chain rule of probability)
= ∫ p(f | θ) · p(θ | f1, . . . , f10) dθ    (conditional independence)
The last equality follows from the conditional independence assumption:
“If we know θ, the next flip f is independent of the previous flips f1, . . . , f10.”
Recall that
p(f = T | θ) = θ
and
p(θ | f1, . . . , f10) = Beta(θ | a + |T|, b + |H|).
Substituting these expressions and doing some algebra we get
p(f = T | f1, . . . , f10, a, b) = (a + |T|) / (a + b + |T| + |H|)
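The omitted algebra is just the mean of a Beta distribution; a short sketch:

```latex
\begin{align*}
p(f = T \mid f_1,\ldots,f_{10}, a, b)
 &= \int_0^1 \theta \,\mathrm{Beta}(\theta \mid a+|T|,\, b+|H|)\,\mathrm{d}\theta
  = \mathbb{E}[\theta \mid f_1,\ldots,f_{10}]\\
 &= \frac{a+|T|}{a+|T|+b+|H|}
\end{align*}
% using the known mean of a Beta(\alpha, \beta) distribution, \alpha / (\alpha + \beta).
```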
Note that the posterior predictive distribution doesn’t contain θ — we have marginalized it out!
We call this approach fully Bayesian analysis.
Predictions using different approaches
● MLE: θ_MLE = |T| / (|T| + |H|)
● MAP: θ_MAP = (|T| + a − 1) / (|T| + |H| + a + b − 2)
● Fully Bayesian: p(f = T | f1, . . . , f10, a, b) = (|T| + a) / (|T| + |H| + a + b)
Given the prior a = b = 5 and the counts |T| = 4, |H| = 8:
MLE = 4/12 ≈ 0.33,   MAP = 8/20 = 0.40,   fully Bayesian = 9/22 ≈ 0.41.
How about if we have |T| = 304, |H| = 306?
MLE = 304/610 ≈ 0.498,   MAP = 308/618 ≈ 0.498,   fully Bayesian = 309/620 ≈ 0.498.
As we observe lots of data, the differences in predictions become less
noticeable.
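A small script (my own sketch) that reproduces these numbers from the three formulas above:

```python
def predictions(n_tails, n_heads, a, b):
    """MLE, MAP, and fully Bayesian predictions for p(next flip = T)."""
    mle = n_tails / (n_tails + n_heads)
    map_estimate = (n_tails + a - 1) / (n_tails + n_heads + a + b - 2)
    fully_bayesian = (n_tails + a) / (n_tails + n_heads + a + b)
    return mle, map_estimate, fully_bayesian

for n_tails, n_heads in [(4, 8), (304, 306)]:
    mle, map_est, bayes = predictions(n_tails, n_heads, a=5, b=5)
    print(f"|T|={n_tails:3d} |H|={n_heads:3d}  MLE={mle:.3f}  MAP={map_est:.3f}  Bayes={bayes:.3f}")
```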
Questions
● What is the connection between prior, likelihood and the posterior?
What about the graphical interpretation?
● What does a conjugate prior for a specific likelihood mean?
● What is the main difference between MAP and the fully Bayesian approach?
What additional information do we have in the second approach?
● What information does Hoeffding’s inequality provide?
References
● ”Machine Learning: A Probabilistic Perspective” by Murphy
— chapters 3.1 - 3.3, 4.2, 4.5
● “Pattern Recognition and Machine Learning” by Bishop
— chapters 2.1 - 2.4 (more importantly, 2.3-2.4)
● Slides adapted from Stephan Günnemann’s Machine
Learning Course